Assessing the Illusion of Independence: How spatial econometrics and network analysis can strengthen
Manuel S. González Canché
HIGHER EDUCATION is a complex field of study encompassing areas of research as diverse as student affairs, student financial aid, policy and politics, state and federal financing, sector effects, comparative studies, and student migration, as well as student college choice, access, persistence, and success. The common denominator that must guide all research on higher education issues is the desire to reach a rigorous and robust understanding of the best decision-making strategies that positively influence the likelihood of success or improve future prospects of the units of analysis (i.e., students, faculty, institutions, sectors, state financing, loan repayment, and decreased debt burden). Regrettably, it seems that the prevalent characteristic surrounding decision making regarding higher education policy and practice is a lack of support based on research findings.
There are at least two factors that may be driving this disconnect between decision-making and research findings. The first is the typical time-lag between research paper submission, peer-reviewed evaluation, acceptance, and eventual publication. The second factor relates to the focus of higher education research, which tends to study the results associated with policy and practice decisions rather than to guide these decisions in the first place. This situation makes research and evaluation difficult to distinguish in many instances, a discussion that goes well beyond this brief essay.
A point worth noting is that regardless of the purpose of research on higher education (i.e., informing and assessing decision processes that affect different actors or analyzing programs and policies), researchers should always try to obtain the strongest possible evidence to help improve our understanding of the issue under study. While qualitative research is truly needed and valued, the current essay focuses on the role of quantitative analysis in higher education research and particularly highlights spatial dependence issues and their not-so-apparent relationship with network analysis.
The illusion of independence among units of analysis (e.g., students, institutions, states) is at the heart of traditional statistical models. This assumption that units are not affected by their immediate contexts, peers, or both more often than not leads researchers to make inferential claims based on biased estimations. Accordingly, this discussion is timely and relevant because lack of independence among units of analysis based on spatial proximity is one of the least frequently addressed problems known to render biased results.
Indeed, these biased results can lead to (a) the implementation of programs, strategies, and policies that may potentially have hurtful and/or unexpected consequences, or (b) the inaccurate assessment of the effects of previously implemented programs. In this view, the overt assessment of spatial dependence issues and their broader implementation in higher education represents a unique opportunity to strengthen the inferences and analyses conducted in this applied field of study.
The purpose of this essay is threefold. First, it calls attention to the ways in which the lack-of-independence issues may lead to upward (i.e., more pronounced than in reality) or downward (i.e., less pronounced than in reality) estimations of factors influencing variation in the outcomes of interest. The second purpose is to discuss two closely interrelated yet thus far disconnected perspectives that enable testing and, if necessary, correcting for this lack of independence (i.e., network analysis and spatial econometrics/geostatistics). The third purpose is to discuss two studies in which issues of lack of independence have been addressed in higher education research relying on analytic techniques that are yet to be broadly implemented in higher education research.
What is “Spatial Dependence” and why should we be concerned about it?
IN 1970, WALDO TOBLER stated that “everything is related to everything else, but near things are more related than distant things” (Tobler, 1970, p. 236). This statement is now known as the first law of geography and straightforwardly captures the notion of spatial dependence. A real-life situation in which spatial dependence is observed is in the price-setting of consumer products, such as gasoline. To exemplify this scenario, let us assume that there are three gas stations located close to one another, as shown in Figure 1 below. For the sake of simplicity, let us focus on gas stations A and B. In this scenario, customers will tend to select the less expensive option, assuming that the quality of the product is similar. Figure 1 shows that the tan gasoline station (A) charges less and has a similar quality to its closest competitor (B); consequently, station A is selected more often than station B, yet each gas station actively affects the prices that its competitor can charge. That is, if station B decreases its prices, station A would be forced to adjust its prices accordingly in order to remain a competitive option and vice versa.
The issue of spatial dependence becomes relevant when peer institutions (e.g., gas stations) influence the variation of the outcome of interest (the price of gasoline). For instance, let us assume that we are interested in modeling the factors driving variation in gas prices but we rely on regression models that assume independence of gas prices across gas stations. In this case, we would ignore the fact that the gas prices charged by gas stations located in close proximity to one another are not independent, and that the gas prices they charge will tend to covary.
Conceptually speaking, then, the main issue resulting from spatial dependence is that units of analysis may appear to have better or worse outcomes than in reality. Extrapolating the spatial dependence idea to higher education tuition price-setting, one can argue that if an institution of higher education i (IHEi) is the neighbor of IHEj and IHEj performs extremely well in charging higher tuition amounts, spatial autocorrelation will lead researchers to estimate better outcomes for IHEi, regardless of this institution's actual abilities to charge higher or lower tuition prices, therefore reaching biased estimates. This idea has been applied to research on the higher education tuition price-setting process in four recent studies (González Canché, 2014, 2016a, 2016b; McMillen et al., 2007). In all four cases, the authors provided evidence that tuition prices (McMillen et al., 2007) charged to nonresident students across different types of institutions (González Canché, 2014, 2016a) depend on the prices charged by the closest neighboring institutions. Accordingly, these studies relied on spatial econometric approaches to address this 'spatial dependence' problem.
In sum, spatial dependence of model outcomes represents an important limitation faced by naïve regression models, which effectively ignore the extent to which the outcomes of a unit of analysis (e.g., an institution) will be influenced by its neighboring institutions’ outcomes (and vice versa). Accordingly, the assumption that units’ or institutions’ outcomes are geographically independent constitutes a serious violation of one of the fundamental assumptions of standard regression models (Bivand, Pebesma, & Gómez-Rubio, 2013; Cressie, 2015; Schabenberger & Gotway, 2004). Given the availability of data that can be geocoded (or, in the case of IPEDS, are already geocoded), research in higher education must begin testing and, if needed, correcting for spatial dependence of outcome variables before final model specification.
The link between Spatial Dependence and Network Analysis
IT IS WORTH NOTING that the notions of “closeness” and dependence have been successfully identified and applied to temporal data wherein researchers know that values measured close together in time are more similar than values measured further apart in time above and beyond random chance. This issue is referred to as serial autocorrelation and implies that a given outcome Y is likely to have more in common (depend or covary) with the immediately preceding observation (Yt-1) than with the observation measured 10 time units previously (Yt-10), for example. The measurement of time can be recorded in days or months, but in higher education research it is typically documented in years.
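The idea that an observation has more in common with its immediate predecessor than with one measured 10 periods earlier can be illustrated with a short sketch. The series below is simulated and purely hypothetical (an autoregressive process in which each value carries over 80% of the previous one), and the function name `lag_autocorrelation` is an assumption introduced only for this illustration.

```python
import numpy as np

def lag_autocorrelation(y, lag):
    """Sample autocorrelation of the series y at a given positive lag."""
    y = np.asarray(y, dtype=float)
    centered = y - y.mean()
    # Covariation between the series and itself shifted by `lag`,
    # scaled by the total variation of the series.
    return float(np.sum(centered[lag:] * centered[:-lag]) / np.sum(centered ** 2))

# Hypothetical series: each value carries over 80% of the previous value
# plus random noise (not data from any study cited in this essay).
rng = np.random.default_rng(0)
y = np.zeros(500)
for t in range(1, len(y)):
    y[t] = 0.8 * y[t - 1] + rng.normal()

r1 = lag_autocorrelation(y, 1)    # dependence on the preceding observation
r10 = lag_autocorrelation(y, 10)  # dependence on the observation 10 periods back
print(f"lag 1: {r1:.2f}, lag 10: {r10:.2f}")
```

In a series of this kind, the lag-1 value should be clearly larger than the lag-10 value, which is the pattern serial autocorrelation describes.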
When it comes to spatial data, however, the notion of spatial autocorrelation is considerably less frequently used in higher education research. Interestingly, the rationale is similar to the idea of closeness in time. More specifically, contemporaneous spatially autocorrelated observations (i.e., observations recorded in the same period of time and in close spatial proximity) are likely to be more similar than values measured farther away from each other in space.
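One widely used summary of spatial autocorrelation is Moran's I, a statistic this essay does not name but which follows the same logic: it compares the outcomes of units treated as neighbors against the overall variation in the outcome. The sketch below is a minimal illustration under stated assumptions. The coordinates, the two outcomes, and the 1.5-unit distance cutoff used to define neighbors are all hypothetical, and `morans_i` is a helper written for this example rather than a library function.

```python
import numpy as np

def morans_i(values, weights):
    """Moran's I for outcome `values` under a neighbor-weight matrix `weights`."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    z = values - values.mean()
    n = len(values)
    # Cross-products of deviations for linked pairs, relative to total variation.
    return float((n / weights.sum()) * (z @ weights @ z) / (z @ z))

# Hypothetical data: 200 units placed at random points in a 10-by-10 area.
rng = np.random.default_rng(42)
coords = rng.uniform(0, 10, size=(200, 2))

# Two units are treated as neighbors when they lie within 1.5 distance units
# (an arbitrary cutoff chosen for this sketch); no unit is its own neighbor.
dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
weights = ((dist < 1.5) & (dist > 0)).astype(float)

# One outcome varies smoothly across space; the other is unrelated to location.
clustered = coords[:, 0] + coords[:, 1] + rng.normal(scale=1.0, size=200)
unrelated = rng.normal(size=200)

i_clustered = morans_i(clustered, weights)
i_unrelated = morans_i(unrelated, weights)
print(f"clustered: {i_clustered:.2f}, unrelated: {i_unrelated:.2f}")
```

Values well above zero indicate that neighboring units resemble one another more than chance alone would suggest, while values near zero are consistent with spatial independence. A formal test would also require a reference distribution (for example, from random permutations), which this sketch omits.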
Going back to Figure 1, the gas price charged by gas station A is expected to be more closely related to the gas price of its closest neighbor, gas station B, which is located within a shorter distance (i.e., within 0.25 miles), than to the gas price at gas station C, which is situated farther away (i.e., 0.75 miles away). Indeed, the notion of distance can be broadened to include emotional, affective, or ascription measures (such as taking the same class, belonging to the same consortium, etc.), as discussed below.
Although the details of model specification are beyond the scope of this essay, it is important to note that, conceptually speaking, the information contained in Figure 1 can be represented as a matrix of spatial dependence in which the row and column intersection between two units of analysis contains a one if a given condition is met and a zero otherwise. (This same matrix may also contain values greater than 1, representing some form of strength of relationship. For the purposes of this essay, we will only deal with dichotomous (1,0) links.) The criterion used to create the arrows shown in Figure 2 is referred to as a 1-k neighboring specification, in which a connection is only established between a particular gas station and the station that is within its closest proximity.
This 1-k criterion has directionality. For example, gas stations A and B are connected with a bidirectional arrow because, for both stations, the other is their closest unit. In the case of gas station C, the arrow goes from C to B, but there is no arrow from B to C given that for B the closest station is A. It is also evident that there is no link between gas stations A and C. The matrix representation of this figure is shown in the following array:
| | Gas Station A | Gas Station B | Gas Station C |
| Gas Station A | 0 | 1 | 0 |
| Gas Station B | 1 | 0 | 0 |
| Gas Station C | 0 | 1 | 0 |
Note that the diagonal of the array accounts for self-selection, which in the case of spatial analysis is not allowed. Note further that the link between row B and column C is 0, but the intersection between row C and column B contains a 1. This is because in the matrix of influence, rows are assumed to “send” links to columns. In this case, gas station C is “sending” a link to its closest neighbor B, but gas station B is not returning this connection given that its closest neighbor is station A, not station C. Once this matrix has been defined and operationalized, the analyst can easily test for spatial dependence based on proximity, wherein the outcome variables of units that have nonzero intersecting cells will be tested for dependence.
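The nearest-neighbor rule described above can be sketched in a few lines. The coordinates are assumptions made for illustration: the three stations are placed along a single road, with B 0.25 miles from A and C 0.75 miles from A, which puts C 0.50 miles from B. The function name `nearest_neighbor_matrix` is likewise hypothetical.

```python
import numpy as np

def nearest_neighbor_matrix(coords):
    """Binary matrix in which row i sends a link to its single closest unit."""
    coords = np.asarray(coords, dtype=float)
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    # A unit cannot select itself, so the diagonal is excluded from the search.
    np.fill_diagonal(dist, np.inf)
    matrix = np.zeros(dist.shape, dtype=int)
    matrix[np.arange(len(coords)), dist.argmin(axis=1)] = 1
    return matrix

# Assumed positions in miles along one road (not taken from Figure 1 itself).
stations = ["A", "B", "C"]
coords = [[0.00, 0.0], [0.25, 0.0], [0.75, 0.0]]

W = nearest_neighbor_matrix(coords)
for name, row in zip(stations, W):
    print(name, row)
# A [0 1 0]
# B [1 0 0]
# C [0 1 0]
```

Under these assumed positions, the rows reproduce the pattern described in the text: A and B select each other, C sends a link to B, and B does not return it.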
In this respect, it is worth noting that network analysis principles follow the same procedure in terms of matrix representation to create sociograms or visualizations of units’ connections in the network, as shown in Figure 2. An important characteristic of network analysis is that the only source of information used to create Figure 2 is shown in the array. That is, while distance between units was used to create the links or connections, this distance measure is no longer used to plot the network representation. Rather, the network analysis layout algorithm employed detected that gas stations A and B must be “closer” to one another because they selected each other, and that station C should be placed farther away given that station C was not selected by either A or B.
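To illustrate how the array alone carries the information a network representation needs, the sketch below reads the three-station matrix as a directed network. The matrix values follow the links described in the text; the summaries computed here (in-degree, an edge list, and mutual ties) are generic network quantities chosen for illustration, not the specific layout algorithm behind Figure 2.

```python
import numpy as np

stations = ["A", "B", "C"]
# Rows send links to columns, following the nearest-neighbor rule in the text.
W = np.array([
    [0, 1, 0],  # A selects B
    [1, 0, 0],  # B selects A
    [0, 1, 0],  # C selects B
])

# In-degree: how many times each station is selected by others (column sums).
in_degree = dict(zip(stations, W.sum(axis=0).tolist()))

# Directed edge list, the form many network tools accept as input.
edges = [(stations[i], stations[j]) for i, j in zip(*np.nonzero(W))]

# A tie is mutual when both W[i, j] and W[j, i] equal 1.
mutual = [
    (stations[i], stations[j])
    for i in range(len(stations))
    for j in range(i + 1, len(stations))
    if W[i, j] == 1 and W[j, i] == 1
]

print(in_degree)  # {'A': 1, 'B': 2, 'C': 0}
print(edges)      # [('A', 'B'), ('B', 'A'), ('C', 'B')]
print(mutual)     # [('A', 'B')]
```

Station C's in-degree of zero and the single mutual tie between A and B are the two facts a layout algorithm can use to draw A and B close together and C farther away, without any reference to physical distance.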
The most important difference between network analysis and spatial econometrics is the definition of the matrix of influence. In the latter, this definition is based on physical distance between units, while in network analysis it is based on social links, such as friendship, taking classes together, ascription to a reading club, being suspended together, or enrolling in the same higher education institution, to mention some examples. Nonetheless, as depicted in Figure 2, both procedures can be easily merged to render similar results.
From a practical point of view, it is important to note that the set of analytic techniques available to conduct spatial econometric analyses is transferable to the analysis of networks, which makes it possible to account for dependence issues in the analysis of units associated among themselves. This notion was implemented in a recent study of community college students’ credit-taking patterns where González Canché and Rios-Aguilar (2015) tested whether credit accumulation was associated with the average credit accumulation attained by classmates in a large community college located in California. The matrix of influence was established following the same rationale used to build the one represented in the array shown above. In this case, a student had a connection with another student if she or he took a class with this student during the two years of individual panel data collection. The modeling approach we employed in that study allowed for the empirical testing of peer effects in community colleges and allowed for inferences that were robust to lack of independence issues. Notably, we found that peer effects are particularly strong for underrepresented minority men attending community college. For underrepresented minority female students, peer effects were not significantly associated with their credit accumulation because, at least in this community college, minority female students accumulated more credits than everyone else in the “network.”
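The quantity at the center of such an analysis, the average outcome among a student's classmates, can be obtained directly from a matrix of influence. The sketch below is a minimal illustration with entirely hypothetical data for five students. It shows only how a peer average is computed and should not be read as the model estimated in the study described above.

```python
import numpy as np

# Hypothetical co-enrollment matrix for five students: a 1 means the two
# students shared at least one class. Taking a class together is mutual,
# so this matrix is symmetric, unlike the nearest-neighbor array.
A = np.array([
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
], dtype=float)

# Hypothetical credits accumulated by each student.
credits = np.array([30.0, 24.0, 18.0, 12.0, 6.0])

# Row-normalize so that each row sums to 1; multiplying by the outcome
# vector then yields the average credits among each student's classmates.
row_sums = A.sum(axis=1, keepdims=True)
W = np.divide(A, row_sums, out=np.zeros_like(A), where=row_sums > 0)
peer_average = W @ credits

print(np.round(peer_average, 2))
```

For instance, the first student shares classes with the second and third students, so the corresponding peer average is (24 + 18) / 2 = 21 credits. In a spatial or network regression, a vector of this kind enters the model as the term that captures dependence among connected units.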
In another recent study, I applied network principles to capture student migration patterns at the population level, demonstrating the spatial dependence of these patterns across neighboring states and assessing the extent to which these mobility patterns were related to tuition-setting behaviors across different institutional sectors within a given state. I arrived at the conclusion that since both states’ abilities in attracting students and the institutions’ tuition-setting behaviors are affected by geographic location, future studies should consider implementing methods that account for spatial dependence before conducting final model estimation.
These two examples help to justify the need for more studies that employ analytic techniques to investigate lack of independence issues in higher education settings. The possibility of accounting for dependence is important given that, as highlighted at the beginning of this essay, many areas of research in higher education to a great extent continue to assume independence of the units of analysis. Such an assumption should no longer be valid per se given the availability of analytic techniques capable of testing and, if necessary, correcting for dependence issues based on spatial or social proximity.
In returning to the purpose of this essay, it is clear that units of analysis more often than not are not independent from one another and that their level of interaction (based on spatial proximity, friendship, membership in a common organization) most likely influences these units’ outcomes. Fortunately, the rather problematic independence assumption can and should be tested relying on analytic techniques designed to address the bias associated with dependence among units of analysis. In this view, the overarching goal of this essay was to highlight the general mechanisms through which spatial econometrics and network analysis can be brought together to test for and, if necessary, correct for spatial and non-spatial dependence before final model estimation, in the hope of reaching more accurate depictions of factors affecting the outcomes of interest. Nonetheless, recall that although spatial and network analysis represent a window of opportunity to strengthen researchers’ estimations, the use of sophisticated and rather complex methods simply for the sake of using “fancy methods” is useless in an applied field. In this vein, if after assessing for dependence issues, results show no need for its correction, researchers should always go with the simplest model. Conversely, estimating the simplest model without testing for potential bias should no longer be acceptable.
In closing, it is worth noting that the incorporation of analytic techniques such as spatial dependence and network analysis to an applied field such as higher education should always go beyond mathematical strength and should ultimately aim for the improvement of our understanding of the phenomenon under study, thus revealing the best possible actions to be taken to impact positively on the lives of participants.
Bivand, R., Pebesma, E., & Gómez-Rubio, V. (2013). Applied spatial data analysis with R. New York: Springer.
Cressie, N.A.C. (1993). Front matter. In Statistics for spatial data (rev. ed.). Hoboken, NJ: John Wiley & Sons, Inc. doi: 10.1002/9781119115151.fmatter.
González Canché, M.S., & Rios-Aguilar, C. (2015). Critical social network analysis in community colleges: Peer effects and credit attainment. New Directions for Institutional Research, 163, 75-91. doi: 10.1002/ir.20087.
González Canché, M.S. (2014). Price-setting and the localized non-resident student market. Economics of Education Review, 43, 21-35.
González Canché, M.S. (2016a). Geographical network analysis and spatial econometrics as tools to enhance our understanding of student migration patterns and benefits in the U.S. higher education network. Review of Higher Education.
González Canché, M.S. (2016b). The heterogeneous nonresident student body: Measuring the effect of out-of-state students’ home-state wealth on tuition and fee price variations. Research in Higher Education, 1-43. doi: 10.1007/s11162-016-9422-2.
McMillen, D., Singell Jr, L., & Waddell, G. (2007). Spatial competition and the price of college. Economic Inquiry, 45(4), 817-833.
Schabenberger, O., & Gotway, C.A. (2004). Statistical methods for spatial data analysis. New York: Chapman and Hall/CRC Press.
Tobler, W.R. (1970). A computer model simulating urban growth in the Detroit region. Economic Geography, 46, 234-240.
About the Author
MANUEL GONZÁLEZ CANCHÉ, assistant professor of higher education, joined the faculty of the Institute of Higher Education in 2012, immediately after graduating from the Center for the Study of Higher Education at the University of Arizona. Earlier in his educational career, he earned a bachelor’s degree in educational research and a master’s degree in higher education and quantitative methods from esteemed universities in Mexico, his home country.
González Canché relies on the use of quantitative analytic techniques to address what he considers topics with clear policy implications in higher education. His research follows two different, yet interconnected paths. The first can be broadly classified into issues of access, persistence and success, with emphasis on institutional sector effects on students’ outcomes. The second focuses on higher education finance, with emphasis on spatial modeling and student migration.
González Canché’s research employs data-visualization methods, including geographical information systems, representation of real-world social networks, and text mining techniques. In related work, he aims to harness the mathematical power of network analysis to find structure in written content and is proposing an analytic method (Network Analysis of Qualitative Data) that blends quantitative, mathematical and qualitative principles to analyze text data – an approach yet to be broadly implemented in education research.
As a first-generation college student and graduate himself, González Canché has a special research interest in factors and policies enhancing underrepresented students’ opportunities for educational success. His findings challenge traditional ideas about the negative impacts of community college enrollment on subsequent educational attainments.
He has secured funding for research from the Spencer Foundation, the American Educational Research Association/National Science Foundation, the Association for Institutional Research and the Institute of Education Sciences.