Why var is used for geographical data
Examining the relationship between this stable surrounding average and the focal AirBnB, we can even find clusters in our model error. Recalling the local Moran statistics in Chapter 7, we can identify certain areas where our predictions of the nightly log AirBnB price tend to be significantly off. These areas tend to be locations where our model significantly under-predicts the nightly AirBnB price both for that specific observation and observations in its immediate surroundings.
This is critical since, if we can identify how these areas are structured — if they have a consistent geography that we can model — then we might make our predictions even better, or at least not systematically mis-predict prices in some areas while correctly predicting prices in other areas. Since significant under- and over-predictions do appear to cluster in a highly structured way, we might be able to use a better model to fix the geography of our model errors.
There are many different ways that spatial structure shows up in our models, predictions, and our data, even if we do not explicitly intend to study it. Fortunately, there are nearly as many techniques, called spatial regression methods, that are designed to handle these sorts of structures. Spatial regression is about explicitly introducing space or geographical context into the statistical framework of a regression. Conceptually, we want to introduce space into our model whenever we think it plays an important role in the process we are interested in, or when space can act as a reasonable proxy for other factors we cannot but should include in our model.
As an example of the former, we can imagine how houses at the seafront are probably more expensive than those in the second row, given their better views. Spatial regression is a large field of development in the econometrics and statistics literatures.
In this brief introduction, we will consider two related but very different processes that give rise to spatial effects: spatial heterogeneity and spatial dependence. Before diving into them, we begin with another approach that introduces space in a regression model without modifying the model itself but rather creates spatially explicit independent variables.
Often, this reflects the fact that processes are not the same everywhere in the map of analysis, or that geographical information may be useful to predict our outcome of interest. We discuss spatial feature engineering extensively in Chapter 12, though, and the depth and extent of spatial feature engineering is difficult to overstate. One relevant proximity-driven variable that could influence our San Diego model is based on each listing's proximity to Balboa Park.
A common tourist destination, Balboa Park is a central recreation hub for the city of San Diego, containing many museums and the San Diego Zoo. Thus, it could be the case that people searching for AirBnBs in San Diego are willing to pay a premium to live closer to the park. If this were true and we omitted this from our model, we may indeed see a significant spatial pattern caused by this distance decay effect. Therefore, this is sometimes called a spatially-patterned omitted covariate: geographic information our model needs to make good predictions which we have left out of our model.
First, though, it helps to visualize the structure of this distance covariate itself. To run a linear model that includes the additional variable of distance to the park, we add the name to the list of variables we included originally.
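As a rough illustration of how such a distance covariate can be built and added to a regression, the sketch below uses synthetic listings rather than the San Diego data. The coordinates, the approximate park location, the `beds` variable and the coefficients used to simulate prices are all assumptions made for the example, and the fit uses plain NumPy least squares instead of the modeling library used in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic listing coordinates in degrees (assumed, not the real listings)
n = 200
lon = rng.uniform(-117.25, -117.05, n)
lat = rng.uniform(32.68, 32.85, n)
park_lon, park_lat = -117.15, 32.73  # approximate location assumed for Balboa Park


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometres between points given in degrees."""
    lon1, lat1, lon2, lat2 = map(np.radians, (lon1, lat1, lon2, lat2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))


# The engineered feature: distance from every listing to the park
d2park = haversine_km(lon, lat, park_lon, park_lat)

# Simulated outcome in which price decays with distance (an assumption)
beds = rng.integers(1, 5, n)
log_price = 4.0 + 0.3 * beds - 0.05 * d2park + rng.normal(0, 0.1, n)

# Adding the covariate is just one more column in the design matrix
X = np.column_stack([np.ones(n), beds, d2park])
coefs, *_ = np.linalg.lstsq(X, log_price, rcond=None)
```

Because the prices here are simulated with a built-in distance decay, the estimated coefficient on `d2park` comes out negative by construction; real data need not behave this way.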
When you inspect the regression diagnostics and output, you see that this covariate is not quite as helpful as we might anticipate. It is not statistically significant at conventional significance levels, and the model fit does not substantially change. Finally, the distance to Balboa Park variable does not fit our theory about how distance to an amenity should affect the price of an AirBnB; the coefficient estimate is positive, meaning that people are paying a premium to be further from the park.
We will revisit this result later on, when we consider spatial heterogeneity and will be able to shed some light on this. Further, the next chapter is an extensive treatment of spatial fixed effects, presenting many more spatial feature engineering methods. Here, we have only shown how to include these engineered features in a standard linear modeling framework. Our approach in that case was to incorporate space through a very specific channel, that is the distance to an amenity we thought might be influencing the final price.
However, not all neighborhoods have the same house prices; some neighborhoods may be systematically more expensive than others, regardless of their proximity to Balboa Park. If this is our case, we need some way to account for the fact that each neighborhood may experience these kinds of gestalt, unique effects. One way to do this is by capturing spatial heterogeneity.
At its most basic, spatial heterogeneity means that parts of the model may vary systematically with geography, changing in different places. We deal with the first two in this section. To illustrate them, let us consider the house price example from the previous section.
The rationale goes as follows. Given we are only including a few explanatory variables in the model, it is likely we are missing some important factors that play a role in determining the price at which a house is sold. Some of them, however, are likely to vary systematically over space. If that is the case, we can control for those unobserved factors by using traditional binary variables but basing their creation on a spatial rule.
For example, let us include a binary variable for every neighborhood, indicating whether a given house is located within that area (1) or not (0). Mathematically, this amounts to fitting an equation in which each neighborhood has its own constant term. Programmatically, we will show two different ways we can estimate this: one, using statsmodels; and two, with PySAL.
The statsmodels package provides a formula-like API, which allows us to express the equation we wish to estimate directly. Critically, note that the trailing -1 term means that we are fitting this model without an intercept term. This is necessary, since including an intercept term alongside unique means for every neighborhood would make the underlying system of equations underspecified: the neighborhood dummies add up to the intercept column, so the parameters could not be separately identified. Using this expression, we can estimate the unique effects of each neighborhood, fitting the model in statsmodels (note how the specification of the model, formula and data, is separated from the fitting step).
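To see why the intercept has to be dropped, it can help to build the design matrix by hand. The following is a minimal sketch with six made-up listings in three hypothetical neighborhoods (labelled A, B and C); it does not use the formula API and only checks the rank of the matrices involved.

```python
import numpy as np

# Hypothetical neighborhood labels for six listings
neighborhood = np.array(["A", "A", "B", "B", "C", "C"])
groups = np.unique(neighborhood)

# One 0/1 column per neighborhood
dummies = (neighborhood[:, None] == groups[None, :]).astype(float)
intercept = np.ones((len(neighborhood), 1))

# Intercept plus a full set of dummies: four columns in total
with_intercept = np.hstack([intercept, dummies])

# The dummy columns sum to the intercept column, so one column is redundant
rank_with = np.linalg.matrix_rank(with_intercept)
rank_without = np.linalg.matrix_rank(dummies)
```

Here `with_intercept` has four columns but only three of them are linearly independent, while the dummies alone have full column rank, which is the situation the no-intercept specification corresponds to.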
We could rely on the summary2 method to print a similar summary report from the regression but, given it is a lengthy one in this case, we will illustrate how you can extract the spatial fixed effects into a table for display. The approach above shows how spatial FE are a particular case of a linear regression with a categorical variable. Neighborhood membership is modeled using binary dummy variables. Thanks to the formula grammar used in statsmodels, we can express the model abstractly, and Python parses it, appropriately creating binary variables as required.
This framework allows the user to specify which variables are to be estimated separately for each group. In this case, instead of describing the model in a formula, we need to pass each element of the model as separate arguments.
Similarly as above, we could rely on the summary attribute to print a report with all the results computed. For simplicity here, we will only confirm that, to the 12th decimal, the parameters estimated are indeed the same as those we get from statsmodels.
Econometrically speaking, what the neighborhood FEs we have introduced imply is that, instead of comparing all house prices across San Diego as equal, we only derive variation from within each neighborhood. By including a single variable for each area, we are effectively forcing the model to compare as equal only house prices that share the same value for each variable; or, in other words, only houses located within the same area.
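One way to make the "within each area" idea concrete is to check that a regression with a dummy per group gives the same slope as a regression on data that has been demeaned within each group. The sketch below does this on simulated data; the group structure, the single explanatory variable `x` and the true slope of 0.5 are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n_groups, per_group = 5, 40
group = np.repeat(np.arange(n_groups), per_group)

# Simulated data: the group effect shifts both x and y (assumed structure)
group_effect = rng.normal(0, 1, n_groups)
x = rng.normal(0, 1, group.size) + group_effect[group]
y = 2.0 + group_effect[group] + 0.5 * x + rng.normal(0, 0.1, group.size)

# (a) Regression with one dummy per group and no intercept
dummies = (group[:, None] == np.arange(n_groups)[None, :]).astype(float)
X_fe = np.column_stack([dummies, x])
beta_fe = np.linalg.lstsq(X_fe, y, rcond=None)[0]
slope_dummies = beta_fe[-1]


# (b) Same slope from data demeaned within each group
def demean(v):
    means = np.bincount(group, weights=v) / np.bincount(group)
    return v - means[group]


x_w, y_w = demean(x), demean(y)
slope_within = (x_w @ y_w) / (x_w @ x_w)
```

The two slopes agree up to floating-point error, which is another way of saying that the fixed effects leave only within-group variation to identify the coefficient on `x`.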
Introducing FE affords a higher degree of isolation of the effects of the variables we introduce in the model because we can control for unobserved effects that align spatially with the distribution of the FE introduced by neighborhood, in our case. To make a map of neighborhood fixed effects, we need to process the results from our model slightly.
Then, we need to extract just the neighborhood name from the index of this Series. A simple way to do this is to strip all the characters that come before and after our neighborhood names. The cleaned names allow us to join the estimates to an auxiliary file with neighborhood boundaries that is indexed on the same names.
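As a small sketch of this cleaning step, the snippet below pulls the name out of labels that are assumed to follow the pattern `neighborhood[<name>]`. Both the pattern and the example labels are assumptions for illustration; the actual labels depend on how the model was specified.

```python
import re

# Hypothetical parameter labels, assuming the pattern "neighborhood[<name>]"
labels = [
    "neighborhood[Balboa Park]",
    "neighborhood[Bay Ho]",
    "neighborhood[ocean beach]",
]

# Capture whatever sits between the opening and the closing bracket
pattern = re.compile(r"^neighborhood\[(.*)\]$")


def extract_name(label):
    """Return the text inside the brackets, or the label unchanged if no match."""
    match = pattern.match(label)
    return match.group(1) if match else label


names = [extract_name(label) for label in labels]

# For comparison: str.strip removes a *set of characters*, not a prefix,
# so it can also eat leading letters of the name itself
stripped = labels[2].strip("neighborhood[").strip("]")
```

Matching the prefix explicitly is a design choice: character-set stripping works when names start with a character outside the set (a capital letter, for instance) but silently truncates names that do not, as the last line shows.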
We can see a clear spatial structure in the SFE estimates. The most expensive neighborhoods tend to be located near the coast, while the cheapest ones are more inland. At the core of estimating spatial FEs is the idea that, instead of assuming the dependent variable behaves uniformly over space, there are systematic effects following a geographical pattern that affect its behavior.
In other words, spatial FEs introduce econometrically the notion of spatial heterogeneity. They do this in the simplest possible form: by allowing the constant term to vary geographically.
The other elements of the regression are left untouched and hence apply uniformly across space. The idea of spatial regimes (SRs) is to generalize the spatial FE approach to allow not only the constant term to vary but also any other explanatory variable.
This implies that the equation we estimate has a separate set of coefficients for each regime. The result can be explored and interpreted similarly to the previous ones. If you inspect the summary attribute, you will find the parameters for each variable mostly conform to what you would expect, across both regimes.
To compare them, we can plot them side by side in a bespoke table. An interesting question arises around the relevance of the regimes. Are estimates for each variable across regimes statistically different? For this, the model object also calculates for us what is called a Chow test. This is a statistic that tests the null hypothesis that estimates from different regimes are indistinguishable.
If we reject the null, we have evidence suggesting the regimes actually make a difference. Results from the Chow test are available on the summary attribute, or we can extract them directly from the model object, which we will do here.
There are two types of Chow test. First is a global one that jointly tests for differences between the two regimes. The first value returned represents the statistic, while the second one captures the p-value.
In this case, the two regimes are statistically different from each other. The next step then is to check whether each of the coefficients in our model differs across regimes.
For this, we can pull them out into a table. As we can see in the table, most variables do indeed differ across regimes, statistically speaking. This points to systematic differences in the data generating processes across spatial regimes.
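The logic of regimes and of a global comparison between them can be sketched with plain least squares. The example below simulates two regimes with different intercepts and slopes, fits a pooled model and one model per regime, and computes the classical F form of the Chow statistic. The data, the `coastal` indicator and the coefficients are assumptions, and the statistic reported by the library discussed in the text may be computed differently.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
coastal = rng.random(n) < 0.5  # hypothetical two-regime indicator
x = rng.normal(0, 1, n)

# Simulated outcome with a different intercept and slope in each regime
y = np.where(coastal, 5.0 + 0.8 * x, 4.0 + 0.3 * x) + rng.normal(0, 0.2, n)


def ols_rss(X, y):
    """Return OLS coefficients and the residual sum of squares."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return beta, float(resid @ resid)


X = np.column_stack([np.ones(n), x])
k = X.shape[1]  # number of coefficients per regime

# One pooled fit, then one fit per regime
beta_pooled, rss_pooled = ols_rss(X, y)
beta_a, rss_a = ols_rss(X[coastal], y[coastal])
beta_b, rss_b = ols_rss(X[~coastal], y[~coastal])

# Classical Chow F statistic: how much fit improves when regimes are split
rss_split = rss_a + rss_b
chow_f = ((rss_pooled - rss_split) / k) / (rss_split / (n - 2 * k))
```

A large value of `chow_f` relative to an F distribution with `k` and `n - 2 * k` degrees of freedom would lead to rejecting the null that both regimes share the same coefficients.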
As we have just discussed, spatial heterogeneity (SH) is about effects of phenomena that are explicitly linked to geography and that hence cause spatial variation and clustering. This encompasses many of the kinds of spatial effects we may be interested in when we fit linear regressions. However, in other cases, our focus is on the effect of the spatial configuration of the observations, and the extent to which that has an effect on the outcome we are considering.
For example, we might think that the price of a house not only depends on whether it is a townhouse or an apartment, but also on whether it is surrounded by many more townhouses than skyscrapers with more apartments. To the extent these two different spatial configurations enter the house price determination process differently, we will be interested in capturing not only the characteristics of a house, but also of its surrounding ones.
This kind of spatial effect is fundamentally different from SH in that it is not related to inherent characteristics of the geography but relates to the characteristics of the observations in our dataset and, especially, to their spatial arrangement. We call this phenomenon, by which the values of observations are related to each other through distance, spatial dependence [Ans88].
There are several ways to introduce spatial dependence in an econometric framework, with varying degrees of econometric sophistication (see [Ans02] for a good overview). In this section, we consider three ways in which spatial dependence, through spatial weights matrices, can be incorporated in a regression framework. Let us come back to the house price example we have been working with. So far, we have hypothesized that the price of a house rented in San Diego through AirBnB can be explained using information about its own characteristics as well as some relating to its location such as the neighborhood or the distance to the main park in the city.
However, it is also reasonable to think that prospective renters care about the set of neighbours a house has, not only about the house itself, and would be willing to pay more for a house that was surrounded by certain types of houses, and less if it was located in the middle of other types. How could we test this idea? When it comes to regression, the most straightforward way to introduce spatial dependence between the observations in the data is by considering not only a given explanatory variable, but also its spatial lag.
Conceptually, this approach falls more within the area of spatial feature engineering, which embeds space in a model through the explanatory variables it uses rather than the functional form of the model, and which we delve into in more detail in its own chapter. But we think it is interesting to discuss it in this context for two reasons.
And second, because it also illustrates how many of the techniques we cover in Chapter 12 can be embedded in a regression model and, by extension, in other predictive approaches.
This addition implies we are also including as an explanatory factor of the price of a given house the proportion of neighboring houses of each type. Mathematically, this implies estimating a model that adds the spatial lag of each of these variables to the original regressors. This can be conceptualized in two ways. This is useful and simple. But this interpretation blurs where this change might occur. This focal site will not be strongly affected if a neighbor changes by a single unit, since each site only contributes a small amount to the lag at the focal site.
Alternatively, consider a site with only one neighbor: its lag will change by exactly the amount its sole neighbor changes. We will discuss this in the following section. Once computed, we can run the model using OLS estimation because, in this context, the spatial lags included do not violate any of the assumptions OLS relies on (they are essentially additional exogenous variables).
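A spatial lag of an explanatory variable is just a weighted average of that variable over each site's neighbors, so it can be sketched in a few lines. The example below builds row-standardized k-nearest-neighbor weights from random coordinates, computes the lag of a made-up `is_condo` indicator, and adds it to an OLS fit. The coordinates, the choice of `k = 5` and the simulated coefficients are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 400, 5
coords = rng.random((n, 2))  # synthetic site locations

# Pairwise distances, with the diagonal excluded so no site is its own neighbor
dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
np.fill_diagonal(dist, np.inf)
nearest = np.argsort(dist, axis=1)[:, :k]

# Row-standardized weights: each of the k neighbors gets weight 1/k
W = np.zeros((n, n))
W[np.repeat(np.arange(n), k), nearest.ravel()] = 1.0 / k

# Spatial lag of a binary variable: the share of neighbors that are condos
is_condo = (rng.random(n) < 0.3).astype(float)
w_condo = W @ is_condo

# Simulated outcome, then OLS with the lag as one more exogenous column
log_price = 4.0 + 0.2 * is_condo + 0.4 * w_condo + rng.normal(0, 0.1, n)
X = np.column_stack([np.ones(n), is_condo, w_condo])
beta = np.linalg.lstsq(X, log_price, rcond=None)[0]
```

With these weights a one-unit change in a single neighbor moves the focal site's lag by `1 / k`, which is why a single neighbor contributes only a small amount when there are several neighbors and the full amount when there is only one.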
As in the previous cases, printing the summary attribute of the model object would show a full report table. The variables we included in the original regression display similar behavior, albeit with small changes in size, and can be interpreted also in a similar way.
To focus on the aspects that differ from the previous models here, we will only pull out results for the variables for which we also included their spatial lags.
More relevant to this section, any given house surrounded by condominiums also receives a price premium. Similar interpretations can be derived for all other spatially lagged variables to derive the indirect effect of a change in the spatial lag. However, it is interesting to consider this would not be the case for many other kinds of weights (like Kernel, Queen, Rook, DistanceBand, or Voronoi), where each observation has potentially a different number of neighbors.
To illustrate the effect of a change in one of the values in a given location in other locations, we will switch one of the properties into the condominium category.
Consider the third observation, which is the first apartment in the data. Now, our new prediction in the scenario where we have changed site 2 from an apartment into a condominium can be computed by translating the model equation into Python code and plugging into it the simulated values we have just created. Note the only difference between this set of predictions and the one in the original m6 model is that we have switched site 2 from apartment into condominium.
Hence, every property which is not connected to site 2 or is not site 2 itself will be unaffected. The neighbors of site 2 however will have different predictions. Now, the effect of changing site 2 from an apartment into a condominium is associated with the following changes to the predicted log price, which we calculate by subtracting the new predicted values from the original ones and subsetting only to site 2 and its neighbors.
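The scenario exercise can be illustrated on a toy layout. In the sketch below, six sites sit on a line and each neighbors the adjacent ones; the coefficients are invented, not those of the m6 model, and the difference is taken as new prediction minus original prediction.

```python
import numpy as np

# Six sites on a line; each site neighbors the adjacent ones (hypothetical layout)
n = 6
A = np.zeros((n, n))
for i in range(n - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0
W = A / A.sum(axis=1, keepdims=True)  # row-standardized weights

# Hypothetical coefficients: intercept, own condo effect, neighbors' condo share
b0, b_condo, b_w_condo = 4.0, 0.10, 0.30


def predict(is_condo):
    """Predicted log price from own type and the spatial lag of type."""
    return b0 + b_condo * is_condo + b_w_condo * (W @ is_condo)


is_condo = np.array([1.0, 0.0, 0.0, 1.0, 0.0, 0.0])
original = predict(is_condo)

# Scenario: switch site 2 (the third observation) into a condominium
scenario = is_condo.copy()
scenario[2] = 1.0
changed = predict(scenario)

# New minus original prediction; only site 2 and its neighbors move
diff = changed - original
affected = np.flatnonzero(~np.isclose(diff, 0.0))
```

Site 2 changes through its own coefficient, while sites 1 and 3 change only through the lag term, each by the lag coefficient times the weight they place on site 2. All other sites keep their original prediction.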
Introducing a spatial lag of an explanatory variable, as we have just seen, is the most straightforward way of incorporating the notion of spatial dependence in a linear regression framework. It does not require additional changes, it can be estimated with OLS, and the interpretation is rather similar to interpreting non-spatial variables, so long as aggregate changes are required.
The field of spatial econometrics however is a much broader one and has produced over the last decades many techniques to deal with spatial effects and spatial dependence in different ways. Although this might be an oversimplification, one can say that most of such efforts for the case of a single cross-section are focused on two main variations: the spatial lag and the spatial error model.
Both are similar to the case we have seen in that they are based on the introduction of a spatial lag, but they differ in the component of the model they modify and affect. Although it appears similar, this specification violates the assumptions about the error term in a classical OLS model.
Hence, alternative estimation methods are required. PySAL incorporates functionality to estimate several of the most advanced techniques developed by the literature on spatial econometrics. For example, we can use a generalized method of moments estimator that accounts for heteroskedasticity [ADKP10]. Similarly as before, the summary attribute will return a full-featured table of results. For the most part, it may be interpreted in similar ways to those above.
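For reference, the two specifications mentioned above are conventionally written as follows. The symbols are the usual ones in the spatial econometrics literature rather than notation taken from the text: $W$ is the spatial weights matrix, $\lambda$ and $\rho$ are the spatial parameters, $u$ is the spatially correlated error and $\varepsilon$ is an independent error term.

```latex
% Spatial error model: the spatial lag enters the error term
y = X\beta + u, \qquad u = \lambda W u + \varepsilon

% Spatial lag model: the spatial lag of the dependent variable is a regressor
y = \rho W y + X\beta + \varepsilon
```

In the first form the errors are no longer independent across observations, and in the second the lagged outcome is correlated with the error, which is why neither fits the classical OLS assumptions.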
Visual elements have several properties that can be used to transmit information. Depending on the case, some of them might be more suitable than others.
Figure. From left to right: position, shape, size, hue, value, texture and orientation. These properties are known as visual variables and are applied to the geometric elements used to visualize geographical information. Those elements can be differentiated using the visual variables shown in the figure. The use of position is rather restricted in the case of a map, since the real position of the element to be rendered should be respected.
It is seldom used. The shape is defined by the perimeter of the object. This variable is mostly used in the case of point data, using a symbol of a given shape located at the exact coordinates of the point to be rendered.
It is difficult to apply to linear symbols and in the case of areal symbols it requires altering the shape of the symbol itself. Size indicates the dimensions of the symbol. In the case of points, it can be applied by changing the size of the symbol itself. In the case of lines, changing their thicknesses is the most usual way of applying this visual variable on them.
It is not used in areal symbols, except in the case of using a texture fill, in which the size variable is applied to the texture and not to the symbol itself.
Size alters how other visual variables are perceived, especially in the case of small sizes. Texture refers to the pattern used to fill the body of the symbol. It can be applied to lines, using dash patterns, but it is mostly applied to areal symbols. Color is the most important of all visual variables.
Two of its components can be used as individual visual variables themselves: hue and value. Hue is what we usually call color, that is, the name of the color (blue, red, green, etc.). Hue can be altered by the hue of surrounding elements, especially in small symbols. Although human perception is very sensitive to hue, it might be difficult to identify in small symbols, and it can be wrongly identified if the symbol has other larger ones with different hues in its surroundings.
Value defines the darkness of the color. For instance, light blue and dark blue have the same hue, but they have different value. Differentiating two symbols by their value can be difficult depending on the type of symbol. It is easier in the case of areal symbols, while in the case of linear and point symbols it depends on their size. Smaller sizes make it more difficult to compare values and to extract the information that the visual variable is trying to convey.
Orientation is applied to point symbols, unless they have some sort of symmetry that makes it difficult to identify the orientation of the symbol. For areal symbols, it is applied to their texture. It's not applied in the case of linear symbols. A visual variable is said to be associative if, when applied, it does not change the visibility of an element. That is, it's not possible to give more importance to an element using that visual variable.
A visual variable is said to be selective if, when applied, it generates different categories of symbols. A visual variable is said to be ordered if it can be used to represent a given ordering.
A visual variable is said to be quantitative when, apart from being ordered, it can be used to express ratios. In the above list, properties are ordered according to the so-called levels of organization.
The associative property is at the lowest level, while the quantitative one is at the highest. The level of organization of visual variables is relevant when combining them, as we will see later. Also, the level of organization of a variable defines the type of information that the variable can transmit. Starting with the associative property, we see that, except for size and value, visual variables do not emphasize one element over the others.
In other words, one element is not seen as more important than the rest of them when the visual variable is texture, hue, shape or position. With size, however, it is clear that a larger size gives symbols a more prominent role. In the same way, a darker value attracts the attention of the observer much more than a color with a lighter value. Regarding the selective property, we can say that a variable has a selective quality if, at a quick glance, we can easily identify the elements that belong to a given group which is defined by a visual variable.
The clearest example of this is hue. We can quickly separate from a set of symbols those that are red or yellow. All visual variables, except shape, have this property, although it might not be as strong as in the case of hue. Shape does not make elements form groups spontaneously. The ordered property is found in those visual variables that we can use to define an ordering. Only position, texture, size and value have the ordered property. For instance, in the image corresponding to the visual variable hue, we cannot say which element we would place at the beginning or end of a scale defined by hue itself.
With value, however, we can, since that scale would range from the lighter tones to the darker ones, and we can visually differentiate and sort them. Finally, the quantitative property is found in those visual variables that can be used to visually estimate quantities and ratios. Only position and size have it.
For instance, we can see that the big circles in the image corresponding to the size visual variable are more or less twice the size of the smaller ones. Visual variables can be combined (for instance, representing objects with different size and hue).
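The property assignments described above can be written down as a small lookup table, which makes it easy to ask which variables offer a given property. The sketch below is only an encoding of the statements in the text; the function names are hypothetical, and the rule that a combination has a property only when every variable in it does is an assumption of this sketch.

```python
# The seven visual variables discussed in the text
VARIABLES = ["position", "shape", "size", "hue", "value", "texture", "orientation"]

# Which variables have each property, following the description above
PROPERTIES = {
    "associative": set(VARIABLES) - {"size", "value"},
    "selective": set(VARIABLES) - {"shape"},
    "ordered": {"position", "texture", "size", "value"},
    "quantitative": {"position", "size"},
}


def variables_with(prop):
    """Return, in alphabetical order, the variables that have a property."""
    return sorted(PROPERTIES[prop])


def combination_has(variables, prop):
    """Treat a combination as having a property only if every member has it."""
    return all(v in PROPERTIES[prop] for v in variables)


ordered_options = variables_with("ordered")
size_and_hue_ordered = combination_has(["size", "hue"], "ordered")
size_and_value_ordered = combination_has(["size", "value"], "ordered")
```

Under this encoding, combining size with hue loses the ordered property because hue lacks it, while combining size with value keeps it.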
The properties of all the visual variables that are used must be considered, and if a given property is needed for the information that we want to convey, all those visual variables should have it. The perception of visual variables might be altered by the environment. It is important to study this from two points of view: perceptual constancy (how much we can modify visual elements and their surroundings before they fail to convey the same information and can be misidentified) and perceptive aids (how we can help visual elements to be perceived exactly in the way that we want).
Perceptual constancy defines how objects are perceived in the same way regardless of the changes in the environment. For instance, if an object is round, such as a wheel, it will have a round shape when we look at it from a perpendicular direction. If we now look at it from a different angle, we will see an ellipse instead of a circle. However, we will read it as round and will still identify its shape correctly. That is an example of the perceptual constancy of the shape.
Not all visual variables have such a perceptual constancy. When the perception of an element changes even if the object itself does not, a perceptual contrast is said to exist. Perceptual contrast might cause a visual element to be wrongly perceived and the information that it transmits to be misinterpreted. The following are some of the main ideas about perceptual contrasts to take into account when creating a map. Size is the visual variable that is most affected by perceptual contrasts.
The apparent size of an object might change if it is surrounded by other elements of a different size. This is particularly relevant when using point symbols in a map. Value is also altered when other elements with a different value appear nearby, especially if there are a large number of them. Hue is altered by the presence of other hues.
In a map, we should consider how the background color might affect the foreground symbols. Complementary hues, when put together, might cause a vibration sensation in the border between them. Regarding perception aids, the most important factor when creating a map is the correct separation between the foreground objects and the background. The properties of the visual variables must be used to create different levels in the visualization, assigning more relevance to some elements in order to focus the attention on the information that they transmit.
To make certain layers (the most relevant ones for the purpose of the map) more visible, a correct hierarchy must be established with the help of visual variables. This hierarchy will add depth to the information displayed in the map, and some elements will be perceived as being more important than others. Layer ordering already defines a structure and a hierarchy, but that is not enough in most cases and visual variables should be used to reinforce it.
Maps are a method of communication that uses a language with a particular purpose: describing spatial relations. A map is, therefore, a symbolic abstraction of a real-world phenomenon, which implies that it has some degree of simplification and generalization.
The visual language that we have just seen becomes a cartographic language when it is adapted to the particular case of creating maps, and knowing its rules is needed to create cartography that is later useful for the map user. All these ideas related to map production form what is known as cartographic design. Cartographic design involves making decisions (in this case, by the GIS user, who takes the role of the cartographer).
These decisions must be guided by the purpose of the map and the target audience. Depending on these factors, the cartographer must decide the projection (which does not always have to be the original one of the data), the scale (depending on the level of detail and taking into account the limitations of the data), the type of map (we will see more about this later in this chapter), or the symbols to use, among other things. There are two main types of cartography: base cartography (also called fundamental or topographic) and thematic cartography.
Historically, base cartography represents the classic maps that have been created by cartographers. This type of map serves the purpose of precisely describing what is on the surface of the Earth. Thematic cartography focuses on displaying information about a given phenomenon (a given geographical variable), which can be of any type: physical, social, political, cultural, etc.
We exclude from this list those phenomena that are purely topographic, which are the subject matter of base cartography. We can also say that base cartography represents physical elements (a stream, a coast line, a road, a valley, etc.). Thematic cartography uses base cartography (usually included in thematic maps) to help the map user to understand the spatial behavior of the variable being represented, and also to provide a geographical context for it.
We already know that the thematic component of geographical information can be numeric or alphanumeric and that numeric variables can be nominal, ordinal, interval, or ratio. Selecting a correct symbology according to the type of information that we are working with is key to producing an effective map.
In particular, we must use a visual variable that has the correct properties (levels of organization) for the variable that we want to visualize. For instance, the associative property and the selective property are of interest just for qualitative information, while size is the only visual variable that we can use that has the quantitative property and therefore, the only one that should be used to represent ratios.
The following are some of the more important ideas about this, organized by the aforementioned types of information. Nominal information is correctly represented using the visual variable shape. This information shows what is found in the different locations of a map, and not how much is found, and it is more related to base cartography than to thematic cartography. Using different symbols for point elements and line elements is a common and very effective solution.
For the case of areal symbols, hue and texture are the most common solutions. Alphanumeric information has similar properties, and the same ideas apply to it. Ordinal information: since values of the variable define an order, a visual variable with the ordered property is needed to correctly visualize this type of information. Interval and ratio information: visual variables with the ordered property can be used in this case.
However, size is a better choice, as it is the only one we can use that has the quantitative property. Values are normally grouped into classes, so the same value of the visual variable (same size of the symbols or same color value, for instance) is used for different values of the variable that we are visualizing. There are different strategies for this, which try to maximize the information that the map transmits. The most common ones are equal intervals, intervals using percentiles, or natural intervals (intervals that try to minimize the variance within each class).
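The difference between the first two strategies is easy to see on a small skewed sample. The sketch below classifies ten made-up values into four classes using equal intervals and using quantiles, and counts how many values fall in each class. The data and the number of classes are assumptions, and natural intervals are left out because they need an optimization step.

```python
import numpy as np

# Hypothetical skewed data, e.g. a variable with one very large value
values = np.array([1, 2, 2, 3, 4, 5, 7, 9, 15, 40], dtype=float)
n_classes = 4

# Equal intervals: split the range of the data into same-width classes
equal_breaks = np.linspace(values.min(), values.max(), n_classes + 1)

# Quantiles: choose breaks so classes hold a similar number of values
quantile_breaks = np.quantile(values, np.linspace(0, 1, n_classes + 1))


def classify(v, breaks):
    """Assign class indices using only the interior breaks.

    With right=True, a value equal to a break falls in the lower class.
    """
    return np.digitize(v, breaks[1:-1], right=True)


equal_classes = classify(values, equal_breaks)
quantile_classes = classify(values, quantile_breaks)

# How many observations end up in each class under each scheme
equal_counts = np.bincount(equal_classes, minlength=n_classes)
quantile_counts = np.bincount(quantile_classes, minlength=n_classes)
```

With skewed data, equal intervals put most observations in the first class and can leave some classes empty, whereas quantiles spread observations more evenly at the cost of classes with very different widths.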
Using one or another of these methods can have a noticeable effect on the visualization, as the corresponding figure shows. It is important to remark that, although levels of organization indicate increasing potential (that is, with a variable such as size or value we can convey all the information that can be conveyed with hue, since they have properties with a higher level), it is not always better to use visual variables with a higher level of organization.
A map is not just the part that represents the geographical information, but a set of multiple elements, one of which is the one that contains the geographical information itself. A correct layout of the map elements is as important as a correct symbology, since these, like symbology itself, are designed to help the map user to better interpret the information that it contains.
The following are the main elements that can be used to compose a map (see figure). Name or title. Needed to know what information is contained in the map.
Creator of the map. Additional information about the map. For instance, the coordinate reference system used or its creation date, among others. Data frame. The frame which contains the rendered geographical information. It is the central element and will use most of the space of the map. On top of the data frame, it locates the content of the map on the Earth and provides a geographical reference. It serves the same purpose as the scale, helping to estimate distances.
It is usually added at all scales, but it is more relevant in the case of small scales. When designing a map, we should try to use a symbology that is as expressive as possible. However, sometimes it is not possible to include all the information with just the symbology itself and a legend is required.
The legend has to be clear and easy to interpret as well. A legend that is too large or difficult to understand is probably telling us that the symbology that we have selected can be improved.