Monday, November 16, 2015

Lab 12 - Geographically Weighted Regression

This lab is something of an extension of last week's, which dealt with Ordinary Least Squares (OLS) regression. OLS regression is a numeric, non-spatial analysis that predicts values of a dependent variable based on independent, or explanatory, variables. The key point is that with OLS regression, all observations are given the same weight when predicting the dependent variable's values. This week with Geographically Weighted Regression (GWR), we learned how to perform the same type of analysis, but now the analysis is weighted: areas that are spatially closer to the point we are trying to predict are given more weight than areas farther away. GWR allows us to analyze how the relationships between the explanatory variables and the dependent variable change over space.
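To make the weighting idea concrete, here is a minimal Python sketch of a single local regression. It is not what ArcGIS does internally; the Gaussian kernel, the bandwidth of 1, the function names, and the six toy observations are all assumptions chosen for illustration.

```python
import math

def gaussian_weights(target, points, bandwidth):
    """Give each observation a weight that shrinks with distance from target."""
    weights = []
    for x, y in points:
        d = math.hypot(x - target[0], y - target[1])
        weights.append(math.exp(-0.5 * (d / bandwidth) ** 2))
    return weights

def weighted_fit(xs, ys, ws):
    """Weighted least squares for y = a + b*x; returns (a, b)."""
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    b = sxy / sxx
    return my - b * mx, b

# Toy data (made up): six locations along a line, one explanatory variable.
coords = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (5, 0)]
xs = [1, 2, 3, 4, 5, 6]
ys = [2, 4, 6, 5, 4, 3]   # relationship is positive in the west, negative in the east

# OLS is the special case where every observation gets the same weight.
_, slope_global = weighted_fit(xs, ys, [1.0] * len(xs))

# GWR-style local fits at the western and eastern ends of the study area.
_, slope_west = weighted_fit(xs, ys, gaussian_weights((0, 0), coords, 1.0))
_, slope_east = weighted_fit(xs, ys, gaussian_weights((5, 0), coords, 1.0))

print(slope_global, slope_west, slope_east)
```

The single global slope hides the fact that the relationship changes sign across the toy study area, while the two local fits recover it. That is the kind of spatial variation GWR is meant to reveal.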

The first part of the lab has us using the data from last week's lab on OLS and performing a GWR analysis using the same dependent and independent variables. We were then to compare the two analyses to determine which was better. In this case, GWR gave a higher adjusted R-squared value and a lower AIC value, so GWR was better.

In the second part of the lab, we were to complete a regression analysis (both OLS and GWR) for crime data. For the analyses, I needed to select a type of high-volume crime, so I chose auto theft. After joining the auto theft data to the census tracts, I added a new Rate field, as crime statistics are investigated using rates instead of raw counts. I created a correlation matrix in Excel using the 5 independent variables in the lab:
BLACK_PER - Residents of black race as a % of total population
HISP_PER - Residents of Hispanic ethnicity as a % of total population
RENT_PER - Renter occupied housing units as a % of total housing units
MED_INCOME - Median household income ($)
HU_VALUE - Median value of owner occupied housing units ($)
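I built the matrix in Excel, but the same calculation can be sketched in a few lines of Python. The tract values below are invented purely for illustration (they are not the lab data), RATE is a hypothetical column name for the auto theft rate, and only two of the five variables are shown to keep it short.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def correlation_matrix(columns):
    """Return a nested dict: matrix[row][col] = correlation of the two columns."""
    names = list(columns)
    return {r: {c: pearson(columns[r], columns[c]) for c in names} for r in names}

# Made-up values for five tracts (NOT the lab data).
tracts = {
    "RATE": [12, 18, 30, 33, 41],                        # thefts per 10,000 people
    "RENT_PER": [20, 35, 50, 65, 80],                    # % renter occupied
    "MED_INCOME": [70000, 60000, 45000, 38000, 30000],   # median income ($)
}

corr = correlation_matrix(tracts)
for row, values in corr.items():
    print(row, {col: round(v, 2) for col, v in values.items()})
```

Reading down the RATE column gives each variable's correlation with the dependent variable, and the off-diagonal entries between explanatory variables show which pairs are strongly correlated with each other.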

The coefficients in the matrix measure the strength of the linear relationship between each independent variable and our dependent variable, in this case the rate of auto thefts per 10,000 people. Based on the correlation matrix, I chose median income, renter occupied housing, and black population percentage as my 3 independent variables. I did not use variables whose correlation coefficient was very nearly 0. I also did not use both variables with negative coefficients, as they were strongly correlated with each other; I kept only the one that was more strongly negatively correlated with the dependent variable. Using ArcGIS, I performed an OLS analysis and used Moran's I to test for spatial autocorrelation in the residuals. I ran into an issue here: based on Moran's I, the residuals were clustered (z-score of over 9). Often that is a sign that not enough independent variables were included in the analysis, so I ran it a couple more times using more and fewer independent variables, but the clustering remained, so I went with my original 3 independent variables based on both the correlation matrix and the statistically significant p-values. The adjusted R-squared value (0.220) is much lower than what I am used to seeing from last week and the first part of this lab, and the OLS model had an AIC of 2514.415.
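For anyone curious what Moran's I is measuring, here is a small Python sketch of the global statistic. It assumes a simple binary neighbor matrix and six made-up residuals along a line of tracts. It computes only the index itself, not the z-score that ArcGIS reports, and the function name is my own.

```python
def morans_i(values, weights):
    """Global Moran's I for a list of values and an n x n spatial weights matrix."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)          # sum of all weights
    cross = sum(weights[i][j] * z[i] * z[j]
                for i in range(n) for j in range(n))
    return (n / s0) * cross / sum(v * v for v in z)

def chain_weights(n):
    """Binary weights for n areas in a row: each area neighbors the ones beside it."""
    return [[1 if abs(i - j) == 1 else 0 for j in range(n)] for i in range(n)]

w = chain_weights(6)

clustered = [3, 2, 1, -1, -2, -3]      # similar residuals sit next to each other
alternating = [1, -1, 1, -1, 1, -1]    # neighbors always have opposite signs

i_clustered = morans_i(clustered, w)
i_alternating = morans_i(alternating, w)
expected_random = -1 / (len(clustered) - 1)   # expected I under no autocorrelation

print(i_clustered, i_alternating, expected_random)
```

A value well above the expected value points to clustering, a value well below it points to dispersion, and a value near it is consistent with a random pattern. The z-score in ArcGIS tests how far the observed index is from that expectation.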
I then performed a GWR analysis using the same dependent and independent variables. There was an improvement in the adjusted R-squared (0.300) and AIC (2471.306) values, but the most dramatic improvement came when I ran Moran's I on the residuals and found a z-score of -1.073, which means the residuals are no longer clustered and are now consistent with a random spatial pattern. I think the adjusted R-squared values are so low because none of the independent variables showed a really strong correlation with the auto theft rate (most of the values were around 0.4 or 0.5). If there isn't a strong correlation, there is likely another factor not included in the analysis that drives the auto theft rate.
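The comparison between the two models can be written out as a quick check in Python using the values reported above. The AIC threshold of 3 is a commonly cited rule of thumb that I am assuming here, and the function and key names are my own.

```python
def compare_models(ols, gwr, aic_threshold=3.0):
    """Compare two models by adjusted R-squared (higher is better) and AIC (lower is better)."""
    aic_drop = ols["aic"] - gwr["aic"]
    adj_r2_gain = gwr["adj_r2"] - ols["adj_r2"]
    return {
        "aic_drop": aic_drop,
        "adj_r2_gain": adj_r2_gain,
        # Assumed rule of thumb: an AIC drop larger than the threshold is meaningful.
        "gwr_preferred": aic_drop > aic_threshold and adj_r2_gain > 0,
    }

# Values reported in this post for the auto theft models.
ols_result = {"adj_r2": 0.220, "aic": 2514.415}
gwr_result = {"adj_r2": 0.300, "aic": 2471.306}

result = compare_models(ols_result, gwr_result)
print(result)
```

The AIC drops by about 43 and the adjusted R-squared rises by about 0.08, so by both measures GWR fits better here, even though neither model explains much of the variation.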
I'm not sure that either analysis is well suited to predicting the rate of auto thefts, primarily because of the low adjusted R-squared values. The best I could manage was 0.300, which means that 30% of the spatial variation in auto thefts per 10,000 people can be explained by the 3 independent variables used. In the center of the map, the observed values of auto theft were higher than the predicted values (over 2.5 standard deviations in a couple of spots), and observed values were slightly lower than predicted a little further west and north of center. Over the rest of the map, the observed and predicted values were very similar. A better result might come from using different or more independent variables (perhaps something like distance from urban centers, as in last week's lab), or from a different type of analysis.

Overall, I feel I have a better understanding of what these regression techniques are telling us, and a better idea of what to do to get better results. I feel I understand GWR better than OLS for some reason; maybe it's that the concept of closer objects being more related than objects farther away, which is the basic premise behind GWR, makes sense to me.
