Sunday, November 8, 2015

Lab 11 - Multivariate Regression, Diagnostics and Regression in ArcGIS

In this lab we learned to perform multivariate regression analysis and to work with more advanced regression diagnostics in ArcGIS. In the first two parts of the lab, we used Excel to perform regression analysis and diagnostics, including choosing the best model for a given dataset. In the last two parts, we used regression analysis in ArcGIS to test which factors contribute to the high volume of 911 calls in Portland, OR. First I ran the Ordinary Least Squares tool with population as the only explanatory variable, to see whether population alone could account for the 911 calls. The coefficient of the POP variable is 0.016. This shows a positive relationship, in that an increase in population leads to an increase in 911 calls, but the coefficient is low, so each unit increase in population adds only a small number of calls. I expected a positive relationship, but I expected it to be stronger. The Jarque-Bera test is statistically significant, meaning there is a very low probability that the residuals are normally distributed. This implies that the model is not properly specified and that more explanatory variables need to be included.
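To make the two ideas above concrete outside of ArcGIS, here is a minimal Python sketch of a one-variable least-squares fit followed by a Jarque-Bera check on the residuals. The population and call counts are made-up toy numbers, not the lab's Portland data, and the function names are my own. The p-value uses the standard large-sample chi-square approximation with 2 degrees of freedom, which is rough for a sample this small.

```python
import math

def simple_ols(x, y):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def jarque_bera(residuals):
    """Return (JB statistic, approximate p-value) for a list of residuals."""
    n = len(residuals)
    mean = sum(residuals) / n
    m2 = sum((r - mean) ** 2 for r in residuals) / n
    m3 = sum((r - mean) ** 3 for r in residuals) / n
    m4 = sum((r - mean) ** 4 for r in residuals) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    jb = n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)
    # A chi-square variable with 2 degrees of freedom has
    # survival function exp(-x / 2), so no stats library is needed.
    p_value = math.exp(-jb / 2.0)
    return jb, p_value

# Toy data (assumed for illustration only, not the lab dataset).
population = [1200, 2500, 3100, 4000, 5200, 6100, 7500, 8800]
calls = [15, 40, 38, 70, 75, 110, 105, 160]

intercept, slope = simple_ols(population, calls)
residuals = [c - (intercept + slope * p) for p, c in zip(population, calls)]
jb, p_value = jarque_bera(residuals)

print("slope:", slope)
print("Jarque-Bera:", jb, "p-value:", p_value)
```

A small p-value here would correspond to the statistically significant Jarque-Bera result described above, where the residuals are unlikely to be normally distributed.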
Knowing that the model needed more variables, I ran another analysis using three explanatory variables: Population, Low Education, and Distance to Urban Centers. This model shows that an increase in population or in the prevalence of low education leads to an increase in 911 calls, and that a decrease in the distance from urban centers also leads to an increase in 911 calls. The Jarque-Bera statistic is not statistically significant, which means that the residuals are consistent with a normal distribution and that the model is not obviously missing key explanatory variables. The VIF for all three explanatory variables is between 1 and 2, so the variables are not redundant (I'm not using too many variables). Based on the adjusted R-squared value, 74% of the variation in the number of 911 calls can be explained by population, low education, and distance from urban centers. This is a good model, but how do I know it's the best?
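For reference, the two summary numbers used above have simple textbook formulas, sketched here in Python. The function names and the example inputs (an R-squared of 0.75, 100 observations) are assumptions chosen for illustration, not values from the lab output.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Penalize R-squared for the number of explanatory variables used."""
    return 1.0 - (1.0 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

def vif(r_squared_aux):
    """Variance inflation factor for one explanatory variable.

    r_squared_aux is the R-squared from regressing that variable
    on all of the other explanatory variables.
    """
    return 1.0 / (1.0 - r_squared_aux)

# Hypothetical inputs, for illustration only.
print(adjusted_r_squared(0.75, n_obs=100, n_predictors=3))
print(vif(0.0))   # a variable unrelated to the others
print(vif(0.5))   # half of its variation explained by the others
```

Read this way, a VIF between 1 and 2 means that at most half of a variable's variation is explained by the other explanatory variables, which is why such variables are not considered redundant.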

Part D investigates how to determine the best model, using the Exploratory Regression tool. This tool runs a regression analysis on all combinations of the selected candidate explanatory variables, and we can then compare the resulting statistics to pick the best model. In this case, the best model used four explanatory variables: Jobs, Low Education, Distance to Urban Centers, and Alcohol. Three of the four variables have a positive relationship with 911 calls: jobs, low education, and alcohol. Distance to urban centers has a negative relationship with 911 calls. The performance of the model is judged partly by the adjusted R-squared value, which measures how much of the variation in the dependent variable can be explained by the explanatory variables. We also look at the VIF, where values greater than 7.5 indicate that variables are redundant. The Jarque-Bera statistic tests whether the residuals are normally distributed; if they are, that is one sign of a properly specified model. Another check is the Spatial Autocorrelation (Global Moran's I) tool, which produces a report with a chart showing whether the residuals are clustered or randomly distributed in space. If the residuals are clustered, we need to include more variables in the analysis.
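The "try every combination" idea can be sketched in a few lines of plain Python. This is only an illustration of the general approach, not how the ArcGIS tool is implemented: it ranks models by adjusted R-squared alone, whereas the tool also screens on VIF, Jarque-Bera and other criteria. The variable names echo the lab, but the values are toy numbers I made up, and all function names are hypothetical.

```python
from itertools import combinations

def solve(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def fit_adjusted_r2(columns, y):
    """Fit OLS with an intercept and return the adjusted R-squared."""
    n, k = len(y), len(columns)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k + 1)]
           for i in range(k + 1)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k + 1)]
    beta = solve(xtx, xty)
    fitted = [sum(bj * rj for bj, rj in zip(beta, r)) for r in rows]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

def exploratory_regression(candidates, y, max_vars):
    """Score every combination of up to max_vars candidate variables."""
    results = []
    for size in range(1, max_vars + 1):
        for combo in combinations(sorted(candidates), size):
            score = fit_adjusted_r2([candidates[name] for name in combo], y)
            results.append((score, combo))
    results.sort(reverse=True)  # highest adjusted R-squared first
    return results

# Toy data (assumed): calls rise mainly with jobs and low_educ.
candidates = {
    "jobs": [1, 2, 3, 4, 5, 6, 7, 8],
    "low_educ": [2, 1, 4, 3, 6, 5, 8, 7],
    "alcohol": [5, 3, 8, 1, 9, 2, 7, 4],
}
calls = [8.5, 6.7, 18.2, 16.6, 28.1, 27.3, 37.8, 36.8]

ranked = exploratory_regression(candidates, calls, max_vars=3)
for score, combo in ranked:
    print(round(score, 4), combo)
```

With three candidates this fits seven models. The number of combinations grows quickly as candidates are added, which is why it helps to have a tool do the bookkeeping.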
