This week's lab covered introductory statistics, including correlation and bivariate regression. The first part of the lab introduced us to the calculation of simple descriptive statistics in Excel, including the median, mean, and standard deviation. We carried out these calculations for 3 different data sets. The second part of the lab introduced us to correlation coefficients. First, we computed the correlation coefficient for age vs. systolic blood pressure and created a scatterplot of the two. We also computed the correlations between various demographic variables for the southeastern United States. I found a strong positive correlation between the % of adults diagnosed as obese and the % diagnosed with diabetes, which is a well-known relationship.
In Part C we learned to perform bivariate regression. I calculated the intercept and slope and performed a regression analysis using the Data Analysis tool in Excel. I learned to use the adjusted R-squared value and the p-value for the slope to determine whether the relationship is significant. The adjusted R-squared value is basically how well the data fit a regression line: the share of the variation in the dependent variable that the regression explains, adjusted for the number of predictors. The p-value for the slope tests the null hypothesis that the coefficient is zero, which would mean that there is no linear relationship between the two variables. A low p-value (commonly below 0.05) means that the null hypothesis can be rejected and the relationship between the two variables is significant.
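For anyone who wants to see what Excel is doing behind the scenes, here is a minimal Python sketch of the same calculations. The age and blood pressure numbers are made up for illustration (they are not the lab's data), and the function names are my own. It leaves out the p-value, which needs a t-distribution that the standard library does not provide.

```python
import math
import statistics

# Made-up illustrative values, NOT the lab's data.
age = [25, 32, 41, 47, 53, 60, 68]
systolic = [112, 118, 121, 130, 134, 141, 150]

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def fit_line(x, y):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept

def adjusted_r_squared(x, y, slope, intercept):
    """Adjusted R-squared for a regression with a single predictor."""
    n = len(x)
    my = statistics.mean(y)
    ss_res = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    r2 = 1 - ss_res / ss_tot
    # One predictor, so the residual degrees of freedom are n - 2.
    return 1 - (1 - r2) * (n - 1) / (n - 2)

# Descriptive statistics from the first part of the lab.
print("mean:", statistics.mean(age))
print("median:", statistics.median(age))
print("sample standard deviation:", statistics.stdev(age))

# Correlation and bivariate regression.
r = pearson_r(age, systolic)
slope, intercept = fit_line(age, systolic)
print("r:", r)
print("slope:", slope, "intercept:", intercept)
print("adjusted R-squared:", adjusted_r_squared(age, systolic, slope, intercept))
```

Note that `statistics.stdev` is the sample standard deviation, which should correspond to Excel's STDEV.S rather than STDEV.P.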
Next we had a time series of annual precipitation for two stations. Data were available for both stations for the period 1950 to 2004, but data for station A were missing for the period 1931 to 1949. Our objective was to use regression analysis to estimate the missing precipitation values. I created a scatterplot for the two stations from 1950 to 2004 and performed a bivariate regression analysis using the Data Analysis tool in Excel. With the slope and intercept from that analysis, I could use the regression formula y = ax + b, where a is the slope, b is the intercept, x is the annual precipitation at station A, and y is the annual precipitation at station B. I wanted to solve for x, so from the initial equation:
y - b = ax, so x = (y - b) / a. I used this equation in Excel to calculate the missing annual precipitation values for station A.
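The same gap-filling step can be sketched in Python. The precipitation values below are invented placeholders rather than the lab's station records, and `fit_line` and `estimate_x` are names I made up for this example. The fit uses only the years where both stations have data, then the rearranged equation is applied to station B's values from the years where station A is missing.

```python
def fit_line(x, y):
    """Least-squares slope a and intercept b for y = a * x + b."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((p - mx) * (q - my) for p, q in zip(x, y))
    sxx = sum((p - mx) ** 2 for p in x)
    a = sxy / sxx
    b = my - a * mx
    return a, b

def estimate_x(y, a, b):
    """Rearranged regression equation: x = (y - b) / a."""
    return (y - b) / a

# Invented placeholder values (inches per year) for the overlapping period.
station_a = [30.1, 42.5, 36.8, 51.2, 39.4, 45.0]  # x
station_b = [28.4, 40.9, 33.5, 47.8, 38.2, 41.6]  # y

a, b = fit_line(station_a, station_b)

# Invented station B values for years where station A has no data.
station_b_early = [35.0, 44.2, 31.7]
station_a_est = [estimate_x(y, a, b) for y in station_b_early]

for y, x_hat in zip(station_b_early, station_a_est):
    print(f"station B = {y:.1f} -> estimated station A = {x_hat:.1f}")
```

A common alternative is to regress station A on station B directly, so that the missing station is the dependent variable and no rearranging is needed. The two approaches generally give somewhat different estimates unless the correlation is perfect.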
Obviously, working for the National Weather Service, we could not use these estimates as "official" values. The main assumptions here are that the missing data fit the regression line exactly and that the relationship between the two stations from 1931 to 1949 was the same as it was from 1950 to 2004, which most likely was not the case. However, the estimates are probably a reasonable approximation based on the rest of the data.