Simple Regression and Correlation Analysis

At times there seems to be some confusion about simple linear regression versus correlation analysis. This issue is well summed up in a review by Zou et al., from which the following information is largely drawn. The two are similar mathematically, but their purposes differ. Both relate to a function that describes the relationship between a given X and Y; however, regression analysis generally focuses on the form of that relationship and correlation on its strength. Furthermore, regression evaluates the relative impact of a predictor variable on a particular (dependent) outcome, whereas correlation examines the strength and direction of the relationship between two random variables.

Correlation analysis commonly involves either the Pearson coefficient (r) or the Spearman coefficient (rs), with the former reflecting proportional changes in one variable when the other is changed, and the latter using ranks and reflecting instead a monotonic relationship between two variables (i.e., whether one tends to take either a larger or a smaller value than the other, but not necessarily with a proportional change in one variable when the other one is changed). If data sets are skewed or contain outliers, the Spearman coefficient rather than the Pearson is the appropriate choice.
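The distinction can be illustrated with a short sketch. The helper functions below are illustrative, not from any particular statistics package: Pearson r is computed from centered products, and Spearman rs is simply Pearson r applied to the ranks (the rank step here is simplified and assumes no tied values, which would otherwise need averaged ranks). A monotonic but nonlinear data set shows the two coefficients disagreeing.

```python
import math

def pearson(x, y):
    # Pearson r: degree of linear (proportional) association
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman(x, y):
    # Spearman rs: Pearson r applied to the ranks of the data
    # (simplified: assumes no tied values)
    rank = lambda v: [sorted(v).index(a) + 1 for a in v]
    return pearson(rank(x), rank(y))

x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]           # y = x**2: monotonic but not linear
print(round(pearson(x, y), 4))  # strong but imperfect linear fit (~0.981)
print(spearman(x, y))           # perfect monotonic agreement (1.0)
```

Because y always increases with x, the rank-based rs is exactly 1.0, while r falls short of 1.0 since the change is not proportional.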

Interpretation of correlation coefficients is often rather qualitative, with the sign indicating the direction of the relationship (positive or negative). Values range from −1.0 to +1.0, with a magnitude of 0.0 indicating no correlation and 1.0 a perfect one; a magnitude of 0.5 is generally thought of as moderate, 0.8 as strong, and 0.2 as weak. Statistical significance can also be computed, by formulating a null hypothesis of no correlation and a one-sided alternative hypothesis that the underlying value exceeds (or falls below) the null value, then computing the z-test statistic and rejecting the null hypothesis based on the p-value.
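A minimal sketch of such a z-test, assuming the standard Fisher z-transform approach (the function name `corr_z_test` is a hypothetical label, not an API from any library):

```python
import math

def corr_z_test(r, n, r0=0.0):
    # Fisher z-transform stabilizes the variance of r; under the null
    # hypothesis the statistic below is approximately standard normal
    z = (math.atanh(r) - math.atanh(r0)) * math.sqrt(n - 3)
    # one-sided p-value, P(Z > z), via the normal survival function
    p = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p

# e.g., an observed r of 0.5 from n = 30 pairs against a null of r = 0
z, p = corr_z_test(0.5, 30)
print(round(z, 3), round(p, 4))
```

With r = 0.5 and n = 30 the statistic is about 2.85 and the one-sided p-value falls well below 0.05, so the null hypothesis of no correlation would be rejected at conventional significance levels.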

Simple linear regression analysis results in an r2 value that is calculated from the Pearson r coefficient and reflects the fraction of the variability in y that can be explained by the variability in x through their linear relationship (or vice versa). As with correlation analysis, a finding of a strong linear relationship in a regression analysis does not mean that the variable causes the outcome (as discussed in the section on misuses of correlation), and should not be interpreted that way. Student's t-test can be used, for example, to test whether there is a linear relationship (i.e., a null hypothesis of slope = 0) or whether the y-intercept takes a particular value. As a salient point, predictions should never be made by extrapolating outside the range of values of the independent variable used in the regression analysis.
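The quantities above can be computed in a few lines. The sketch below (an illustrative implementation, not taken from the review) fits a least-squares line, derives r2 from the same sums of products used for Pearson r, and forms the t-statistic for the null hypothesis of zero slope, which would be compared against a t distribution with n − 2 degrees of freedom.

```python
import math

def simple_regression(x, y):
    # least-squares fit of y = intercept + slope * x, plus r^2 and
    # the t-statistic for the null hypothesis that the slope is zero
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    r2 = sxy ** 2 / (sxx * syy)             # fraction of variance explained
    t = math.sqrt(r2 * (n - 2) / (1 - r2))  # compare to t with n-2 df
    return slope, intercept, r2, t

x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]  # roughly y = 2x with noise
slope, intercept, r2, t = simple_regression(x, y)
# note: the fitted line supports predictions only for x within
# [min(x), max(x)]; extrapolating beyond that range is unjustified
```

Here the large t-statistic would lead to rejecting the null hypothesis of slope = 0, consistent with the near-1 r2.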