Understanding R-Squared Version A
This is a non-technical explanation of the correlation coefficient R and its square, R-Squared.
In this version I have added material to assist readers with technical training.
Correlation Coefficient R
Visualize a scatter diagram. Now normalize the x axis so that all data are centered about zero with x ranging from minus 1 to plus 1.
There must be at least one value of x that equals minus 1 and at least one value of x that equals plus 1.
Similarly, center the y axis so that all values range from y=-ymax to +ymax. [Division by ymax comes later. In version A, I delay the normalization of the values of y.]
The values of x and y are otherwise allowed to be independent. The value of y can be anything when x equals minus 1 or plus 1. Similarly, the value of x can be anything when y equals -ymax or +ymax.
Now fit the best possible straight line to the data across the full range of x. HINT: Excel does this for us with a least-squares trendline.
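Here is a minimal sketch of this scaling and the straight-line fit in Python with NumPy. The sample data are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up sample: y depends weakly on x, plus noise.
x_raw = rng.uniform(10.0, 50.0, size=30)
y_raw = 0.3 * x_raw + rng.normal(0.0, 8.0, size=30)

# Scale x so it is centered about zero and runs from -1 to +1.
x_mid = (x_raw.max() + x_raw.min()) / 2.0
x_half = (x_raw.max() - x_raw.min()) / 2.0
x = (x_raw - x_mid) / x_half           # now min(x) = -1, max(x) = +1

# Center y so it runs from -ymax to +ymax (division by ymax comes later).
y = y_raw - (y_raw.max() + y_raw.min()) / 2.0
ymax = y.max()                          # equals -y.min() after centering

# Fit the best straight line (least squares), as Excel's trendline does.
slope, intercept = np.polyfit(x, y, 1)
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}, ymax = {ymax:.3f}")
```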
Randomness in the data constrains the slope of this line, since no value of y can exceed +ymax or fall below -ymax.
Added Technical Details
To a good approximation, when the distribution is symmetrical, the straight line passes through the origin (x=0 and y=0).
If we take all of the values of y, while ignoring x, we can calculate a total variance VAR_TOTAL and its square root, the TOTAL_STANDARD_DEVIATION. [We actually calculate estimates.]
Ideally, with a normal distribution, 68% of the values of y would fall between plus and minus one TOTAL_STANDARD_DEVIATION. For a real sample, the range of the data (to within two or three data points) would be 1.6 to 2.0 times the TOTAL_STANDARD_DEVIATION. That is, ymax would be close to 1.6 to 2.0 times the TOTAL_STANDARD_DEVIATION. Regardless, the data range (within two or three data points) is plus and minus some such multiple k times the TOTAL_STANDARD_DEVIATION.
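The 1.6-to-2.0 multiplier can be checked with a quick simulation. This sketch assumes normal samples of a couple dozen points (the text does not fix a sample size); the multiplier grows slowly as the sample gets bigger.

```python
import numpy as np

rng = np.random.default_rng(1)

n = 20          # assumed sample size; k grows slowly with n
ratios = []
for _ in range(10_000):
    y = rng.normal(0.0, 1.0, size=n)
    half_range = (y.max() - y.min()) / 2.0
    ratios.append(half_range / y.std(ddof=1))

# For samples of a few dozen points, k typically lands near 1.6 to 2.0.
print(f"typical k = half-range / standard deviation = {np.mean(ratios):.2f}")
```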
If the correlation were 100%, there would be no scatter outside of the straight-line fit. Two TOTAL_STANDARD_DEVIATIONs (i.e., plus and minus one TOTAL_STANDARD_DEVIATION) would correspond to 68% of the total variation of the line as x varies from -1 to +1. The full range of variation would equal 2*k*TOTAL_STANDARD_DEVIATION as x ranges from -1 to +1 (within two or three data points, when using an actual sample). That is, at x=+1, ymax=k*TOTAL_STANDARD_DEVIATION. At x=-1, y=-ymax=-k*TOTAL_STANDARD_DEVIATION. Squaring, the straight line represents 4*k^2*VAR_TOTAL. It “explains” 100% of the total variance of the data.
When the correlation is between -100% and 100%, the straight line covers a fraction f of the range of the data. Its values at x=-1 and x=+1 are -f*ymax and +f*ymax, so the line spans 2*f*ymax = 2*f*k*TOTAL_STANDARD_DEVIATION. Squaring, the straight line accounts for f^2*4*k^2*VAR_TOTAL. It “explains” f^2 times the total variance of the data.
If we subtract the straight line from all data points y, the difference has a variance equal to (1-f^2)*VAR_TOTAL.
R-squared is f^2. The correlation coefficient R is the fraction of the total range of the data that the straight line covers. The values of the straight line at x=-1 and x=+1 occur at -R*ymax and +R*ymax.
If we normalize the y axis by dividing by ymax, then the straight line equals -R and +R when x=-1 and x=+1. If the value of the straight line is positive when x is positive, then the correlation is positive. If the value of the straight line is negative when x is positive, then the correlation is negative.
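This bookkeeping can be verified numerically. A minimal sketch, again with made-up data: fit the line, subtract it, and confirm that the leftover variance equals (1-R^2)*VAR_TOTAL.

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up data with a moderate correlation.
x = rng.uniform(-1.0, 1.0, size=500)
y = 0.5 * x + rng.normal(0.0, 1.0, size=500)

r = np.corrcoef(x, y)[0, 1]             # correlation coefficient R
slope, intercept = np.polyfit(x, y, 1)  # least-squares straight line
resid = y - (slope * x + intercept)     # subtract the line from the data

var_total = np.var(y)
var_resid = np.var(resid)

print(f"R         = {r:.4f}")
print(f"R-squared = {r**2:.4f}")
print(f"1 - var_resid/var_total = {1 - var_resid / var_total:.4f}")  # matches R^2
```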
Interpretation of Graphs
The value of the straight line when x=+1 equals the correlation coefficient R times +ymax.
If the slope is zero (that is, if the line is a constant), then the correlation coefficient is zero. Knowledge of x tells us nothing about y.
A correlation coefficient of 20% to 30% shows that x has a major influence on the values of y. It amounts to 20% to 30% of the total variation of y. But even after removing the effect of x, the values of y still retain most of their scatter (i.e., randomness). This is because variances add, not standard deviations, when there is randomness.
We can make the effects of x visible by taking many samples and averaging. The randomness of the average (i.e., mean) and of the median (mid-point) decreases substantially. When viewed individually, each new sample has the full amount of randomness.
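A minimal sketch of why averaging helps, assuming independent, normally distributed observations: the scatter of the mean shrinks roughly as one over the square root of the sample size, while each individual observation keeps its full scatter.

```python
import numpy as np

rng = np.random.default_rng(3)

sigma = 1.0      # scatter of a single observation
for n in (1, 10, 100, 1000):
    # Take many samples of size n and look at the scatter of their means.
    means = rng.normal(0.0, sigma, size=(20_000, n)).mean(axis=1)
    print(f"n = {n:5d}: scatter of the mean = {means.std():.3f} "
          f"(theory: {sigma / np.sqrt(n):.3f})")
```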
Regression equations allow us to get similar statistical benefits at many different values of x.
R-Squared
As a rule, variances add. Standard deviations do not.
[When variances don’t add, factors are mutually dependent. We introduce correction terms known as covariances.]
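As a quick check, here is a minimal sketch with two independent random series: their variances add, but their standard deviations do not.

```python
import numpy as np

rng = np.random.default_rng(4)

a = rng.normal(0.0, 3.0, size=100_000)   # std 3, variance 9
b = rng.normal(0.0, 4.0, size=100_000)   # std 4, variance 16

print(f"var(a) + var(b) = {a.var() + b.var():.2f}")   # about 25
print(f"var(a + b)      = {(a + b).var():.2f}")       # about 25: variances add
print(f"std(a) + std(b) = {a.std() + b.std():.2f}")   # about 7
print(f"std(a + b)      = {(a + b).std():.2f}")       # about 5: std devs do not
```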
The correlation coefficient R is a fraction of the standard deviation (after scaling the x and y axes). R tells us how much x influences y. R-squared tells us how much x influences the variance of y.
The total variance VAR_TOTAL of y, when ignoring x, equals the sum of the variance of the effect of x (which is R-squared times the total variance) plus the variance of what remains (which is (1-R-Squared) times the total variance). If x causes 20% to 30% of the variation of y, it removes only a fraction 0.04 to 0.09 of the total variance. More than 90% of the randomness of each individual sample remains in effect.
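The arithmetic, spelled out for R = 0.20 and R = 0.30:

```python
for r in (0.20, 0.30):
    explained = r**2                    # fraction of the variance removed by x
    remaining_var = 1.0 - explained     # fraction of the variance left over
    remaining_std = remaining_var**0.5  # fraction of the scatter left over
    print(f"R = {r:.2f}: R^2 = {explained:.4f}, "
          f"variance left = {remaining_var:.2f}, scatter left = {remaining_std:.3f}")
```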
Generalization
The scaling that I described restricts statistical distributions to finite values of x and y. This makes sense when handling data.
The actual scaling is different: the standard formula divides the covariance of x and y by the product of their standard deviations, R = COVARIANCE(x,y) / (STANDARD_DEVIATION(x) * STANDARD_DEVIATION(y)). The actual formulas are what make sense mathematically.
Example
The Stock-Return Predictor has outer confidence limits of minus and plus 6% at Year 10, a total range of 12%.
Recently, I looked at the refinement possible by introducing earnings growth rate adjustments.
Stock Return Predictor with Earnings Growth Rate Adjustment
From the new calculator, I determined that different earnings growth rate estimates could vary the Year 10 most likely (real, annualized, total) return prediction from 0.89% to 3.11% when starting at today’s valuations. The total variation is 2.22%, which is 18.5% of the total range of uncertainty (12%, from minus 6% to plus 6%) inherent in Year 10 predictions.
Introducing earnings growth is equivalent to adding a factor with a correlation coefficient R of 18.5% and an R-Squared of 0.0342.
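The numbers behind that statement, as a quick check:

```python
spread = 3.11 - 0.89        # variation from the growth rate estimates: 2.22%
total_range = 6.0 - (-6.0)  # outer confidence limits at Year 10: 12%

r = spread / total_range
print(f"R   = {r:.3f}")      # 0.185
print(f"R^2 = {r**2:.4f}")   # 0.0342
```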
I consider the earnings growth rate to be an important factor, especially for bottom-up modeling. This example illustrates that important factors can have low values of R-Squared.
Never dismiss a result simply on the basis of R-Squared. Remember that means and medians can be made visible by collecting more data. Always consider the application.
Have fun.
John Walter Russell
January 24, 2007