We all like to think we understand regression to the mean, as it arises frequently in statistics, genetics, psychology, economics, etc.
Roughly speaking, wherever there is a correlation between two variables, we expect the more extreme values of one variable to be associated with less extreme values of the other, which is said to ‘regress’ towards its mean value.
More precisely, for any pair of variables x and y, each measured in units of its own standard deviation, if the correlation between x and y is other than 0, 1, or -1, then, for any given value of x, the mean value of the y’s corresponding to that value of x will be closer to the mean of all y’s than that value of x is to the mean of all x’s. Since correlation is a symmetrical relationship, the same proposition holds if we interchange x and y throughout.
This, or something like it, is the definition usually given of ‘regression to the mean’, and it is commonly said that it is a necessary consequence – or even a ‘mathematical necessity’ – wherever there is an imperfect but non-zero correlation.
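For jointly normal variables the rule does hold exactly: in sd units, the mean of the y’s at a given x is rx, which is always nearer to zero than x. A quick simulation sketch (the correlation of 0.5, the sample size and the seed are just illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
r, n = 0.5, 500_000                 # illustrative correlation and sample size

# Jointly normal x and y, both already in sd units (mean 0, sd 1).
x = rng.standard_normal(n)
y = r * x + np.sqrt(1 - r**2) * rng.standard_normal(n)

# For x values about 2 sd above the mean, the mean of the corresponding y's
# is only about r * 2 = 1 sd above the mean of the y's.
band = (x > 1.9) & (x < 2.1)
print(x[band].mean())               # roughly 2.0
print(y[band].mean())               # roughly 1.0
```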
So I was disconcerted to find an example that apparently violates the general rule: a case where there is a non-zero correlation (and in fact quite a strong one) but no regression to the mean…
Godless comments: There is a fallacious assumption here, which is that the linear conditional MMSE formula applies to the case of non-normal random variables. See below.
David B comments: My original post anticipated this objection, saying: “I suppose one response to this puzzle, or paradox, is that the relationship between the variables in this case is not linear, so the standard Pearson formulae for linear regression and correlation are not appropriate. I would agree that the Pearson formulae are not ideal for this case, but I don’t see that in any strict sense they are invalid. The correlation does account for about half of the total variance, which is better than many correlations that are accepted as meaningful.”
So I don’t accept that the use of a linear regression formula is strictly a ‘fallacy’. I note that the statistics text that Godless links to seems to take a similar view, saying: “Such linear estimators may not be optimum; the conditional expected value may be nonlinear and it always has the smallest mean-squared error. Despite this occasional performance deficit, linear estimators have well-understood properties, they interact will [sic: presumably the author means ‘well’] with other signal processing algorithms because of linearity, and they can always be derived, no matter what the problem.”
In the present case, a linear regression is not ‘optimum’, but it accounts for about half the variance, which is not bad, and I’m not sure that any other formula would do much better.
Draw, or imagine, a scattergram as follows.
First draw the x and y axes, with x = y = 0 at the origin, and mark off 2 units (in inches, or whatever) along each axis in both directions from the origin.
Then draw a square with sides of length 2 units in the upper right quadrant, with its lower left corner on the origin. Draw a similar square in the lower left quadrant, with its upper right corner on the origin.
Now fill each square evenly with dots, except that no dots are to fall on the axes themselves.
Let each dot represent a pair of associated x and y observations.
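If you would rather generate the picture than draw it, here is a minimal sketch in Python (NumPy only; the sample size and variable names are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000                          # dots per square, an arbitrary choice

# Upper-right square: x and y each uniform on (0, 2), filled evenly.
x_pos, y_pos = rng.uniform(0, 2, n), rng.uniform(0, 2, n)

# Lower-left square: x and y each uniform on (-2, 0), filled evenly.
x_neg, y_neg = rng.uniform(-2, 0, n), rng.uniform(-2, 0, n)

# Each dot is a pair of associated x and y observations.
x = np.concatenate([x_pos, x_neg])
y = np.concatenate([y_pos, y_neg])
```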
It is evident that:
a. The mean of all the x observations is 0. Similarly, the mean of all the y observations is 0.
b. The mean of all the positive x observations is 1. Similarly, the mean of all the positive y observations is 1, while the mean of the negative x and y observations is -1.
c. The standard deviation (sd) of the x’s is equal to the standard deviation of the y’s. They are both greater than 1. (Each marginal distribution is uniform over the interval from -2 to 2, so the sd is 2/√3, about 1.15, but the precise value does not matter.)
d. The pairs of x’s and y’s have a positive covariance, since they all fall in the ‘positive’ quadrants of the scattergram. The covariance is approximately 1.
e. There is a positive correlation between the x’s and y’s. With a covariance of 1 and a variance of 4/3 for each variable, the Pearson product-moment correlation coefficient is 1/(4/3) = 0.75.
f. However, there is no regression to the mean. For each positive x value, the mean of the corresponding y values is 1, while for each negative x value it is -1. For half of the x values (those between 1 and 2 or between -1 and -2), the mean value of the corresponding y’s is closer to the mean of all y’s (0) than the x value is to the mean of all x’s (also 0), so it could be said that for these values there is a regression towards the mean, but these are exactly balanced by the other half of the x values, where there is a ‘regression’ of equal size away from the mean. So overall there is no regression. This conclusion is not affected if we measure each variable in sd units.
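All of (a) to (f) are easy to check numerically. A self-contained sketch (it rebuilds the same scattergram, this time by choosing a square for each dot and then placing the dot uniformly within it):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200_000

# Each dot falls in the upper-right or the lower-left square, so x and y
# share the same sign; within a square the two coordinates are independent.
s = rng.choice([-1.0, 1.0], size=N)
x = s * rng.uniform(0, 2, N)
y = s * rng.uniform(0, 2, N)

print(x.mean(), y.mean())                 # (a) both close to 0
print(x[x > 0].mean(), y[y > 0].mean())   # (b) both close to 1
print(x.std(), y.std())                   # (c) about 1.15 (= 2/sqrt(3))
print(np.cov(x, y)[0, 1])                 # (d) about 1
print(np.corrcoef(x, y)[0, 1])            # (e) about 0.75
print(y[x > 0].mean(), y[x < 0].mean())   # (f) about +1 and -1, whatever |x| is
```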
So it appears that we have correlation, but no regression to the mean. Of course, we can still formulate a ‘regression equation’ to predict the value of x given y or y given x. Since the sd’s of the two variables are the same, the regression coefficients are equal to the correlation coefficient. The predicted values of the dependent variables are always closer to their means than are the given values of the independent variables. So there is a predicted regression. But the actual observed values show no regression to the mean, as usually defined.
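The contrast between the predicted and the observed regression shows up in the same simulation (the band around x = 0.5 is just an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200_000
s = rng.choice([-1.0, 1.0], size=N)        # same two-square construction as above
x = s * rng.uniform(0, 2, N)
y = s * rng.uniform(0, 2, N)

slope, intercept = np.polyfit(x, y, 1)     # fitted regression line for y on x
print(slope)                               # about 0.75, equal to the correlation here

band = (x > 0.4) & (x < 0.6)               # x values near 0.5, as an example
print(slope * 0.5)                         # predicted y: about 0.37, closer to 0 than x
print(y[band].mean())                      # observed mean y: about 1.0, no regression
```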
I suppose one response to this puzzle, or paradox, is that the relationship between the variables in this case is not linear, so the standard Pearson formulae for linear regression and correlation are not appropriate. I would agree that the Pearson formulae are not ideal for this case, but I don’t see that in any strict sense they are invalid. The correlation does account for about half of the total variance, which is better than many correlations that are accepted as meaningful.
Perhaps a better response is that the ‘population’ of observations is really a combination of two different populations, within each of which the correlation is zero, but which have different means. It is known that a combination of populations with different means gives rise to a correlation sometimes described as ‘spurious’, or an ‘artifact’. However, real-life populations are also often a mixture of heterogeneous sub-populations, and it seems to be a matter of taste how far it is legitimate to combine them together.
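That reading is easy to confirm on the simulated scattergram: within each square the correlation is essentially zero, while the pooled correlation is around 0.75. A sketch of the check:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200_000
s = rng.choice([-1.0, 1.0], size=N)            # same two-square construction as above
x = s * rng.uniform(0, 2, N)
y = s * rng.uniform(0, 2, N)

print(np.corrcoef(x[s > 0], y[s > 0])[0, 1])   # within the upper-right square: ~ 0
print(np.corrcoef(x[s < 0], y[s < 0])[0, 1])   # within the lower-left square:  ~ 0
print(np.corrcoef(x, y)[0, 1])                 # pooled across both squares:    ~ 0.75
```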
Anyway, I thought the puzzle might be of interest, so I would welcome any comments. Of course, the example is a very simple one, but there may be more complicated real-life examples where there is less ‘regression to the mean’ than might be expected simply on the basis of correlation coefficients.
Godless comments:
The mistake here is in assuming that E[Y | X] = rX holds for arbitrary *non*-normal random variables X and Y. The conditional mean (the MMSE estimator) need not be a linear function of the measurement, i.e. need not equal rX, when X and Y are not jointly normal.
Reference on the MMSE (minimum mean-squared-error) estimator of Y given X. Note that the estimator of Y given X only takes the simple form rX when Y and X are jointly normal, standardized (mean 0, sd 1) random variables with correlation coefficient r. In the general bivariate normal case, E[Y | X] = aX + b, where a and b are more complicated terms. [1]
In the general case, where X and Y are (say) correlated gamma random variables, E[Y|X] need not take the form of a simple linear function. In general, E[Y|X] = f(X) is a nonlinear function of the measured variable X from which we are predicting the expected value of Y, and it may behave so that |f(X)| is GREATER than |X|. This is the opposite of the behavior of the linear formula when |r| is less than 1 (since |rX| is then less than |X| whenever X is nonzero). That is, it can violate the regression-to-the-mean rule, which is only guaranteed when |f(X)| < |X| for all nonzero X (in standardized units), of which E[Y|X] = rX with |r| less than 1 is a special case.
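Concretely, in the two-square example above the conditional mean can be written down exactly, and it shows exactly this behavior:

```latex
% Conditional expectation in the two-square example:
E[\,Y \mid X = x\,] \;=\;
\begin{cases}
  +1, & 0 < x \le 2,\\
  -1, & -2 \le x < 0,
\end{cases}
\qquad\text{a step function, not } rX .
% Since |E[Y | X = x]| = 1 > |x| whenever 0 < |x| < 1, the y's do not regress
% toward the mean there, even though the best *linear* predictor, 0.75x,
% always does.
```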
Reference on linear vs. nonlinear MMSE.
A further subtlety: E[Y|X] need not be the same as argmax P(Y|X). That is, the mean need not equal the mode in the conditional probability distribution.
This situation is purely mathematical and has nothing to do with Galton’s “fallacy”. Perhaps I will give a fuller discussion of this in a later post.
[1] They aren’t *that* complicated. You can easily remember them through the projection formula, as the set of zero-mean random variables is an inner product space and all the standard formulas apply (with the inner product (X,Y) = E[XY]).
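Spelled out (projecting Y − E[Y] onto X − E[X] in that inner product space), the coefficients are the standard linear-MMSE ones:

```latex
% Best linear (minimum mean-squared-error) predictor of Y from X:
\hat{Y} \;=\; aX + b,
\qquad
a \;=\; \frac{\operatorname{Cov}(X,Y)}{\operatorname{Var}(X)} \;=\; r\,\frac{\sigma_Y}{\sigma_X},
\qquad
b \;=\; E[Y] \;-\; a\,E[X].
% For standardized variables (mean 0, sd 1) this reduces to \hat{Y} = rX.
```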
Posted by David B at 03:10 AM