Partial Correlation Coefficients
Gerard E. Dallal, Ph.D.

Scatterplots, correlation coefficients, and simple linear regression coefficients are inter-related. The scatterplot displays the data. The correlation coefficient measures linear association between the variables. The regression coefficient describes the linear association through a number that gives the expected change in the response per unit change in the predictor.

The coefficients of a multiple regression equation give the change in response per unit change in a predictor when all other predictors are held fixed. This raises the question of whether there are analogues to the correlation coefficient and the scatterplot to summarize the relation and display the data after adjusting for the effects of other variables.

This note answers these questions and illustrates them using the crop yield example of Hooker reported by Kendall and Stuart in The Advanced Theory of Statistics, Vol. 2, 3rd ed. (Example 27.1). Neither Hooker nor Kendall & Stuart provide the raw data, so I have generated a set of random data with means, standard deviations, and correlations identical to those given in K&S. These statistics are sufficient for all of the methods discussed here (sufficient is a technical term meaning that nothing else about the data has any effect on the analysis; any data set with the same values of the sufficient statistics will produce the same results), so the random data are adequate.
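
As a sketch of how such a data set might be constructed, the Python fragment below (numpy assumed) draws random normals and rescales them so that the sample means, standard deviations, and correlations match specified targets exactly. The correlations are taken from the matrix below; the means and standard deviations are placeholders, not the values reported by K&S.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 20

    # Target correlation matrix from the output below (YIELD, RAIN, TEMP).
    R = np.array([[ 1.0,      0.80031, -0.39988],
                  [ 0.80031,  1.0,     -0.55966],
                  [-0.39988, -0.55966,  1.0    ]])

    # Placeholder means and SDs -- K&S report the real ones.
    means = np.array([28.0, 4.9, 590.0])
    sds   = np.array([4.4, 1.1, 85.0])

    # Draw raw normals, then force the sample moments to match exactly:
    X = rng.standard_normal((n, 3))
    X -= X.mean(axis=0)                                  # exact zero means
    L = np.linalg.cholesky(np.cov(X, rowvar=False))      # whiten the sample ...
    X = X @ np.linalg.inv(L).T
    X = X @ np.linalg.cholesky(R).T                      # ... then impose R
    X = X * sds + means                                  # exact SDs and means

    print(np.corrcoef(X, rowvar=False).round(5))         # reproduces R exactly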

The variables are yields of "seeds' hay" in cwt per acre, spring rainfall in inches, and the accumulated temperature above 42°F in the spring for an English area over 20 years. Scatterplots of the data suggest yield and rainfall are positively correlated, while yield and temperature are negatively correlated! This is borne out by the correlation matrix itself.

   Pearson Correlation Coefficients, N = 20 
          Prob > |r| under H0: Rho=0

          YIELD      RAIN      TEMP

YIELD   1.00000   0.80031  -0.39988
                   <.0001    0.0807

RAIN    0.80031   1.00000  -0.55966
         <.0001              0.0103

TEMP   -0.39988  -0.55966   1.00000
         0.0807    0.0103              
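
For readers reproducing this output in Python, scipy.stats.pearsonr returns the same pair of numbers for any two columns: the coefficient and the two-sided p-value under H0: rho = 0. The arrays below are hypothetical stand-ins for the actual observations.

    import numpy as np
    from scipy import stats

    # Hypothetical stand-ins for two of the 20-observation columns.
    rng = np.random.default_rng(0)
    rain = rng.normal(5, 1, 20)
    yld  = 20 + 3 * rain + rng.normal(0, 2, 20)

    r, p = stats.pearsonr(yld, rain)   # coefficient and two-sided p-value
    print(round(r, 5), round(p, 4))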

Just as the simple correlation coefficient between Y and X describes their joint behavior, the partial correlation describes the behavior of Y and X1 when X2..Xp are held fixed. The partial correlation between Y and X1 holding X2..Xp fixed is denoted rYX1|X2..Xp.

A partial correlation coefficient can be written in terms of simple correlation coefficients:

    rXY|Z = (rXY - rXZ rYZ) / sqrt[(1 - rXZ^2)(1 - rYZ^2)]

Thus, rXY|Z = rXY if X & Y are both uncorrelated with Z.
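
As a quick numerical check (numpy assumed), plugging the simple correlations from the matrix above into this formula reproduces the YIELD-TEMP partial correlation quoted in the next paragraph:

    import numpy as np

    r_yt, r_yr, r_rt = -0.39988, 0.80031, -0.55966   # simple correlations

    r_yt_given_r = (r_yt - r_yr * r_rt) / np.sqrt((1 - r_yr**2) * (1 - r_rt**2))
    print(round(r_yt_given_r, 5))                    # 0.09664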

A partial correlation between two variables can differ substantially from their simple correlation. Sign reversals are possible, too. For example, the partial correlation between YIELD and TEMPERATURE holding RAINFALL fixed is 0.09664. While it does not reach statistical significance (P = 0.694), the sample value is positive nonetheless.

The partial correlation between X & Y holding a set of variables fixed will have the same sign as the multiple regression coefficient of X when Y is regressed on X and the set of variables being held fixed. Also,

    rXY|list = t / sqrt(t^2 + df)

where t is the t statistic for the coefficient of X in the multiple regression of Y on X and the variables in the list, and df is the error degrees of freedom of that regression.
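
A one-line helper makes the relation concrete (numpy assumed; the t value below is a hypothetical figure chosen to be consistent with the partial correlation and P-value quoted above, not one reported in the source):

    import numpy as np

    def partial_r_from_t(t, df):
        """Partial correlation implied by a regression t statistic."""
        return t / np.sqrt(t**2 + df)

    # With n = 20 observations and 2 predictors, the error df is 20 - 3 = 17.
    print(round(partial_r_from_t(0.40, 17), 3))   # about 0.097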

Just as the simple correlation coefficient describes the data in an ordinary scatterplot, the partial correlation coefficient describes the data in the partial regression residual plot.

Let Y and X1 be the variables of primary interest and let X2..Xp be the variables held fixed. The partial regression residual plot is the plot of the residuals from regressing Y on X2..Xp against the residuals from regressing X1 on X2..Xp; the partial correlation between Y and X1 is the simple correlation between these two sets of residuals.

For example, the partial correlation of YIELD and TEMP adjusted for RAIN is the correlation between the residuals from regressing YIELD on RAIN and the residuals from regressing TEMP on RAIN. In this partial regression residual plot, the correlation is 0.09664. The regression coefficient of TEMP when the YIELD residuals are regressed on the TEMP residuals is 0.003636, identical to the coefficient of TEMP in the multiple regression equation for the original data set:

YIELD = 9.298850 + 3.373008 RAIN + 0.003636 TEMP
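
The sketch below (numpy assumed; the data are simulated stand-ins, not Hooker's) carries out the two residual regressions and confirms that the slope of the residual-on-residual regression equals the corresponding multiple regression coefficient:

    import numpy as np

    def residuals(y, x):
        """Residuals from an OLS regression of y on x (with intercept)."""
        slope, intercept = np.polyfit(x, y, 1)
        return y - (intercept + slope * x)

    # Simulated stand-ins for the 20 observations.
    rng = np.random.default_rng(42)
    rain = rng.normal(5, 1, 20)
    temp = 600 - 40 * rain + rng.normal(0, 60, 20)
    yld  = 9.3 + 3.4 * rain + 0.004 * temp + rng.normal(0, 2, 20)

    e_yld  = residuals(yld, rain)    # YIELD adjusted for RAIN
    e_temp = residuals(temp, rain)   # TEMP adjusted for RAIN

    partial_r   = np.corrcoef(e_yld, e_temp)[0, 1]
    slope_resid = np.polyfit(e_temp, e_yld, 1)[0]

    # The same coefficient from the multiple regression of YIELD on RAIN and TEMP.
    Z = np.column_stack([np.ones(20), rain, temp])
    b_temp = np.linalg.lstsq(Z, yld, rcond=None)[0][2]

    print(round(partial_r, 5), round(slope_resid, 6), round(b_temp, 6))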

Because the data are residuals, they are centered about zero, so the values do not resemble the original measurements. Perhaps this is an advantage: it keeps them from being misinterpreted as Y or X1 values "adjusted for X2..Xp".

While the regression of Y on X2..Xp seems reasonable, it is not uncommon to hear questions about whether X1 needs adjusting as well; that is, some propose comparing the residuals of Y on X2..Xp with X1 directly.

This approach has been suggested many times over the years. Lately, it has been used in the field of nutrition by Willett and Stampfer (AJE, 124(1986):17-22) to produce "calorie-adjusted nutrient intakes", which are the residuals obtained by regressing nutrient intakes on total energy intake. These adjusted intakes are used as predictors in other regression equations, yet total energy intake does not appear in those equations and the response is not adjusted for total energy intake. Willett and Stampfer recognize this but propose using calorie-adjusted intakes nonetheless. They suggest that "calorie-adjusted values in multivariate models will overcome the problem of high collinearity frequently observed between nutritional factors", but this is just an artifact of adjusting only some of the factors. The correlation between an adjusted factor and an unadjusted factor is always smaller in magnitude than the correlation between the two factors when both are adjusted.
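
A short check of that last claim (numpy assumed; the correlations are arbitrary): by the partial correlation formula above, the correlation between an adjusted X and an unadjusted Y works out to rXY|Z * sqrt(1 - rYZ^2), which can never exceed the fully adjusted correlation rXY|Z in magnitude.

    import numpy as np

    r_xy, r_xz, r_yz = 0.5, 0.6, 0.7   # arbitrary simple correlations

    # Both factors adjusted for Z: the partial correlation.
    r_both = (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))

    # Only X adjusted for Z: shrunk by a factor of sqrt(1 - r_yz^2).
    r_half = r_both * np.sqrt(1 - r_yz**2)

    print(abs(r_half) <= abs(r_both))   # True whenever |r_yz| < 1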

This method was first proposed, before the ready availability of computers, as a way to approximate a multiple regression with two independent variables (regress Y on X1, then regress the residuals on X2); it was given the name two-stage regression. Today, however, it is a mistake to use the approximation when the correct answer is easily obtained. If the goal is to report on two variables after adjusting for the effects of another set of variables, then both variables must be adjusted.
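
A small simulation (numpy assumed; the coefficients and data are invented for illustration) shows how the two-stage shortcut only approximates the multiple regression coefficients when the predictors are correlated:

    import numpy as np

    rng = np.random.default_rng(1)
    n = 200
    x1 = rng.normal(size=n)
    x2 = 0.6 * x1 + rng.normal(size=n)            # predictors deliberately correlated
    y  = 2.0 * x1 + 1.0 * x2 + rng.normal(size=n)

    # Correct multiple regression of y on x1 and x2.
    Z = np.column_stack([np.ones(n), x1, x2])
    b_full = np.linalg.lstsq(Z, y, rcond=None)[0][1:]

    # Two-stage shortcut: regress y on x1 alone, then the residuals on x2.
    s1, i1 = np.polyfit(x1, y, 1)
    e  = y - (i1 + s1 * x1)
    s2 = np.polyfit(x2, e, 1)[0]

    print(b_full.round(2), round(s1, 2), round(s2, 2))   # the shortcut's estimates differ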


Copyright © 2001 Gerard E. Dallal