Partial Correlation Coefficients

Gerard E. Dallal, Ph.D.

Scatterplots, correlation coefficients, and simple linear regression coefficients are inter-related. The scatterplot displays the data. The correlation coefficient measures linear association between the variables. The regression coefficient describes the linear association through a number that gives the expected change in the response per unit change in the predictor.

The coefficients of a multiple regression equation give the change in response per unit change in a predictor when all other predictors are held fixed. This raises the question of whether there are analogues to the correlation coefficient and the scatterplot to summarize the relation and display the data after adjusting for the effects of other variables.

This note answers these questions and illustrates them by using the crop yield example of Hooker reported by Kendall and Stuart in their Advanced Theory of Statistics, Vol. 2, 3rd ed. (example 27.1). Neither Hooker nor Kendall & Stuart provide the raw data, so I have generated a set of random data with means, standard deviations, and correlations identical to those given in K&S. These statistics are sufficient for all of the methods that will be discussed here, so the random data will be adequate. (Sufficient is a technical term meaning that nothing else about the data has any effect on the analysis. Any data set with the same values of the sufficient statistics will produce these results.)

The variables are yields of "seeds' hay" in cwt per acre, spring rainfall in inches and the accumulated temperature above 42 F in the spring for an English area over 20 years. The plots suggest yield and rainfall are positively correlated, while yield and temperature are negatively correlated! This is borne out by the correlation matrix itself.

   Pearson Correlation Coefficients, N = 20 
          Prob > |r| under H0: Rho=0

          YIELD      RAIN      TEMP

YIELD   1.00000   0.80031  -0.39988
                   <.0001    0.0807

RAIN    0.80031   1.00000  -0.55966
         <.0001              0.0103

TEMP   -0.39988  -0.55966   1.00000
         0.0807    0.0103

Just as the simple correlation coefficient between Y and X describes their joint behavior, the partial correlation describes the behavior of Y and X₁ when X₂..X_p are held fixed. The partial correlation between Y and X₁ holding X₂..X_p fixed is denoted $r_{YX_1|X_2 \ldots X_p}$.

A partial correlation coefficient can be written in terms of simple correlation coefficients:

$$r_{XY|Z} = \frac{r_{XY} - r_{XZ}\, r_{YZ}}{\sqrt{(1 - r_{XZ}^2)(1 - r_{YZ}^2)}}$$

Thus, $r_{XY|Z} = r_{XY}$ if X & Y are both uncorrelated with Z.

A partial correlation between two variables can differ substantially from their simple correlation. Sign reversals are possible, too. For example, the partial correlation between YIELD and TEMPERATURE holding RAINFALL fixed is 0.09664. While it does not reach statistical significance (P = 0.694), the sample value is positive nonetheless.
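The sign reversal can be checked by hand from the correlation matrix alone. The sketch below assumes the standard first-order formula, r_XY|Z = (r_XY − r_XZ·r_YZ) / sqrt((1 − r_XZ²)(1 − r_YZ²)), and plugs in the three simple correlations from the table. The helper name `partial_corr` is made up for illustration.

```python
import math

def partial_corr(r_xy, r_xz, r_yz):
    """First-order partial correlation of X and Y holding Z fixed
    (hypothetical helper, standard textbook formula)."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))

# Simple correlations copied from the matrix above.
r_yield_rain = 0.80031
r_yield_temp = -0.39988
r_rain_temp = -0.55966

# X = YIELD, Y = TEMP, Z = RAIN
r_yield_temp_given_rain = partial_corr(r_yield_temp, r_yield_rain, r_rain_temp)
print(round(r_yield_temp_given_rain, 5))
```

The simple correlation is negative (−0.39988), but once the product r_YIELD,RAIN × r_RAIN,TEMP is subtracted the numerator turns positive, and the result should agree with the 0.09664 quoted above up to rounding.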

The partial correlation between X & Y holding a set of variables fixed will have the same sign as the multiple regression coefficient of X when Y is regressed on X and the set of variables being held fixed. Also,

$$r_{XY|\text{list}} = \frac{t}{\sqrt{t^2 + \text{residual df}}}$$

where t is the t statistic for the coefficient of X in the multiple regression of Y on X and the variables in the list, and the residual df is the residual degrees of freedom of that regression.
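As a rough numerical check on the crop yield example, the relation between the partial correlation and t can be inverted to give t = r·sqrt(df / (1 − r²)). The sketch below assumes the residual degrees of freedom are n minus the number of fitted coefficients (20 − 3 = 17 for an intercept, RAIN, and TEMP). The variable names are illustrative.

```python
import math

n = 20                    # years of data
n_coefficients = 3        # intercept, RAIN, TEMP (assumed model)
resid_df = n - n_coefficients

r_partial = 0.09664       # partial correlation of YIELD and TEMP given RAIN

# t statistic implied by the partial correlation
t = r_partial * math.sqrt(resid_df / (1 - r_partial**2))

# Going back the other way recovers the partial correlation
r_back = t / math.sqrt(t**2 + resid_df)

print(round(t, 3), round(r_back, 5))
```

A t of about 0.4 on 17 degrees of freedom is in line with the non-significant P value reported for this coefficient.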

Just as the simple correlation coefficient describes the data in an ordinary scatterplot, the partial correlation coefficient describes the data in the partial regression residual plot.

Let Y and X₁ be the variables of primary interest and let X₂..X_p be the variables held fixed.

First, calculate the residuals after regressing Y on X₂..X_p. These are the parts of the Ys that cannot be predicted by X₂..X_p.
Then, calculate the residuals after regressing X₁ on X₂..X_p. These are the parts of the X₁s that cannot be predicted by X₂..X_p.
The partial correlation coefficient between Y and X₁ adjusted for X₂..X_p is the correlation between these two sets of residuals.
The regression coefficient when the Y residuals are regressed on the X₁ residuals is equal to the regression coefficient of X₁ in the multiple regression equation when Y is regressed on the entire set of predictors.
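The steps above can be sketched in a few lines of plain Python for the case of a single variable held fixed. The data here are simulated purely for illustration (they are not Hooker's data), and the helper names `sxy`, `corr`, and `residuals` are made up for this sketch.

```python
import math
import random

def mean(v):
    return sum(v) / len(v)

def sxy(a, b):
    """Sum of cross products of deviations from the means."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b))

def corr(a, b):
    return sxy(a, b) / math.sqrt(sxy(a, a) * sxy(b, b))

def residuals(y, x):
    """Residuals from the simple regression of y on x (with intercept)."""
    slope = sxy(x, y) / sxy(x, x)
    intercept = mean(y) - slope * mean(x)
    return [yi - intercept - slope * xi for xi, yi in zip(x, y)]

# Simulated data: x2 is held fixed, x1 is the predictor of interest.
random.seed(1)
n = 20
x2 = [random.gauss(0, 1) for _ in range(n)]
x1 = [-0.6 * b + random.gauss(0, 1) for b in x2]
y = [2.0 + 3.0 * b + 0.5 * a + random.gauss(0, 1) for a, b in zip(x1, x2)]

# Steps 1 and 2: remove the part of y and of x1 predictable from x2.
e_y = residuals(y, x2)
e_x1 = residuals(x1, x2)

# Step 3: the partial correlation is the correlation of the residuals.
partial_from_residuals = corr(e_y, e_x1)

# Same quantity from the simple correlations.
r_yx1, r_yx2, r_x1x2 = corr(y, x1), corr(y, x2), corr(x1, x2)
partial_from_formula = (r_yx1 - r_yx2 * r_x1x2) / math.sqrt(
    (1 - r_yx2**2) * (1 - r_x1x2**2))

# Slope of the y residuals on the x1 residuals ...
slope_from_residuals = sxy(e_x1, e_y) / sxy(e_x1, e_x1)

# ... versus the coefficient of x1 in the regression of y on x1 and x2
# (closed-form solution of the two-predictor normal equations).
slope_multiple = (sxy(x2, x2) * sxy(x1, y) - sxy(x1, x2) * sxy(x2, y)) / (
    sxy(x1, x1) * sxy(x2, x2) - sxy(x1, x2) ** 2)

print(partial_from_residuals, partial_from_formula)
print(slope_from_residuals, slope_multiple)
```

Each printed pair should agree up to floating-point error, whatever seed is used.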

For example, the partial correlation of YIELD and TEMP adjusted for RAIN is the correlation between the residuals from regressing YIELD on RAIN and the residuals from regressing TEMP on RAIN. In this partial regression residual plot, the correlation is 0.09664. The regression coefficient of TEMP when the YIELD residuals are regressed on the TEMP residuals is 0.003636. The multiple regression equation for the original data set is

YIELD = 9.298850 + 3.373008 RAIN + 0.003636 TEMP

Because the data are residuals, they are centered around zero. The values, then, are not similar to the original values. However, perhaps this is an advantage. It stops them from being misinterpreted as Y or X₁ values "adjusted for X₂..X_p".

While the regression of Y on X₂..X_p seems reasonable, it is not uncommon to hear questions about adjusting X₁; that is, some propose comparing the residuals of Y on X₂..X_p with X₁ directly.

This approach has been suggested many times over the years. Lately, it has been used in the field of nutrition by Willett and Stampfer (AJE, 124(1986):17-22) to produce "calorie-adjusted nutrient intakes", which are the residuals obtained by regressing nutrient intakes on total energy intake. These adjusted intakes are used as predictors in other regression equations. However, total energy intake does not appear in the equations and the response is not adjusted for total energy intake. Willett and Stampfer recognize this, but propose using calorie-adjusted intakes nonetheless. They suggest "calorie-adjusted values in multivariate models will overcome the problem of high-collinearity frequently observed between nutritional factors", but this is just an artifact of adjusting only some of the factors. The correlation between an adjusted factor and an unadjusted factor is always smaller in magnitude than the correlation between two adjusted factors.
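The effect of adjusting only one of two variables can be seen with the crop yield correlations. The sketch below works in standardized units, where the residual of a variable regressed on RAIN has variance 1 − r², and compares the correlation when both YIELD and TEMP are adjusted for RAIN with the correlations when only one of them is. The one-sided quantity is usually called a semipartial (or part) correlation. The variable names are illustrative.

```python
import math

# Simple correlations from the matrix given earlier.
r_yield_rain = 0.80031
r_yield_temp = -0.39988
r_rain_temp = -0.55966

# Covariance between the adjusted variables in standardized units.
# It is the same whether one or both variables are adjusted for RAIN.
numerator = r_yield_temp - r_yield_rain * r_rain_temp

# Both YIELD and TEMP adjusted for RAIN (the partial correlation).
both_adjusted = numerator / math.sqrt(
    (1 - r_yield_rain**2) * (1 - r_rain_temp**2))

# Only YIELD adjusted, compared with raw TEMP.
only_yield_adjusted = numerator / math.sqrt(1 - r_yield_rain**2)

# Only TEMP adjusted, compared with raw YIELD.
only_temp_adjusted = numerator / math.sqrt(1 - r_rain_temp**2)

print(both_adjusted, only_yield_adjusted, only_temp_adjusted)
```

Because the one-sided versions divide the same numerator by a larger standard deviation, their magnitude cannot exceed that of the partial correlation.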

This method was first proposed before the ready availability of computers as a way to approximate multiple regression with two independent variables (regress Y on X₁, regress the residuals on X₂) and was given the name two-stage regression. Today, however, it is a mistake to use the approximation when the correct answer is easily obtained. If the goal is to report on two variables after adjusting for the effects of another set of variables, then both variables must be adjusted.
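A small simulation gives a feel for how far the two-stage approximation can drift from the multiple regression answer when the predictors are correlated. The data, coefficients, and helper names below are invented for illustration only.

```python
import random

def mean(v):
    return sum(v) / len(v)

def sxy(a, b):
    """Sum of cross products of deviations from the means."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b))

def simple_fit(y, x):
    """Intercept and slope of the simple regression of y on x."""
    slope = sxy(x, y) / sxy(x, x)
    return mean(y) - slope * mean(x), slope

# Simulated data with correlated predictors x1 and x2.
random.seed(2)
n = 200
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [0.7 * a + random.gauss(0, 1) for a in x1]
y = [1.0 + 2.0 * a + 1.5 * b + random.gauss(0, 1) for a, b in zip(x1, x2)]

# Two-stage regression: regress y on x1, then the residuals on raw x2.
intercept1, slope1 = simple_fit(y, x1)
stage1_resid = [yi - intercept1 - slope1 * a for a, yi in zip(x1, y)]
_, two_stage_b2 = simple_fit(stage1_resid, x2)

# Multiple regression coefficient of x2 (two-predictor normal equations).
s11, s22, s12 = sxy(x1, x1), sxy(x2, x2), sxy(x1, x2)
multiple_b2 = (s11 * sxy(x2, y) - s12 * sxy(x1, y)) / (s11 * s22 - s12**2)

# Squared correlation between the two predictors.
r12_sq = s12**2 / (s11 * s22)

print(two_stage_b2, multiple_b2, multiple_b2 * (1 - r12_sq))
```

In this setup the second-stage slope works out to the multiple regression coefficient multiplied by (1 − r²), where r is the correlation between X₁ and X₂, so the two agree only when the predictors are uncorrelated.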