The Most Important Lesson You'll Ever Learn About Multiple Linear Regression Analysis

Gerard E. Dallal, Ph.D.

There are two main reasons for fitting a multiple linear regression:

prediction and
understanding the contribution of a particular predictor^*.

When a model is being developed for prediction, then, with only slight exaggeration, it doesn't matter how it was obtained or what variables are in it. If it turned out that the prevalence of a certain disease could be accurately predicted by newspaper sales of the previous week, we wouldn't worry about it. Instead, we'd monitor newspaper sales very closely.

Any number of techniques have been developed for building a predictive model. These include the stepwise procedures of forward selection regression, backwards elimination regression, and stepwise regression, along with looking at all possible regressions. In addition, there are the techniques of exploratory data analysis developed by John Tukey and others.

Questions about the role of individual predictors are different from questions of pure prediction. Unlike the prediction problem, where the model is generally not known in advance, identifying the role of a particular predictor generally involves a very few models carefully crafted at the start of the study. It usually involves comparing two models: one that includes the predictor being investigated and another that does not. The research question can usually be restated as whether the model including the predictor under study better predicts the outcome than the model that excludes it.

If the analysis involves observational data, the models can be used to determine whether the predictor is associated with the response.
If the analysis involves data from a randomized trial, the models can be used to determine whether the predictor affects the outcome. (For example, the predictor might be 0/1 depending on whether a subject receives placebo or the active treatment.)
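
The two-model comparison described above can be sketched as a partial F test. What follows is only an illustration under stated assumptions: the data are simulated, the covariate name (age), the 0/1 treatment indicator, the effect sizes, and the helper names fit_rss and nested_f are all made up for this example, and the least-squares fit is a bare-bones normal-equations solver rather than anything a statistics package would use.

```python
import random

def fit_rss(X, y):
    """Least-squares fit via the normal equations; returns the residual sum of squares."""
    p = len(X[0])
    # Augmented system [X'X | X'y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(p)]
         + [sum(row[i] * yi for row, yi in zip(X, y))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        for r in range(p):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    beta = [A[i][p] / A[i][i] for i in range(p)]
    return sum((yi - sum(b * x for b, x in zip(beta, row))) ** 2
               for row, yi in zip(X, y))

def nested_f(X_reduced, X_full, y):
    """Partial F statistic comparing a reduced model with a full model that contains it."""
    rss_red, rss_full = fit_rss(X_reduced, y), fit_rss(X_full, y)
    q = len(X_full[0]) - len(X_reduced[0])   # number of predictors being tested
    df_resid = len(y) - len(X_full[0])       # residual df of the full model
    f_stat = ((rss_red - rss_full) / q) / (rss_full / df_resid)
    return f_stat, q, df_resid, rss_red, rss_full

# Simulated (made-up) data: outcome depends on age and on a 0/1 treatment indicator.
random.seed(1)
n = 63
age = [random.gauss(50, 10) for _ in range(n)]
treat = [i % 2 for i in range(n)]
y = [1.0 + 0.05 * a + 2.0 * t + random.gauss(0, 1) for a, t in zip(age, treat)]

# Both models are written down BEFORE looking at the data.
X_reduced = [[1.0, a] for a in age]                  # intercept + age
X_full = [[1.0, a, t] for a, t in zip(age, treat)]   # intercept + age + treatment

f_stat, q, df_resid, rss_red, rss_full = nested_f(X_reduced, X_full, y)
# The 5% critical value of F with (1, 60) df is approximately 4.00.
print(f"F = {f_stat:.2f} on ({q}, {df_resid}) df")
```

The reference distribution for this F statistic is valid only because both models were fixed in advance. Nothing in the arithmetic changes if the full model was instead found by searching the data, which is exactly why the misuse is easy to commit.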

There is other terminology that can be used to describe the two types of modeling.

The prediction problem might also be called model building to emphasize that the analyst, at the outset, is not sure of what will be included in the model, that is, the form of the model is unknown.
The problem of identifying the role of a specific predictor might be called inferential modeling since its purpose is to conduct formal statistical inference about a specific predictor, that is, the form of the model is assumed known but there is some uncertainty about the coefficients.

The most important lesson you'll ever learn about multiple linear regression analysis is well-stated by Chris Chatfield in "Model Uncertainty, Data Mining and Statistical Inference", Journal of the Royal Statistical Society, Series A, 158 (1995), 419-486 (p 421),

It is "well known" to be "logically unsound and practically misleading" to make inference as if a model is known to be true when it has, in fact, been selected from the same data to be used for estimation purposes.

or, to put it another way,

NEVER MIX THE TWO APPROACHES!

One of the most serious but all-too-common MISUSES of inferential statistics is to

take a model that was developed through exploratory model building and
subject it to the same sorts of statistical tests that are used to validate a model that was specified in advance.

If a model is built from the ground up, there are some things that might be said about its overall predictive capability, but there is little that can be said about the individual components. If you find a paper in which the authors use a model building technique such as stepwise regression and treat the resulting models and coefficients as though the model had been specified in advance, be afraid, be very afraid!
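
A small simulation may make the danger concrete. The sketch below is not a full stepwise procedure; it performs only a single step of forward selection (choosing the one candidate predictor most correlated with the outcome), and every setting in it (50 observations, 20 candidate predictors, 500 repetitions, the function names) is an assumption made for illustration. The outcome and all of the predictors are pure noise, so any "significant" result is a false positive.

```python
import math
import random

def corr(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def t_from_r(r, n):
    """t statistic for the slope in a one-predictor regression, computed from r."""
    return r * math.sqrt((n - 2) / (1 - r * r))

def simulate(n_sims=500, n=50, k=20, t_crit=2.0106, seed=0):
    """Return false-positive rates for a pre-specified and a data-selected predictor.

    t_crit is the two-sided 5% critical value of t with n - 2 = 48 df.
    """
    rng = random.Random(seed)
    hits_prespecified = hits_selected = 0
    for _ in range(n_sims):
        y = [rng.gauss(0, 1) for _ in range(n)]                       # noise outcome
        xs = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(k)]  # noise predictors
        rs = [corr(x, y) for x in xs]
        # Inferential modeling: the predictor to test (the first) was named in advance.
        if abs(t_from_r(rs[0], n)) > t_crit:
            hits_prespecified += 1
        # Model building: pick the best-looking predictor, then test it anyway.
        best = max(rs, key=abs)
        if abs(t_from_r(best, n)) > t_crit:
            hits_selected += 1
    return hits_prespecified / n_sims, hits_selected / n_sims

rate_prespecified, rate_selected = simulate()
print(f"pre-specified predictor declared significant: {rate_prespecified:.3f}")
print(f"selected predictor declared significant:      {rate_selected:.3f}")
```

In theory the pre-specified test should reject about 5% of the time, while the selected predictor should be declared significant in roughly 1 - 0.95^20, or about 64%, of the repetitions, even though no predictor has any relationship with the outcome. The test statistic is the same in both cases; only the way the predictor was chosen differs.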

This issue is not new. Chatfield gives references to it dating back nearly 30 years. Yet many practicing data analysts do not fully appreciate the problem, as can be seen by looking at the published scientific literature.

----------

^*In previous versions of this note, I'd called this "questions about mechanism". However, that was a poor choice because mechanism is too closely linked to causality. This isn't about cause but, rather, about the role of a particular predictor in making a prediction.