The Most Important Lesson You'll Ever Learn About Multiple Linear Regression Analysis
Gerard E. Dallal, Ph.D.
There are two main reasons for fitting a multiple linear regression: to predict an outcome, and to answer questions about the role of individual predictors.*
When a model is being developed for prediction, with only slight exaggeration, it doesn't matter how it was obtained or what variables are in it. If it turned out that the prevalence of a certain disease could be accurately predicted by newspaper sales of the previous week, we wouldn't worry about it. Instead, we'd monitor newspaper sales very closely.
Any number of techniques have been developed for building a predictive model. These include the stepwise procedures of forward selection regression, backwards elimination regression, and stepwise regression, along with looking at all possible regressions. In addition, there are the techniques of exploratory data analysis developed by John Tukey and others.
Questions about the role of individual predictors are different from questions of pure prediction. Unlike the prediction problem, where the model is generally not known in advance, identifying the role of a particular predictor generally involves a very few models carefully crafted at the start of the study. It usually involves comparing two models: one that includes the predictor being investigated and another that does not. The research question can usually be restated as whether the model including the predictor under study better predicts the outcome than the model that excludes it.
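One common way to carry out such a two-model comparison is a partial F test on the residual sums of squares of the two fits. The sketch below is only an illustration under stated assumptions: the data are simulated, the names `fit_rss` and `partial_f` are hypothetical helpers, and the test is meaningful only when both models were specified before looking at the data.

```python
import random

def fit_rss(columns, y):
    """Residual sum of squares from least squares of y on an intercept plus columns."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    # Normal equations (X'X) b = X'y, stored as an augmented p x (p + 1) matrix.
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for k in range(p):
        piv = max(range(k, p), key=lambda r: abs(A[r][k]))
        A[k], A[piv] = A[piv], A[k]
        for r in range(p):
            if r != k:
                f = A[r][k] / A[k][k]
                A[r] = [a - f * b for a, b in zip(A[r], A[k])]
    beta = [A[r][p] / A[r][r] for r in range(p)]
    fitted = [sum(b * x for b, x in zip(beta, row)) for row in X]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

def partial_f(reduced_cols, extra_cols, y):
    """F statistic for adding extra_cols to a model that already has reduced_cols."""
    rss_reduced = fit_rss(reduced_cols, y)
    rss_full = fit_rss(reduced_cols + extra_cols, y)
    q = len(extra_cols)                                # predictors under study
    df_resid = len(y) - (1 + len(reduced_cols) + q)    # residual df, full model
    return ((rss_reduced - rss_full) / q) / (rss_full / df_resid)

# Simulated data (an assumption for illustration): x2 truly contributes to y.
rng = random.Random(0)
n = 200
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [rng.gauss(0, 1) for _ in range(n)]
y = [1.0 + 2.0 * a + 0.5 * b + rng.gauss(0, 1) for a, b in zip(x1, x2)]

f_stat = partial_f([x1], [x2], y)
print(f"partial F for adding x2 to a model with x1: {f_stat:.2f}")
```

The statistic would be referred to an F distribution with `q` and `df_resid` degrees of freedom. A large value suggests that the model including the predictor fits better than the model that excludes it.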
There is other terminology that can be used to describe the two types of modeling.
The most important lesson you'll ever learn about multiple linear regression analysis is well-stated by Chris Chatfield in "Model Uncertainty, Data Mining and Statistical Inference", Journal of the Royal Statistical Society, Series A, 158 (1995), 419-486 (p 421),
It is "well known" to be "logically unsound and practically misleading" to make inference as if a model is known to be true when it has, in fact, been selected from the same data to be used for estimation purposes.
or, to put it another way,
One of the most serious but all-too-common MISUSES of inferential statistics is to
This issue is not new. Chatfield gives references to it dating back nearly 30 years. Yet many practicing data analysts do not fully appreciate the problem, as can be seen by looking at the published scientific literature.
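A small simulation can make the problem concrete. The sketch below is a deliberately simplified stand-in for stepwise selection, not the procedure of any particular package: the response and all 20 candidate predictors are pure noise, and the only "model building" is picking the single predictor most strongly correlated with the response. The function names, sample size, number of candidates, and the approximate critical value 2.01 (two-sided 5% level for t with 48 degrees of freedom) are all assumptions chosen for illustration.

```python
import math
import random

def correlation(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def t_from_r(r, n):
    """t statistic for the slope in a simple linear regression, from r."""
    return r * math.sqrt((n - 2) / (1 - r * r))

def false_positive_rates(n=50, n_candidates=20, n_sims=500, t_crit=2.01, seed=1):
    # t_crit is approximate and matches n = 50; change it if n changes.
    rng = random.Random(seed)
    hits_prespecified = 0
    hits_selected = 0
    for _ in range(n_sims):
        # No predictor is related to y: every "significant" result is false.
        y = [rng.gauss(0, 1) for _ in range(n)]
        xs = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(n_candidates)]
        t_values = [abs(t_from_r(correlation(x, y), n)) for x in xs]
        # Predictor named before seeing the data (always the first one).
        hits_prespecified += t_values[0] > t_crit
        # Predictor chosen because it looked best in these same data.
        hits_selected += max(t_values) > t_crit
    return hits_prespecified / n_sims, hits_selected / n_sims

prespecified_rate, selected_rate = false_positive_rates()
print(f"false positive rate, predictor specified in advance: {prespecified_rate:.3f}")
print(f"false positive rate, predictor selected from the data: {selected_rate:.3f}")
```

Under these assumptions, the test of the predictor specified in advance should reject at close to the nominal 5% rate, while the same test applied to the selected predictor rejects far more often (on the order of 1 - 0.95^20, about 0.64, if the 20 tests were independent). The test itself is unchanged; it is the selection from the same data that invalidates it.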
----------
*In previous versions of this note, I'd called this "questions about mechanism". However, that was a poor choice because mechanism is too closely linked to causality. This isn't about cause but, rather, about the role of a particular predictor in making a prediction.