The Mechanics of Categorical Variables With More Than Two Categories
Gerard E. Dallal, Ph.D.
Categorical variables with only two categories can be included in a multiple regression equation without introducing complications. As already noted, such a predictor specifies a regression surface composed of two parallel hyperplanes. The sign of the regression coefficient determines which plane lies above the other, while the magnitude of the coefficient determines the distance between them.
When a categorical variable containing more than two categories is placed in a regression model, the coding places specific constraints on the estimated effects. This can be seen by generalizing the regression model for the t test to three groups. Consider the simple linear regression model
$$Y = \beta_0 + \beta_1 X + \varepsilon$$
where X holds the numerical codes of the categories (say 1, 2, and 3).
The model forces a specific ordering on the predicted values. With equally spaced codes, the predicted value for the second category must be exactly halfway between the predicted values for the first and third categories. However, category labels are usually chosen arbitrarily. There is no reason why the group with the middle code can't be the one with the largest or smallest mean value. If the goal is to decide whether the categories are different, a model that treats a categorical variable as though its numerical codes were really numbers is the wrong model.
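The constraint can be seen numerically. The following is a minimal Python sketch using made-up data (three observations per group, coded 1, 2, 3, with the group coded 2 given the largest mean on purpose); the function name fit_line is an illustrative choice, not part of any library.

```python
# Hypothetical data: groups coded 1, 2, 3; the group coded 2 has the largest mean.
codes = [1, 1, 1, 2, 2, 2, 3, 3, 3]
y = [10.0, 11.0, 12.0, 19.0, 20.0, 21.0, 13.0, 14.0, 15.0]


def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1*x (closed form)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    return b0, b1


b0, b1 = fit_line(codes, y)

# Predicted value for each code under the straight-line model.
predicted = {c: b0 + b1 * c for c in (1, 2, 3)}

# Observed mean of each group, for comparison.
group_means = {
    c: sum(yi for ci, yi in zip(codes, y) if ci == c) / codes.count(c)
    for c in (1, 2, 3)
}

print(predicted)    # {1: 13.5, 2: 15.0, 3: 16.5}
print(group_means)  # {1: 11.0, 2: 20.0, 3: 14.0}
```

The fitted value for code 2 sits exactly midway between the fitted values for codes 1 and 3, even though the observed mean of that group is the largest of the three.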
One way to decide whether g categories are not all the same is to create a set of g-1 indicator variables. Arbitrarily choose g-1 categories and, for each category, define one of the indicator variables to be 1 if the observation is from that category and 0 otherwise. For example, suppose X takes on the values A, B, or C. Create the variables X1 and X2, where X1 = 1 if the categorical variable is A and X2 = 1 if the categorical variable is B (each is 0 otherwise, so category C has X1 = X2 = 0), as in
X    X1   X2
A    1    0
B    0    1
A    1    0
C    0    0
and so on...
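The coding shown in the table can be produced mechanically. The following is a minimal Python sketch; the helper name indicator_columns and its reference argument are illustrative choices rather than any library's API, and C is taken as the category left without an indicator, as in the example.

```python
def indicator_columns(values, reference):
    """Return one 0/1 column per category, skipping the reference category."""
    categories = sorted(set(values))
    return {
        cat: [1 if v == cat else 0 for v in values]
        for cat in categories
        if cat != reference
    }


# The four observations listed in the table.
x = ["A", "B", "A", "C"]

cols = indicator_columns(x, reference="C")
x1, x2 = cols["A"], cols["B"]

# Print the rows in the same layout as the table.
print("X", "X1", "X2")
for row in zip(x, x1, x2):
    print(*row)
```

With g = 3 categories this yields g-1 = 2 columns; an observation from C is identified by both indicators being 0.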
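The regression model is now
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon$$
Here the intercept is the expected response for category C, and the coefficients of X1 and X2 are the differences of categories A and B from category C.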
The hypothesis of no differences between groups can be tested by applying the extra sum of squares principle to the set (X1,X2). This is what ANalysis Of VAriance (ANOVA) routines do automatically.
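The extra sum of squares test for (X1,X2) can be sketched as follows. This is an illustration under stated assumptions, not a definitive implementation: the data are made up (three observations in each of A, B, and C), NumPy's least squares routine stands in for whatever regression software is used, and the helper name residual_ss is hypothetical. The one-way ANOVA F ratio is computed directly from the group means for comparison.

```python
import numpy as np

# Hypothetical data: three observations in each of the categories A, B, C.
groups = np.array(["A"] * 3 + ["B"] * 3 + ["C"] * 3)
y = np.array([10.0, 11.0, 12.0, 19.0, 20.0, 21.0, 13.0, 14.0, 15.0])

# Indicator variables; C is the category with X1 = X2 = 0.
x1 = (groups == "A").astype(float)
x2 = (groups == "B").astype(float)


def residual_ss(design, response):
    """Fit by least squares; return (residual sum of squares, coefficients)."""
    coef, *_ = np.linalg.lstsq(design, response, rcond=None)
    resid = response - design @ coef
    return float(resid @ resid), coef


n = len(y)
ones = np.ones(n)

# Reduced model: intercept only. Full model: intercept plus X1 and X2.
rss_reduced, _ = residual_ss(np.column_stack([ones]), y)
rss_full, coef = residual_ss(np.column_stack([ones, x1, x2]), y)

extra_df = 2      # number of indicators added (g - 1)
resid_df = n - 3  # n minus the number of coefficients in the full model
f_stat = ((rss_reduced - rss_full) / extra_df) / (rss_full / resid_df)

# One-way ANOVA F ratio computed directly from the group means.
labels = ["A", "B", "C"]
grand_mean = y.mean()
ss_between = sum(
    (groups == g).sum() * (y[groups == g].mean() - grand_mean) ** 2 for g in labels
)
ss_within = sum(((y[groups == g] - y[groups == g].mean()) ** 2).sum() for g in labels)
f_anova = (ss_between / (len(labels) - 1)) / (ss_within / (n - len(labels)))

print(np.round(coef, 6))                      # intercept, coefficient of X1, coefficient of X2
print(round(f_stat, 6), round(f_anova, 6))    # the two F ratios agree
```

The F ratio would be referred to an F distribution with 2 and n-3 degrees of freedom. For these data the fitted intercept equals the mean of category C, and the two F ratios coincide, which is the sense in which an ANOVA routine carries out this test automatically.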