factors and dummy variables in R - gov1000-list

30 Nov 2004

Hi Everyone,

The Ornstein data problem on problem set 8 requires you to deal with
right-hand-side variables that are categorical -- what are called
"factors" in R.
The way to deal with such variables is to create a series of dummy
variables that allow the expected value of y to shift across the
different categories of the categorical variable.

Suppose you have a categorical variable X that has 3 categories (A, B,
and C). Assuming there is an intercept in the model, the idea is to
use 2 dummy variables-- one of which is equal to 1 whenever X = B (0
otherwise) and another dummy variable that is equal to 1 whenever X =
C (0 otherwise). The coefficient on the dummy for B tells you how much
higher the mean of y is in category B than the baseline category (here
X = A) holding the other independent variables in the model constant.
The coefficient on the C dummy can be interpreted similarly. In
general if your X has k categories you will use k-1 dummy variables.
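
As a minimal sketch of this coding, the two dummies can be built by hand and compared with the columns R generates through model.matrix(). The toy vector X and the names dB, dC, and mm are made up for illustration and are not part of the problem set; the sketch also assumes R's default treatment contrasts for unordered factors.

```r
# Hypothetical toy data: a 3-category variable X (not from the problem set)
X <- factor(c("A", "B", "C", "A", "B", "C"))

# Hand-built dummies; A is the omitted baseline category
dB <- as.numeric(X == "B")
dC <- as.numeric(X == "C")

# The design matrix R builds from the factor under its default
# treatment contrasts: an intercept column plus k - 1 = 2 dummies
mm <- model.matrix(~ X)
print(mm)

# The hand-built dummies match the columns R created
stopifnot(all(mm[, "XB"] == dB), all(mm[, "XC"] == dC))
```

There is no column for A: with an intercept in the model, a third dummy would be perfectly collinear with the other columns.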

One of the *really* nice things about R is that if a categorical
variable is coded as a factor, R will automatically create the
appropriate dummy variables for you. Take a look at the following
highly contrived example:

> x <- factor(c("Pat", "Pat", "Pat", "Pat", "Sam", "Sam", "Sam", "Sam",
+               "Chris", "Chris", "Chris", "Chris"))
> y <- c(5, 5, 5, 5, 12, 12, 12, 12, 30, 30, 30, 30)
> data.frame(y, x)
    y     x
1   5   Pat
2   5   Pat
3   5   Pat
4   5   Pat
5  12   Sam
6  12   Sam
7  12   Sam
8  12   Sam
9  30 Chris
10 30 Chris
11 30 Chris
12 30 Chris

> lm.out <- lm(y ~ x)
> summary(lm.out)

Call:
lm(formula = y ~ x)

Residuals:
       Min         1Q     Median         3Q        Max
-1.634e-14  7.456e-31  1.140e-30  1.534e-30  1.717e-14

Coefficients:
              Estimate Std. Error    t value Pr(>|t|)
(Intercept)  3.000e+01  4.163e-15  7.206e+15   <2e-16 ***
xPat        -2.500e+01  5.888e-15 -4.246e+15   <2e-16 ***
xSam        -1.800e+01  5.888e-15 -3.057e+15   <2e-16 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 8.327e-15 on 9 degrees of freedom
Multiple R-Squared:     1,      Adjusted R-squared:     1
F-statistic: 9.596e+30 on 2 and 9 DF,  p-value: < 2.2e-16

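R orders factor levels alphabetically by default, so Chris is the
baseline category and the intercept (30) is the mean of y for Chris.
The other coefficients tell us that the average level of y is 25 units
lower for Pat than for Chris (30 - 25 = 5), and 18 units lower for Sam
than for Chris (30 - 18 = 12). We can verify this by looking at the
original data.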
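
One way to do that check is to compute the group means directly and compare them with the fitted coefficients. The sketch below re-creates the x and y from the example so it runs on its own. The names means, b, and x2 are introduced here for illustration, and relevel() is shown only as one way to pick a different baseline; the choice of "Pat" is arbitrary.

```r
# Same data as in the example above
x <- factor(rep(c("Pat", "Sam", "Chris"), each = 4))
y <- rep(c(5, 12, 30), each = 4)

# Levels are sorted alphabetically by default, so "Chris" comes first
# and serves as the baseline category
print(levels(x))

# Mean of y within each category
means <- sapply(split(y, x), mean)
print(means)

# Intercept = mean for Chris; the other coefficients are
# differences from the Chris mean
b <- coef(lm(y ~ x))
stopifnot(isTRUE(all.equal(b[["(Intercept)"]], means[["Chris"]])))
stopifnot(isTRUE(all.equal(b[["xPat"]], means[["Pat"]] - means[["Chris"]])))
stopifnot(isTRUE(all.equal(b[["xSam"]], means[["Sam"]] - means[["Chris"]])))

# Arbitrary choice for illustration: make Pat the baseline instead.
# The coefficients are then differences from the Pat mean.
x2 <- relevel(x, ref = "Pat")
print(coef(lm(y ~ x2)))
```

Changing the baseline changes the individual coefficients but not the fitted values; only the category that the others are compared against is different.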

Alison will also talk about dummy variables a bit today in section.

Hope this helps.

Best,
Kevin

------------------------------------------------------
Kevin Quinn
Assistant Professor
Department of Government and
Center for Basic Research in the Social Sciences
34 Kirkland Street
Harvard University
Cambridge, MA  02138