Phillip Y. Lipscy writes:
Thanks for the clarification. Now, my gut feeling about how to run a regression
with dummies would be simply to add extra columns that are called "Democratic"
and "Republican" with corresponding 1/0 values to the dataframe. This could be
done relatively easily w/o using the factor function.
Correct. Note that you need to be careful about what is going on with the
intercept. Recall Gary's last lecture. If you have dummies for both Democrat
and Republican, you will need to drop the intercept. Using a "-1" in the
formula for the call to lm is the way to do it.
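A minimal sketch of the hand-made dummy approach. The toy data frame below is invented for illustration (it is not the class data), and the column names `democratic` and `republican` are just assumptions about what you might call the new columns.

```r
# Toy data, invented for illustration only (not the class data set)
toy <- data.frame(
  d.perc = c(0.70, 0.65, 0.60, 0.40, 0.45),
  party  = c(1, 1, 1, -1, -1)  # +1 = Democrat won last time, -1 = Republican
)

# Hand-made 0/1 dummy columns (hypothetical names)
toy$democratic <- as.numeric(toy$party == 1)
toy$republican <- as.numeric(toy$party == -1)

# Both dummies are in the model, so the intercept is dropped with "-1"
fit <- lm(d.perc ~ -1 + democratic + republican, data = toy)
coef(fit)
```

Each coefficient is then just the mean of d.perc within that group. If the intercept is left in alongside both dummies, the columns are perfectly collinear and lm() reports NA for one of the coefficients.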
Like much of what we do in this class, factors are a powerful tool that seems
more trouble than it is worth at the start.
Presumably, the factor function makes things easier. How exactly do we connect
the factor function to our raw data? This is what I did so far:
p8 <- c("Democratic", "Republican")
p8f <- factor(p8)
levels(p8f)
[1] "Democratic" "Republican"
You need to double check that the levels were mapped correctly. How do you know
that the Democratic level got mapped to the Democratic code (of +1)?
I'm not quite sure what to do with this now. We could presumably use tapply()
with the p8 component of our dataframe as a variable inside... and then rbind
that back into the dataframe?
Much too much trouble! What follows is a little example of working with
factors. To simplify things, I just focus on the relationship between
Democratic percentage and party of the winner last time for 1974.
At the beginning, I do the analysis without factors.
dim(d.1974)
[1] 374 10
names(d.1974)
[1] "state" "district" "incumb" "dem" "rep" "year" "d.perc" "party" "region" "time"
summary(d.1974)
     state         district        incumb             dem              rep
 Min.   : 1.0   Min.   : 1.0   Min.   :-1.0000   Min.   : 10333   Min.   :  4399
 1st Qu.:21.0   1st Qu.: 3.0   1st Qu.:-1.0000   1st Qu.: 54458   1st Qu.: 34740
 Median :32.0   Median : 7.0   Median : 0.0000   Median : 70060   Median : 57316
 Mean   :35.7   Mean   :11.4   Mean   :-0.0107   Mean   : 71340   Mean   : 56494
 3rd Qu.:51.0   3rd Qu.:15.0   3rd Qu.: 1.0000   3rd Qu.: 86837   3rd Qu.: 76124
 Max.   :82.0   Max.   :98.0   Max.   : 1.0000   Max.   :156439   Max.   :130184
      year          d.perc          party              region     time
 Min.   :1974   Min.   :0.164   Min.   :-1.000   Not South:309   0:  0
 1st Qu.:1974   1st Qu.:0.449   1st Qu.:-1.000   South    : 65   2:  0
 Median :1974   Median :0.554   Median : 1.000                   4:374
 Mean   :1974   Mean   :0.572   Mean   : 0.241                   6:  0
 3rd Qu.:1974   3rd Qu.:0.706   3rd Qu.: 1.000                   8:  0
 Max.   :1974   Max.   :0.945   Max.   : 1.000
lm(d.perc ~ party, data = d.1974)
Call:
lm(formula = d.perc ~ party, data = d.1974)
Coefficients:
(Intercept) party
0.540 0.134
Be sure that you can interpret, in your own words, each of these
regressions. They all tell the same story, but each in its own way. The one
above says that, in a district that Republicans won last time (party = -1), the
best guess for the Democratic vote this time is 0.540 - 0.134 = 0.406, or about
41%.
Let me know if that doesn't make sense.
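The arithmetic behind that reading can be sketched directly from the printed coefficients. The values 0.540 and 0.134 are copied from the output above; the names on the vector are only labels added for illustration.

```r
# Coefficients copied from the lm() output above
b0 <- 0.540  # intercept
b1 <- 0.134  # slope on party (coded -1 / +1)

# Fitted Democratic share for each value of party
pred <- b0 + b1 * c(Republican = -1, Democrat = 1)
round(pred, 3)
```

So a district the Republicans won last time gets a fitted value of about 41%, and one the Democrats won gets about 67%.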
lm(d.perc ~ -1 + party, data = d.1974)
Call:
lm(formula = d.perc ~ -1 + party, data = d.1974)
Coefficients:
party
0.264
Note that the above regression is mostly nonsensical. Removing the intercept
(and thereby forcing the regression line to go through the origin) can often
give meaningless answers. You should understand why this makes less sense than
the first case.
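One way to see the problem is to compute what the no-intercept fit implies for each kind of district. The coefficient 0.264 is copied from the output above; the vector names are labels added for illustration.

```r
# Coefficient copied from the no-intercept lm() output above
b <- 0.264

# With no intercept, the fitted value is just b * party (party coded -1 / +1)
pred.noint <- b * c(Republican = -1, Democrat = 1)
pred.noint
```

The fitted Democratic share in a previously Republican district comes out negative, which cannot be a vote share.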
d.1974$party.factor <- as.factor(d.1974$party)
table(d.1974$party.factor)
-1 1
142 232
levels(d.1974$party.factor) <- c("Republican", "Democrat")
table(d.1974$party.factor)
Republican Democrat
142 232
So, I have created a factor and double checked that the assignment is fine.
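A slightly more explicit version of the same double check, as a sketch on a tiny invented vector (not the class data): state the code-to-label mapping in the call to factor() and then cross-tabulate the original codes against the new labels.

```r
# Toy codes, invented for illustration: -1 = Republican, +1 = Democrat
toy <- data.frame(party = c(1, 1, -1, 1, -1))

# Spell out which code gets which label instead of relying on level order
toy$party.factor <- factor(toy$party,
                           levels = c(-1, 1),
                           labels = c("Republican", "Democrat"))

# Every -1 should land under Republican and every +1 under Democrat
tab <- table(toy$party, toy$party.factor)
tab
```

Any non-zero count off the diagonal of that table would mean the labels were attached to the wrong codes.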
lm(d.perc ~ party.factor, data = d.1974)
Call:
lm(formula = d.perc ~ party.factor, data = d.1974)
Coefficients:
(Intercept) party.factorDemocrat
0.406 0.269
This regression tells the same story as the first. If party is Republican, then
the expected vote is, again, 41%. But if it is Democratic, it is 67%.
lm(d.perc ~ -1 + party.factor, data = d.1974)
Call:
lm(formula = d.perc ~ -1 + party.factor, data = d.1974)
Coefficients:
party.factorRepublican party.factorDemocrat
0.406 0.674
Now dropping the intercept and using factors often leads to the easiest
regression interpretation, as here.
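The reason the interpretation is so easy is that, with a single factor and no intercept, each coefficient is simply the mean of the response within that level. A sketch on invented toy data (not the class data) that compares the two:

```r
# Toy data, invented for illustration only
toy <- data.frame(
  d.perc = c(0.70, 0.65, 0.60, 0.40, 0.45),
  party  = c(1, 1, 1, -1, -1)
)
toy$party.factor <- factor(toy$party,
                           levels = c(-1, 1),
                           labels = c("Republican", "Democrat"))

# No-intercept regression on the factor
fit <- lm(d.perc ~ -1 + party.factor, data = toy)

# Plain group means, computed with tapply()
group.means <- tapply(toy$d.perc, toy$party.factor, mean)

coef(fit)
group.means
```

The two printouts should agree level by level, apart from the "party.factor" prefix that lm() adds to the coefficient names.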
###########################################
That's enough for one message. I'll answer your other questions in a later
message.
By the way (for everyone) please put separate questions in separate e-mails to
the list.
Dave
1f:
When we set up the factor variable, should we just use year, i.e. 1910, 1920,
etc.?
Yes.
Is the ordinal ranking all that matters?
Not sure I know what you mean by "ordinal" in this context.
I mean, isn't the use of 1900-1990 arbitrary here? Couldn't we use the Islamic
calendar or the Chinese calendar or start with the first year as t=0 or t=1 and
count up? Does that make a difference at all? i.e. so long as the years are
ordered correctly, does it matter what specific set of numbers we assign to the
years?
My gut feeling would be that using t=1 would make the betas easier to interpret,
since we really don't care about the fact that Christ was born 1900 years before
our data begins. Is there some reason why using 1900-1990 is preferable?
Basic Matrix question:
A^2 means A*A as opposed to squaring all the components within the matrix, right?
Thanks,
Phillip.
-------------------------------------------------
Phillip Y. Lipscy
Perkins Hall Room #129
35 Oxford Street
Cambridge, MA 02138
(617)493-4893
lipscy(a)fas.harvard.edu
Ph.D. Candidate
Harvard University, FAS, Department of Government
-------------------------------------------------
--
David Kane
Lecturer In Government
617-563-0122
dkane(a)latte.harvard.edu
Please avoid sending me Word or PowerPoint attachments.
See
http://www.fsf.org/philosophy/no-word-attachments.html