If I'm running R in an ESS window, what's the command to flush output to
the next prompt? I have this long output and it's taken 15 minutes so far
and still going strong...
thanks
Olivia.
A basic question: when reporting our results from the regression, should we be
interpreting the coefficients of the control variables? If yes, how? When
interpreting the key explanatory variable, we say something like a one unit
increase in this variable results in blah blah controlling for variables X and
Y. Can we say something similar about the control variables?
- Nirmala
--
Nirmala Ravishankar
PhD Candidate, Government Department
Harvard University
Perkins Hall #210
35 Oxford Street
Cambridge, MA 02138
Tel: (617) 493 3460
Ok, suppose your model is
E(Y|X) = a + bX + cZ,
where
Y is starting salary in dollars
X is education in years
Z is parent's income in dollars
and so (if the assumptions hold), when X goes up by one _year_, E(Y|X)
goes up by b holding constant Z.
Ok, so now suppose you are worried that b is not constant over the
observations as the above model assumes. In particular, suppose there is
sex discrimination and so the effect of education is bigger for men than
for women. In that case, the above specification is wrong, and b is not the
average causal effect for men and women. At best, the standard error of b
will be too small because there is variation in it that is ignored by the
model. At worst, if b and X are correlated over the observations, the
least squares estimate of b will be biased (note that I'm asserting this
and you haven't seen the proof, but it's true).
But in this case, we know the cause of the variation (according to our
theory anyway); it's sex. So let's model it. Let's write this equation,
where I'm putting in "_i" to be the subscript i just to be clear what I'm
talking about:
E(Y_i|X) = a + b_iX_i + cZ_i,
note the coefficient on X now varies over i. (forget for a moment that
you have no idea how to estimate this; we'll fix that in a minute.)
ok, so now let's add a 2nd equation:
b_i = d + fS_i, where d and f are constant coefficients and S is sex (1
for males and 0 for females). this equation follows our new theory and
lets b_i vary as a function of sex. So, in particular b_i=d for females
and b_i=d+f for males.
the 2 equations above are called the structural model. when we estimate
it (i'll explain shortly) we get estimates only of a,c,d,f. Once you have
those, it's easy to interpret. When X (education) goes up by one year, Y
(starting salary) goes up by $d on average for females and $(d+f) for
males, holding Z constant. (Note that this can be easily extended to S
representing a continuous variable too.)
ok, so the only question left is how to estimate it. to do it, substitute
the 2nd eqn into the first equation:
E(Y_i|X) = a + b_iX_i + cZ_i
         = a + (d+fS_i)X_i + cZ_i
         = a + dX_i + f(S_i*X_i) + cZ_i
This is known as the reduced form equation. so we have an interaction, but
ignore that. to estimate this equation you regress Y on a constant term,
X, the product of S and X, and Z. this gives unbiased estimates of a, d,
f, and c, which is what we need to interpret everything.
So if someone asks you whether f is significant, a good answer is 'who
cares?' It is true that f is the difference between the effects for
males and for females, but that's beside the point. If you think hard
about the regressions you run, you can come up with a lot of interesting
effects by adding additional equations (which produces more complicated
interactions of course). To interpret them, ignore the reduced form and
only look at the structural model. (To estimate, use the reduced form and
ignore the structural model.)
Gary
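A minimal R sketch of the recipe in the message above: estimate the reduced form, then read the results through the structural model. Everything here is simulated for illustration; the sample size, the "true" values of a, c, d, f and the noise level are assumptions, not course data.

```r
# Simulated data only: all numbers below are made up for illustration.
set.seed(1)
n <- 5000
X <- sample(8:20, n, replace = TRUE)   # education in years
Z <- runif(n, 20000, 120000)           # parent's income in dollars
S <- rbinom(n, 1, 0.5)                 # sex: 1 = male, 0 = female

a <- 10000; c.z <- 0.1; d <- 1500; f <- 500   # assumed true values
b.i <- d + f * S                               # structural equation for b_i
Y <- a + b.i * X + c.z * Z + rnorm(n, sd = 5000)

# Reduced form: regress Y on a constant, X, the product S*X, and Z.
fit <- lm(Y ~ X + I(S * X) + Z)
est <- coef(fit)

# Structural interpretation: effect of one more year of education.
effect.female <- unname(est["X"])                    # estimate of d
effect.male   <- unname(est["X"] + est["I(S * X)"])  # estimate of d + f
c(female = effect.female, male = effect.male)
```

The two numbers printed at the end are the quantities to report; the coefficient on the product term by itself is only the difference between them.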
Dave and Tao:
To clarify:
Is dem.win supposed to be coded 0 or -1 for a Republican victory?
This matters for several reasons:
1) GK footnote 10 and estimating the effect for Democrats and
Republicans says to subtract gamma1 from gamma0, which only makes sense if
Republican is coded as -1.
2) It changes my coefficients in 1a and makes the regression results from
factor analysis different from my other results (because factors are
treated as 0 and 1 if there are only two of them).
3) I just recoded the darned variable for the third time and really don't
want to do it again.
All I'm asking is: Is it okay to use either as long as we specify which
coding we're using? I think this is the most reasonable solution given
the lateness of the hour and the confusion over the codings given in the
GK paper, the emails, and the instructions to the problem set.
Please let me know ASAP!
Olivia.
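For what it is worth, here is a small R sketch of how the two codings relate in a simple regression. The data are simulated and the names `dwin01` and `dwinpm` are hypothetical; the point is only that a 0/1 and a -1/1 coding of the same variable are linear transformations of each other, so the fit is the same and only the coefficients are rescaled.

```r
# Simulated data; dwin01 and dwinpm are hypothetical names.
set.seed(7)
n <- 200
dwin01 <- rbinom(n, 1, 0.5)   # 1 = Democratic win, 0 = Republican win
dwinpm <- 2 * dwin01 - 1      # 1 = Democratic win, -1 = Republican win
y <- 0.45 + 0.08 * dwin01 + rnorm(n, sd = 0.05)

fit01 <- lm(y ~ dwin01)
fitpm <- lm(y ~ dwinpm)

# Same fitted values under either coding.
same.fit <- isTRUE(all.equal(fitted(fit01), fitted(fitpm)))

# The slope under the -1/1 coding is half the slope under the 0/1 coding,
# and the intercept moves to the midpoint of the two group means.
slope01 <- unname(coef(fit01)[2])
slopepm <- unname(coef(fitpm)[2])
c(slope01 = slope01, twice.slopepm = 2 * slopepm)
```

So either coding can be used as long as the write-up says which one, and the interpretation of the coefficients is adjusted accordingly.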
So,
Nirmala's dataset has 2815 rows. Mine has 3070 rows. I think the problem
is in the cleaning, not the pooling, so someone stop me where I'm going
wrong:
Load the data from the .txt files.
Generate and append the Democratic percentage and party affiliation vars.
Remove rows where
democratic percentage is 0, 1, or NA, and
remove rows where there is no/bad incumbency data (incumbency is NA or 3).
Pool the data...
What is everyone else removing that I'm not?
Thanks.
Olivia.
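The cleaning steps listed above could be sketched in R roughly as follows. The data frame `raw` and its column names are made up for illustration and will not match the actual files; tabulating each drop condition separately is one way to compare row counts with someone else's dataset.

```r
# Toy data frame standing in for one year's file (hypothetical names).
raw <- data.frame(
  dem.pct = c(0.55, 0, 1, NA, 0.48, 0.62, 0.40),
  incumb  = c(1,    0, -1, 1, 3,    NA,   -1)
)

# Rows with an unusable Democratic percentage: 0, 1, or NA.
bad.pct <- is.na(raw$dem.pct) | raw$dem.pct == 0 | raw$dem.pct == 1
# Rows with no/bad incumbency data: NA or 3.
bad.inc <- is.na(raw$incumb) | raw$incumb == 3

# How many rows each rule removes (useful when comparing counts).
c(bad.pct = sum(bad.pct), bad.inc = sum(bad.inc))

clean <- raw[!bad.pct & !bad.inc, ]
nrow(clean)
```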
ravishan(a)fas.harvard.edu writes:
>
> Dear Dave,
All questions should be sent to the list for homeworks, unless there
is a really good reason not to.
> I am not sure what we are supposed to do in 1e. More specifically, I don't
> understand how to apply the footnote from GK. Here is what I tried...let me
> know if this is right.
Could you say in words what you are trying to do? I am having trouble
following your code.
The key footnote is 10 on page 1158. One way to think about it is to
note that instead of estimating the base equation, you need to
estimate a new equation, one with both the incumbency and party
variables (as before) and with an *interaction* between the two. I
think that you want something more like:
lm(dpct ~ dpct.old + dwin + incum + dwin*incum)
and then you need to figure out how the two sets of estimated
coefficients relate to one another.
Dave
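A note on the formula syntax in the suggested `lm()` call, sketched on simulated data (the data frame `W.sim` and its values are assumptions; only the variable names follow the message). In an R formula, `dwin*incum` already expands to `dwin + incum + dwin:incum`, so listing the main effects as well is harmless and yields a single interaction coefficient named `dwin:incum`.

```r
# Simulated stand-in data; values are made up.
set.seed(3)
n <- 300
W.sim <- data.frame(
  dpct.old = runif(n, 0.2, 0.8),
  dwin     = rbinom(n, 1, 0.5),
  incum    = sample(c(-1, 0, 1), n, replace = TRUE)
)
W.sim$dpct <- 0.1 + 0.7 * W.sim$dpct.old + 0.03 * W.sim$incum +
  rnorm(n, sd = 0.05)

# The call as written in the message, and the explicit equivalent.
fit.star  <- lm(dpct ~ dpct.old + dwin + incum + dwin*incum, data = W.sim)
fit.colon <- lm(dpct ~ dpct.old + dwin + incum + dwin:incum, data = W.sim)

names(coef(fit.star))
all.equal(coef(fit.star), coef(fit.colon))
```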
> > W$newincumb[W$newincumb == -1] <- 2
> > regi <- lm(dpct ~ dpct.old + dwin + as.factor(newincumb), data = W)
> > summary(regi)
>
> Call:
> lm(formula = dpct ~ dpct.old + dwin + as.factor(newincumb), data = W)
>
> Residuals:
>        Min         1Q     Median         3Q        Max
> -0.3682688 -0.0481769 -0.0006633  0.0512496  0.3475647
>
> Coefficients:
>                        Estimate Std. Error t value Pr(>|t|)
> (Intercept)            0.137786   0.008399  16.405  < 2e-16 ***
> dpct.old               0.731240   0.016616  44.007  < 2e-16 ***
> dwin                  -0.038159   0.008711  -4.380 1.23e-05 ***
> as.factor(newincumb)1  0.064584   0.006570   9.830  < 2e-16 ***
> as.factor(newincumb)2 -0.032554   0.005739  -5.673 1.55e-08 ***
> ---
> Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
>
> Residual standard error: 0.07959 on 2810 degrees of freedom
> Multiple R-Squared: 0.7345, Adjusted R-squared: 0.7342
> F-statistic: 1944 on 4 and 2810 DF, p-value: < 2.2e-16
>
>
> Does this mean that the incumbency advantage on proportion of Democratic vote
> is a positive 6.4 percentage points in the case of a Democratic incumbent and a
> negative 3 percentage points for Republican incumbents?
>
>
> - Nirmala
>
> --
> Nirmala Ravishankar
> PhD Candidate, Government Department
> Harvard University
>
> Perkins Hall #210
> 35 Oxford Street
> Cambridge, MA 02138
> Tel: (617) 493 3460
>
--
David Kane
Lecturer in Government
617-563-0122
dkane(a)latte.harvard.edu
hi,
so we are trying to do matching and are failing miserably. this is our code,
which is mostly borrowed from the problem set 3. any suggestions?
bound$dpct.old.cut <- cut(dpct.old, 20)
matching.func <- function(n){
  y <- mean(bound[bound$incumb == 1, "dpct"])
  vector <- array(NA, n)
  ## create a vector to store averages of dem vote percentages among the 40
  ## matching categories
  dpct.inc0 <- array(NA, 40)
  k <- 1
  for(q in 1:n){
    for(party in 0:1){
      for(dempct in 1:20){
        matches <- bound[bound$incumb == 0 & bound$dwin == party &
                         levels(bound$dpct.old.cut) == levels(bound$dpct.old.cut)[dempct], ]
        draws <- matches[sample(nrow(matches),
                                nrow(bound[bound$incumb == 1 & bound$dwin == party &
                                           levels(bound$dpct.old.cut) == levels(bound$dpct.old.cut)[dempct], ])), ]
        dpct.inc0[k] <- mean(draws$dpct)
        k <- k + 1
      }
    }
    ## this vector stores differences between the average democratic percentage
    ## when the incumbent was a Democrat and the avg. dem. percentage when
    ## there was an open seat
    vector[q] <- y - mean(dpct.inc0)
  }
  return(vector)
}
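Some things that look likely to cause trouble in the code above: `levels(bound$dpct.old.cut) == levels(bound$dpct.old.cut)[dempct]` compares the 20 level labels with one of them rather than comparing each row's bin, so the row filter is recycled instead of selecting a bin; `k` is never reset, so it runs past 40 after the first pass of the outer loop; and `cut(dpct.old, 20)` looks for `dpct.old` outside the data frame. Below is one possible sketch of the same idea, run on simulated data. The function name `matching.sketch`, the toy `bound` data frame, and the choice to sample matches with replacement (so that a small pool of open-seat rows does not stop the draw) are all assumptions, not the problem set's prescribed method.

```r
# Simulated stand-in for `bound`; all values are made up.
set.seed(42)
n.obs <- 2000
bound <- data.frame(dpct.old = runif(n.obs, 0.2, 0.8),
                    incumb   = rbinom(n.obs, 1, 0.5))
bound$dwin <- as.integer(bound$dpct.old > 0.5)
bound$dpct <- 0.1 + 0.8 * bound$dpct.old + 0.05 * bound$incumb +
  rnorm(n.obs, sd = 0.05)
bound$dpct.old.cut <- cut(bound$dpct.old, 20)

matching.sketch <- function(n.draws, data) {
  treated <- data[data$incumb == 1, ]
  control <- data[data$incumb == 0, ]
  y <- mean(treated$dpct)
  out <- numeric(n.draws)
  for (q in seq_len(n.draws)) {
    drawn <- numeric(0)   # reset for every replication
    for (party in 0:1) {
      for (bin in levels(data$dpct.old.cut)) {
        # Compare each row's bin with the current bin label.
        pool <- control$dpct[control$dwin == party &
                             control$dpct.old.cut == bin]
        n.need <- sum(treated$dwin == party & treated$dpct.old.cut == bin)
        # Cells with no treated rows or no matches are skipped.
        if (n.need > 0 && length(pool) > 0) {
          idx <- sample.int(length(pool), n.need, replace = TRUE)
          drawn <- c(drawn, pool[idx])
        }
      }
    }
    out[q] <- y - mean(drawn)
  }
  out
}

diffs <- matching.sketch(50, bound)
summary(diffs)
```

Pooling the drawn matches before averaging weights each cell by its number of incumbent rows, which differs from taking an unweighted mean of 40 cell means. Note also that treated rows in a cell with no open-seat matches still count toward `y` but contribute nothing to `drawn`.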
for 4), i think it's pretty clear what we need. if you find it puzzling,
go back to the definition of identity and diagonal matrix and define an
arbitrary one to evaluate it. or use a numerical example to try for
yourself.
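A numerical example of the kind suggested above might look like this in R. The matrices `A` and `B` here are arbitrary choices for illustration, with `A` assumed to be diagonal; they are not the ones from the problem set.

```r
I3 <- diag(3)                # 3 x 3 identity matrix
B  <- matrix(1:9, nrow = 3)  # arbitrary 3 x 3 matrix (made up)
A  <- diag(c(2, 3, 4))       # arbitrary diagonal matrix (made up)

IB <- I3 %*% B    # multiplying by the identity leaves B unchanged
I2 <- I3 %*% I3   # the identity times itself is the identity
AB <- A %*% B     # row i of B gets multiplied by A[i, i]
A2 <- A %*% A     # diagonal matrix with squared diagonal entries

IB; I2; AB; A2
```

In R, `%*%` is the matrix product, while `A^2` squares element by element. For a diagonal matrix the two happen to give the same result, but for a general matrix they do not.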
Dear Colleagues,
Pardon my all-too-constant presence in your in-boxes.
We are trying to create a matrix on Nirmala's recommendation, but have not yet
been successful. Our attempt is below. Thoughts on how to successfully do
this are much appreciated.
> help(matrix)
> m1 <- matrix(clean8a$dempct.08)
> m2 <- matrix(clean8a$demwin.08)
> m3 <- matrix(clean8a$incum.10)
> dim(m1)
[1] 2357 1
> m4 <- cbind(m1, m2, m3)
> dim(m4)
[1] 2357 3
So it has the right dimensions...
> is.matrix(m4)
[1] TRUE
And it is a matrix...
> library(MASS)
> help(ginv)
> m5 <- ginv(m4)
Error in svd(X) : NA/NaN/Inf in foreign function call (arg 1)
And yet, the "ginv" function doesn't seem able to operate on it...
> m5 <- t(m4)
> dim(m5)
[1] 3 2357
>
...despite the fact that "t" does.
Best,
Dan
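The error message suggests that the matrix contains missing or non-finite values: `t()` only rearranges entries, so it does not mind `NA`, whereas `ginv()` passes the matrix to `svd()`, which refuses them. A small sketch of how one might check and work around this, using a made-up matrix in place of the real `m4`:

```r
library(MASS)

# Made-up stand-in for m4, with a couple of missing values.
m4.toy <- cbind(dempct = c(0.55, 0.42, NA, 0.61, 0.48),
                demwin = c(1, 0, 1, 1, 0),
                incum  = c(1, -1, 0, 1, NA))

anyNA(m4.toy)           # TRUE if any entry is NA
colSums(is.na(m4.toy))  # which columns contain the NAs

# Keep only rows with no missing values, then invert.
m4.complete <- m4.toy[complete.cases(m4.toy), , drop = FALSE]
m5 <- ginv(m4.complete)
dim(m5)
```

Whether dropping incomplete rows is the right treatment for the actual data is a separate question; this only shows why `ginv()` stops where `t()` does not.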
Dear All,
Questions 4a and 4b ask us to find IB, I^2 and AB, A^2. Are those commas some
form of matrix notation, or should we simply find each of the four matrices
given separately?
Best and thanks as always,
Dan