Here's a much easier way to deal with NAs:
Instead of writing cor(x,y), you could write out
var(x,y)/sqrt(var(x)*var(y)).
Notice that when you give var both x and y, it returns the covariance.
But var has an option allowing you to ignore entries with missing
values, 'na.rm' (for "remove NAs"):
var(x,y,na.rm=TRUE)/sqrt(var(x,na.rm=TRUE)*var(y,na.rm=TRUE))
That gives you the correlation between x and y.
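A quick sketch of that formula on made-up data. (For what it's worth, cor itself also takes a 'use' argument; the two approaches agree here because the NAs fall in the same rows of x and y.)

```r
x <- c(1, 2, 3, 4, NA)
y <- c(2, 4, 5, 4, NA)

# var(x, y) is the covariance; dividing by the product of the
# standard deviations gives the correlation
r <- var(x, y, na.rm = TRUE) /
  sqrt(var(x, na.rm = TRUE) * var(y, na.rm = TRUE))

# agrees with cor on the complete cases in this example
cor(x, y, use = "complete.obs")
```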
(1) The code is doing what it is supposed to do.
lmlast is the last model which passed an F test over its predecessor.
lmnew is the most recent innovation. So after the code you mailed me, I
find that ftest(lmlast,lmnew) = .059, meaning that lmnew is not
sufficiently better than lmlast, so the while loop closes and you keep
lmlast.
(2) As for the problem with the residuals: that line runs fine. If you
want me to try to replicate your problem, you need to send me everything
you ran prior to getting that error message. I also replied last night
with some help on debugging the problem yourself.
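A minimal sketch of the kind of check I have in mind. The object names match the homework code, but the data here are made up purely for illustration:

```r
# toy stand-ins for the homework objects
data.approval <- data.frame(approval = c(60, 62, 58, 65, 70),
                            unrate   = c(5.0, 5.2, 5.1, 4.9, 4.8))
lmfinal <- lm(approval ~ unrate, data = data.approval)

# suppose 'approval' was later modified so it no longer matches the data set
approval <- c(data.approval$approval, 55)   # one stray extra element

# compare lengths before subtracting; a mismatch like this is what
# triggers the "longer object length" warning
length(approval)                            # 6
length(predict.lm(lmfinal, data.approval))  # 5
```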
-----Original Message-----
From: Bettina Raquel Scholz [mailto:scholz@fas.harvard.edu]
Sent: Friday, January 09, 2004 10:32 PM
To: jkline(a)fas.harvard.edu
Subject: Fwd: Re: Fwd: RE: Gov 1000 question
# Prospective variables
>
>
> ftestvalue <- 0
> i <- 1
> lmnew <- lmlast
> varsnew <- vars
>
> while (ftestvalue < .05) {
+ lmlast <- lmnew
+ vars <- varsnew
+ text <- paste("inflationMonthlytp",i,"<- c(inflationMonthly[",(i+1),":length(Date)],rep(NA,",i,"))",sep="")
+ eval(parse(text=text))
+ text <- paste("unratetp",i,"<- c(unrate[",(i+1),":length(Date)],rep(NA,",i,"))",sep="")
+ eval(parse(text=text))
+ varsnew <- paste(vars," + inflationMonthlytp",i," + unratetp",i,sep="")
+ text <- paste("lmnew <- lm(approval ~ ",varsnew,",subset=usable)",sep="")
+ eval(parse(text=text))
+ ftestvalue <- ftest(lmlast,lmnew)
+ i <- i + 1
+ }
> summary(lmlast)
Call:
lm(formula = approval ~ inflationMonthly + unrate + eisenhower +
kennedy + nixon + ford + carter + regan + bush + inflationMonthlytp1
+
unratetp1 + inflationMonthlytp2 + unratetp2 + inflationMonthlytp3 +
unratetp3 + inflationMonthlytp4 + unratetp4, subset = usable)
Residuals:
Min 1Q Median 3Q Max
-23.6309 -4.9153 0.2136 4.5721 25.3518
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 82.5270 2.2346 36.931 < 2e-16 ***
inflationMonthly -5.0226 1.6678 -3.012 0.002734 **
unrate -0.4890 1.8639 -0.262 0.793159
eisenhower 6.8865 1.2501 5.509 5.85e-08 ***
kennedy 15.9619 1.6339 9.769 < 2e-16 ***
nixon 1.3001 1.4739 0.882 0.378164
ford 10.3949 2.1778 4.773 2.40e-06 ***
carter 10.3501 2.1426 4.831 1.83e-06 ***
regan 11.0687 1.4910 7.424 5.13e-13 ***
bush 13.0309 1.5404 8.460 3.15e-16 ***
inflationMonthlytp1 -3.1411 1.6976 -1.850 0.064876 .
unratetp1 -1.6046 2.7180 -0.590 0.555224
inflationMonthlytp2 -3.6101 1.7142 -2.106 0.035714 *
unratetp2 -2.5850 2.7267 -0.948 0.343589
inflationMonthlytp3 -5.6802 1.6877 -3.366 0.000824 ***
unratetp3 -0.7133 2.7153 -0.263 0.792899
inflationMonthlytp4 -4.7337 1.6414 -2.884 0.004102 **
unratetp4 1.0130 1.9014 0.533 0.594426
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1
Residual standard error: 8.001 on 487 degrees of freedom
Multiple R-Squared: 0.5612, Adjusted R-squared: 0.5458
F-statistic: 36.63 on 17 and 487 DF, p-value: < 2.2e-16
> ftestvalue
[1] 0.05912646
Jacob,
Here is the code that gave us the .059 result. We kind of concluded
that we might have gotten this because some of the prospective months
were insignificant and others were significant, and the .059 was a
composite F score.
While I am at it, what about this question:
We keep getting the same error when we try to run the residual part:
> residual <- approval - predict.lm(lmfinal,data.approval)
Warning in approval - predict.lm(lmfinal, data.approval) :
longer object length
is not a multiple of shorter object length
Prior to this message we followed your code as given on the solutions
set.
Thanks
Bettina and company
Quoting soroka(a)fas.harvard.edu:
>
>
> ----- Forwarded message from Jacob Kline <jkline(a)fas.harvard.edu> -----
> Date: Fri, 9 Jan 2004 20:43:11 -0500
> From: Jacob Kline <jkline(a)fas.harvard.edu>
> Reply-To: Jacob Kline <jkline(a)fas.harvard.edu>
> Subject: RE: Gov 1000 question
> To: soroka(a)fas.harvard.edu
>
> interspersed
>
> -----Original Message-----
> From: soroka(a)fas.harvard.edu [mailto:soroka@fas.harvard.edu]
> Sent: Friday, January 09, 2004 8:35 PM
> To: jkline(a)fas.harvard.edu
> Cc: scholz(a)fas.harvard.edu; sarabyn(a)fas.harvard.edu;
> tung(a)fas.harvard.edu
> Subject: Gov 1000 question
>
>
> Hi Jacob,
>
> The Theory + 1 Soft Comparativist Group is studying tonight. We were
> running the code you gave as an answer to homework 7/#3 and had the
> following question arise:
>
> We ran the prospective variables and then did the "ftestvalue", which
> gave us an answer of .05912646. However, when we ran "summary(lmlast)"
> we found prospective values were included. Why is this? Wouldn't they
> only be included if the ftestvalue <= .05?
>
> It is always possible that something is wrong with my code. That said,
> I need you to send me the code that produced the .059 in order to
> investigate. One possible confusion: the way I wrote the code, it adds
> only one prospective variable at a time and compares models with n and
> n-1 prospective variables. Maybe your .059 corresponds to a comparison
> between a model with n prospective variables and one with zero? In
> retrospect, that is the right comparison to make, and the way I have
> done it in the key is somewhat unsatisfactory.
>
> Second, looking at lag approval variables we ran:
>
> residual <- approval - predict.lm(lmfinal,data.approval) but got back a
> message "warning in approval...longer object length is not a multiple
> of shorter object length"
>
> Probably you have modified the variable 'approval' from its length in
> the original data set. A quick check: type length(approval),
> length(data.approval$approval), and
> length(predict.lm(lmfinal,data.approval)). Which of these do not agree?
>
> What's going on?
>
> Thanks,
>
> George
>
> ----- End forwarded message -----
>
>
>
>
----- End forwarded message -----
1. OK - I send you the password tomorrow afternoon.
2. Well -- I didn't exactly personally choose to include all those
different lags. I let the computer keep adding month after month until
the additional r^2 wasn't great enough to justify adding any more
months. That may not be the best way to choose the best model, but it
satisfies the criterion I set up at the beginning of the write-up.
Additionally, it does make sense. We can all agree that the gist of MES
is reasonable, i.e. that my (presumably accurate) expectations about
the future economy affect my support of the President. OK -- but what
in particular about the future economy, and when? What justifies picking
out one quarter, or one month? No -- what seems reasonable to me is that
my fuzzy sense of the future of the economy is some vague combination of
all the economic indicators for the next year or so. You might say: why
not then just look at the average unemployment or the average inflation
rate over the next few months? To which I would answer: "Sure, OK.
That is very defensible." But try to understand the point of view I
offered in the solutions (which admittedly is not my real point of
view). What I say in the write-up is: "Why make any kind of decision
about what combination of economic indicators is best? (For instance,
average unemployment six months out is just the average of the six
months that are eventually included.) In principle the F test can tell
me whether I'm justified in adding in some more variables."
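A minimal sketch of that kind of nested-model F test in base R. (The homework code uses a helper called ftest; anova() on two nested lm fits reports the equivalent test. The data below are simulated purely for illustration.)

```r
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 + rnorm(n)   # x2 contributes nothing here

small <- lm(y ~ x1)
big   <- lm(y ~ x1 + x2)

# p-value of the F test: is the bigger model significantly better?
pval <- anova(small, big)[2, "Pr(>F)"]
```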
Now what also comes up in the write-up is that this approach isn't very
good for answering the kinds of questions in the hw. For instance, if I
want to know how people react to rising unemployment, the model I end up
with is sort of annoying -- too many coefficients, all in different
directions. But bear in mind that the definition of 'best model' which
I started with had nothing to do with answering those kinds of
questions. If I had said "Give me the best model for answering
questions a through e," the model I came up with definitely wouldn't
have been my choice. In that case, I would probably want one clear
indicator of past, present, and future economic well-being.
I am going in to Littauer for the next few hours, if you want to stop by
the lounge and chat about this.
As for the control variables: sure, you don't have to use the Q-Q plot.
I hope that was clear enough in the write-up. If you have specific
information about what months are screwy, you should use it. The point
of my ignoring that kind of information was to show you that a totally
blind approach can still figure out which events are outliers -- and
that approach is in some ways superior. (For instance, I never have to
assume that one kind of bad month is just as bad, or twice as bad as the
next month. When you do that, you add an assumption to your work that is
a little ridiculous, but justifiable.)
-----Original Message-----
From: Paul Bodnar [mailto:bodnar@fas.harvard.edu]
Sent: Thursday, January 08, 2004 6:59 PM
To: Jacob Kline
Subject: Re: [gov1000-list] updated list of delayed passwords
Jacob,
Two things. First, I'd like to start the exam on Friday at 5.
Second, my group has a couple of questions about the way you determined
the "best" model.
Basically we're confused about why you put so many measures of the same
variable into one model. What's the rationale for including 6 different
lags of the same variable in a model, given that those must be
correlated?
Also, I'm just confirming that it is OK to code control events as we've
been doing instead of using the Q-Q plot as you did in the solution set,
assuming we state our assumptions, etc.
Thanks,
Paul
Quoting Jacob Kline <jkline(a)fas.harvard.edu>:
> Saturday.
>
> Soroka
> Tung
> Scholz
> Liebman
> Mohandas
> Bodnar
>
> Anybody else want to wait until later?
>
>
--
Paul Bodnar
Department of Government
Harvard University
+1 617 230 4525
bodnar(a)fas.harvard.edu
Inflation is seasonally adjusted. inflationMonthly is log(CPI[t]) -
log(CPI[t-1]).
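A minimal sketch of that transformation on a made-up CPI series (the numbers are illustrative, not the class data):

```r
# hypothetical monthly CPI values
CPI <- c(180.0, 180.5, 181.2, 181.0, 182.1)

# monthly inflation as the log difference log(CPI[t]) - log(CPI[t-1]);
# the first month has no predecessor, so it gets NA
inflationMonthly <- c(NA, diff(log(CPI)))
```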
-----Original Message-----
From: Adam N Gossen [mailto:agossen@fas.harvard.edu]
Sent: Saturday, January 10, 2004 9:57 AM
To: gov1000(a)fas.harvard.edu
Subject: inflationMonthly variable
Hi Jacob,
I think you might have mentioned this in the past at some point, but can
you refresh my memory about what the "inflationMonthly" variable is? And
what the difference between it and "inflation" is?
Thanks,
Nick
Unlike Stata, R has no easy undo command.
If you know exactly what you did, you could undo it.
So unrate <- unrate - 1 would undo your transformation.
If you are going to mess around with the data, it is best to save the
altered variables to different variable names.
The safest bet is to reload the data set entirely if you think anything
might be amiss.
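A minimal sketch of the save-a-copy habit ('unrate' matches the variable in the question; the values are made up):

```r
unrate <- c(5.0, 5.2, 5.1)   # pretend this came from the data set
unrate_orig <- unrate        # keep an untouched copy before transforming

unrate <- unrate + 1         # the transformation
unrate <- unrate - 1         # manual undo works if you know what you did

unrate <- unrate_orig        # or simply restore from the saved copy
```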
-----Original Message-----
From: Paul Bodnar [mailto:bodnar@fas.harvard.edu]
Sent: Friday, January 09, 2004 9:44 PM
To: Jacob Kline
Subject: R question
When you mess around with variables, like unrate, and then want to start
again
with the original version as it is in the data set, is there an R
command that
lets you reset that variable to its original values? So for example, if
I did
unrate <-unrate+1
and then thought, no I'd like unrate back again, what would I do?
Paul
No need for anything complicated -- you are just supposed to provide R
code which can be cut and pasted into R to produce the results of that
table. Also, you want to provide a brief comment describing your
result. Does your finding support the estimates of the original
researcher?
-----Original Message-----
From: Adam N Gossen [mailto:agossen@fas.harvard.edu]
Sent: Friday, January 09, 2004 6:13 PM
To: gov1000(a)fas.harvard.edu
Subject: Question 4
Hi Jacob,
Excuse me if you've answered this already -- I get the digest and
sometimes I'm a bit behind.
On question 4 we are just asked to replicate the table.
...
I'm not having any trouble getting perfect results. So what are we
supposed to write up? Are we supposed to imagine the rationale for
constructing such a table?
Thanks,
Nick