Hi Jas.
Thanks for your response. A couple brief follow-ups:
1) It follows that if there were a Z that explained all of the variance in Y and
was completely unrelated to X, then the coefficients on X would be zero,
regardless of whether Z was included in the analysis. Correct?
2) If Z only explained some of the variance in Y, but not all, and was also
unrelated to X, then the basic OLS assumption about the error term holds,
even though our model leaves out important variables that determine Y and might
be more important than those we included. Our coefficient estimates are not
biased. Correct?
3) If Z explains some of the variance in Y and is related to X, but is not
included in the analysis, its presence in the error term systematically biases
our results, making the coefficients wrong. Is it possible to make a statement
like "the more Z is related to X, the more wrong your coefficients are," or is
this not meaningful? In other words, if this premise (Z explains some of Y and
is related to X but is not included) holds, what can we say about our
coefficients besides "they are wrong"? Can we say how wrong? Can we say
whether the signs are correct? How far off the standard errors are? Anything?
Thanks again.
See you soon
jason
Quoting Jasjeet Singh Sekhon <jasjeet_sekhon(a)harvard.edu>:
Hi Jason,
Thanks for the regression questions. Here are some answers.
So I am still trying to puzzle through the logic of regression,
particularly the importance of the error term. My recollection is that
there must be nothing in the error term that is systematically related
to both the independent and the dependent variable. Is this correct?
Everything in the error term must be either unrelated (i.e.,
orthogonal) to the dependent variable (Y) **OR** unrelated to the
independent variables (X). Stuff in the error term need not be BOTH
unrelated to Y AND X.
Recall this is what makes an experiment work. In an experiment X is
the treatment and Y the outcome. There is lots of stuff in the error
term related to Y, but none of it is systematically related to X
(because X is randomly assigned).
As an aside, there is a narrow exception to this weakened condition
which allows *some* stuff in the error term to be related to both X
and Y. This exception is the concept of "post-treatment bias," but we
have not yet discussed it in class. I will probably get to it today in
lecture.
If so, then the question is: is it accurate to then say that noise in
the error term that is systematically related to the dependent
variable but NOT to the independent variables will in fact change the
value of the coefficients on the included independent variable terms,
but not the general relationship?
No. Anything in the error term which is unrelated (i.e., orthogonal)
to the independent variables cannot affect (in expectation) the
estimates of the coefficients associated with the independent
variables---no matter what the relationship is between the stuff in
the error term and our dependent variable. Including new variables
which are orthogonal to the included independent variables can change
our coefficient estimates of the existing variables only because of
issues related to efficiency (we will talk about this next week). But
in expectation, the coefficient estimates of the existing variables
remain the same.
Alternatively, it seems to me that an omitted variable that was
unrelated to the independent variables could still explain all of the
variance in the dependent variable. Then the coefficients would be
entirely meaningless, except in a sense devoid of any causal
inference.
Let Z be our left out variable. And X denote our independent
variables and Y our dependent variable. Then
IF
A) the left out variable, Z, explains all the variance of dependent
variable, Y.
AND
B) our independent variables (X) are related to Y.
THEN it is not possible that Z is unrelated (i.e., orthogonal) to X.
To put it another way (note that "abs()" denotes absolute value):
If abs(cor(Y,X)) > 0 and abs(cor(Y,Z)) = 1, then cor(Z,X) cannot equal
zero
(there are knife edge exceptions having to do with sampling
uncertainty, but they are irrelevant for the general point).
(i am oversimplifying). let's say, just for the sake of argument, that
the "true" determinant of level of foreign ownership was a completely
exogenous factor unrelated to these independent variables-- like a WTO
Rule that strictly regulated levels of foreign ownership based on
another set of criteria like how close the president of the host
country was to the president of the WTO. Say that one could in fact
model the entire dependent variable perfectly against this independent
variable if one knew about it, and could measure "friendliness." But
this researcher did not know about it, nor how to measure it.
Would I be correct to infer that the coefficients from the model he
did use (if this model were OLS and the assumption of linearity held)
were useful in summarizing the relationship between these independent
variables and the dependent variable (in the way of cross-tabs, as we
discussed last time) but that they tell you nothing about the causal
relationship?
Whether the coefficients tell us anything about a causal relationship
depends on a ton of stuff related to research design which I cannot
speak to in this example because I'm not familiar with the article to
which you are referring. *BUT* it is not possible for this left out
exogenous variable to be unrelated to the independent variables but
explain all of the variance of the dependent variable even though the
independent variables are systematically related to the dependent
variable (see above).
Would i be further correct in saying that these coefficients are
"correct" insofar as we take them to be measuring this direct
relationship and not a causal relationship? Or are they "biased" or
"incorrect" because of the exclusion of the key variable?
The exclusion of a variable which is orthogonal will not bias our
estimates.
in this fake example, am i correct to assume that the inclusion of the
key independent variable (assuming no measurement error) would reduce
the other coefficients to zero? Or is this not mathematically quite
right?
We don't get to this question because of the previous issues. But we
may get to a related question, which you should ask in class today.
Cheers,
JS.
Jason Lakin writes:
Hi Jas. How are you?

So I am still trying to puzzle through the logic of regression,
particularly the importance of the error term. My recollection is that
there must be nothing in the error term that is systematically related
to both the independent and the dependent variable. Is this correct?

If so, then the question is: is it accurate to then say that noise in
the error term that is systematically related to the dependent
variable but NOT to the independent variables will in fact change the
value of the coefficients on the included independent variable terms,
but not the general relationship? Alternatively, it seems to me that
an omitted variable that was unrelated to the independent variables
could still explain all of the variance in the dependent variable.
Then the coefficients would be entirely meaningless, except in a sense
devoid of any causal inference.

to take an example:

in a paper from my IPE class, the model (which does not use OLS,
actually, but something called Tobit because the distribution is
truncated at the top-- forget about this for the moment), is the
following:

dependent variable: percent of US versus Host Country ownership of
U.S. multi-national subsidiaries in LDC's
the independent variables are: index of bargaining power of gov't,
index of bargaining power of Multi-National Corporation, economic
controls.

(i am oversimplifying). let's say, just for the sake of argument, that
the "true" determinant of level of foreign ownership was a completely
exogenous factor unrelated to these independent variables-- like a WTO
Rule that strictly regulated levels of foreign ownership based on
another set of criteria like how close the president of the host
country was to the president of the WTO. Say that one could in fact
model the entire dependent variable perfectly against this independent
variable if one knew about it, and could measure "friendliness." But
this researcher did not know about it, nor how to measure it.

Would I be correct to infer that the coefficients from the model he
did use (if this model were OLS and the assumption of linearity held)
were useful in summarizing the relationship between these independent
variables and the dependent variable (in the way of cross-tabs, as we
discussed last time) but that they tell you nothing about the causal
relationship? Would i be further correct in saying that these
coefficients are "correct" insofar as we take them to be measuring
this direct relationship and not a causal relationship? Or are they
"biased" or "incorrect" because of the exclusion of the key variable?
in this fake example, am i correct to assume that the inclusion of the
key independent variable (assuming no measurement error) would reduce
the other coefficients to zero? Or is this not mathematically quite
right?

i hope this is not too confusing...

thanks
jason
_______________________________________________
gov1000-list mailing list
gov1000-list(a)fas.harvard.edu
http://www.fas.harvard.edu/mailman/listinfo/gov1000-list