Hi Jas.
Thanks for your response. A couple brief follow-ups:
1) It follows that if there were a Z that explained all of the variance in Y and
was completely unrelated to X, then the coefficients on X would be zero,
regardless of whether Z was included in the analysis. Correct?
2) If Z only explained some of the variance in Y, but not all, and was also
unrelated to X, then the basic OLS assumption about the error term holds,
even though our model leaves out important variables that determine Y and might
be more important than those we included. Our coefficient estimates are not
biased. Correct?
3) If Z explains some of the variance in Y and is related to X, but is not
included in the analysis, its presence in the error term systematically biases
our results, making the coefficients wrong. Is it possible to make a statement
like "the more Z is related to X, the more wrong your coefficients are," or is
this not meaningful? In other words, if this premise (Z explains some of Y and
is related to X but is not included) holds, what can we say about our
coefficients besides "they are wrong"? Can we say how wrong? Can we say
whether the signs are correct? How far off the standard errors are? Anything?
Thanks again.
See you soon
jason
Quoting Jasjeet Singh Sekhon <jasjeet_sekhon(a)harvard.edu>:
Hi Jason,
Thanks for the regression questions. Here are some answers.
So I am still trying to puzzle through the logic of regression,
particularly the importance of the error term. My recollection is that
there must be nothing in the error term that is systematically related
to both the independent and the dependent variable. Is this correct?
Everything in the error term must be either unrelated (i.e.,
orthogonal) to the dependent variable (Y) **OR** unrelated to the
independent variables (X). Stuff in the error term need not be BOTH
unrelated to Y AND X.
Recall this is what makes an experiment work. In an experiment X is
the treatment and Y the outcome. There is lots of stuff in the error
term related to Y, but none of it is systematically related to X
(because X is randomly assigned).
As an aside, there is a narrow exception to this weakened condition
which allows *some* stuff in the error term to be related to both X
and Y. This exception is the concept of "post-treatment bias," but we
have not yet discussed it in class. I will probably get to it today in
lecture.
If so, then the question is: is it accurate to then say that noise in
the error term that is systematically related to the dependent
variable but NOT to the independent variables will in fact change the
value of the coefficients on the included independent variable terms,
but not the general relationship?
No. Anything in the error term which is unrelated (i.e., orthogonal)
to the independent variables cannot affect (in expectation) the
estimates of the coefficients associated with the independent
variables---no matter what the relationship is between the stuff in
the error term and our dependent variable. Including new variables
which are orthogonal to the included independent variables can change
our coefficient estimates of the existing variables only because of
issues related to efficiency (we will talk about this next week). But
in expectation, the coefficient estimates of the existing variables
remain the same.
Alternatively, it seems to me that an omitted variable that was
unrelated to the independent variables could still explain all of the
variance in the dependent variable. Then the coefficients would be
entirely meaningless, except in a sense devoid of any causal
inference.
Let Z be our left out variable. And X denote our independent
variables and Y our dependent variable. Then
IF
A) the left out variable, Z, explains all the variance of dependent
variable, Y.
AND
B) our independent variables (X) are related to Y.
THEN it is not possible that Z is unrelated (i.e., orthogonal) to X.
To put it another way (note that "abs()" denotes absolute value):
If abs(cor(Y,X)) > 0 and abs(cor(Y,Z)) = 1, then cor(Z,X) cannot equal
zero
(there are knife edge exceptions having to do with sampling
uncertainty, but they are irrelevant for the general point).
(i am oversimplifying). let's say, just for the sake of argument, that
the "true" determinant of level of foreign ownership was a completely
exogenous factor unrelated to these independent variables-- like a WTO
Rule that strictly regulated levels of foreign ownership based on
another set of criteria like how close the president of the host
country was to the president of the WTO. Say that one could in fact
model the entire dependent variable perfectly against this independent
variable if one knew about it, and could measure "friendliness." But
this researcher did not know about it, nor how to measure it.
Would I be correct to infer that the coefficients from the model he
did use (if this model were OLS and the assumption of linearity held)
were useful in summarizing the relationship between these independent
variables and the dependent variable (in the way of cross-tabs, as we
discussed last time) but that they tell you nothing about the causal
relationship?
Whether the coefficients tell us anything about a causal relationship
depends on a ton of stuff related to research design which I cannot
speak to in this example because I'm not familiar with the article to
which you are referring. *BUT* it is not possible for this left out
exogenous variable to be unrelated to the independent variables but
explain all of the variance of the dependent variable even though the
independent variables are systematically related to the dependent
variable (see above).
Would i be further correct in saying that these coefficients are
"correct" insofar as we take them to be measuring this direct
relationship and not a causal relationship? Or are they "biased" or
"incorrect" because of the exclusion of the key variable?
The exclusion of a variable which is orthogonal will not bias our
estimates.
in this fake example, am i correct to assume that the inclusion of the
key independent variable (assuming no measurement error) would reduce
the other coefficients to zero? Or is this not mathematically quite
right?
We don't get to this question because of the previous issues. But we
may get to a related question, which you should ask in class today.
Cheers,
JS.
Jason Lakin writes:
Hi Jas. How are you?

So I am still trying to puzzle through the logic of regression,
particularly the importance of the error term. My recollection is that
there must be nothing in the error term that is systematically related
to both the independent and the dependent variable. Is this correct?

If so, then the question is: is it accurate to then say that noise in
the error term that is systematically related to the dependent
variable but NOT to the independent variables will in fact change the
value of the coefficients on the included independent variable terms,
but not the general relationship? Alternatively, it seems to me that
an omitted variable that was unrelated to the independent variables
could still explain all of the variance in the dependent variable.
Then the coefficients would be entirely meaningless, except in a sense
devoid of any causal inference.

to take an example:

in a paper from my IPE class, the model (which does not use OLS,
actually, but something called Tobit because the distribution is
truncated at the top-- forget about this for the moment), is the
following:

dependent variable: percent of US versus Host Country ownership of
U.S. multi-national subsidiaries in LDC's
the independent variables are: index of bargaining power of gov't,
index of bargaining power of Multi-National Corporation, economic
controls.

(i am oversimplifying). let's say, just for the sake of argument, that
the "true" determinant of level of foreign ownership was a completely
exogenous factor unrelated to these independent variables-- like a WTO
Rule that strictly regulated levels of foreign ownership based on
another set of criteria like how close the president of the host
country was to the president of the WTO. Say that one could in fact
model the entire dependent variable perfectly against this independent
variable if one knew about it, and could measure "friendliness." But
this researcher did not know about it, nor how to measure it.

Would I be correct to infer that the coefficients from the model he
did use (if this model were OLS and the assumption of linearity held)
were useful in summarizing the relationship between these independent
variables and the dependent variable (in the way of cross-tabs, as we
discussed last time) but that they tell you nothing about the causal
relationship? Would i be further correct in saying that these
coefficients are "correct" insofar as we take them to be measuring
this direct relationship and not a causal relationship? Or are they
"biased" or "incorrect" because of the exclusion of the key variable?
in this fake example, am i correct to assume that the inclusion of the
key independent variable (assuming no measurement error) would reduce
the other coefficients to zero? Or is this not mathematically quite
right?

i hope this is not too confusing...

thanks
jason
_______________________________________________
gov1000-list mailing list
gov1000-list(a)fas.harvard.edu
http://www.fas.harvard.edu/mailman/listinfo/gov1000-list