Econometrics
Linear Regression with Multiple Regressors
Keisuke Kawata
Hiroshima University
Linear Regression with Multiple Regressors
• In the matching approach, we often face the small-sample problem.
⇒ We use an alternative approach: linear regression with multiple regressors.
• In this approach, we assume the conditional mean follows a linear function:
E[Y_i | T_i, X_{1,i}, …, X_{K,i}] = β_0 + β_1 T_i + β_2 X_{1,i} + ⋯ + β_{K+1} X_{K,i},
where the X's are control variables.
⇒ We can estimate the parameters by least squares under a couple of assumptions.
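As a minimal sketch (simulated data, all coefficient values hypothetical; Python/numpy assumed, not part of the course), the linear conditional mean can be estimated by least squares:

```python
import numpy as np

# Hypothetical simulated data: one treatment T and one control X
rng = np.random.default_rng(0)
n = 1000
T = rng.integers(0, 2, n).astype(float)            # treatment indicator
X = rng.normal(size=n)                             # control variable
Y = 1.0 + 2.0 * T + 0.5 * X + rng.normal(size=n)   # true treatment effect = 2.0

# Least squares fit of the linear conditional mean E[Y | T, X]
Z = np.column_stack([np.ones(n), T, X])            # design matrix with intercept
beta_hat, *_ = np.linalg.lstsq(Z, Y, rcond=None)
print(beta_hat)  # close to the true values [1.0, 2.0, 0.5]
```

With the controls included in the design matrix, the coefficient on T estimates the treatment effect holding the X's fixed.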
Population model with multiple regressors
• Suppose the (linear) population model:
Y_i = β_0 + β_1 T_i + β_2 X_{1,i} + ⋯ + β_{K+1} X_{K,i} + u_i,
where the β's are parameters, and u_i captures the effects of other factors.
⇒ β_1 captures the causal effect of the treatment.
e.g.) Estimating the impact of education on income:
• Income_i = β_0 + β_1 × Education_i + u_i
⇒ u includes the effects of gender and luck.
• Income_i = β_0 + β_1 × Education_i + β_2 × Gender_i + u_i
⇒ u does not include the effect of gender.
OLS estimators in multiple regression
• As in the single-regressor case, the OLS estimators β̂_0, β̂_1, …, β̂_K are chosen to minimize the following total squared gap:
Σ_{i=1}^{n} (Y_i − b_0 − b_1 X_{1,i} − b_2 X_{2,i} − ⋯ − b_K X_{K,i})²
• Using the estimated error terms (residuals) û_i = Y_i − β̂_0 − β̂_1 X_{1,i} − ⋯ − β̂_K X_{K,i}, the above gap can be rewritten as
Σ_{i=1}^{n} û_i²
⇒ The OLS estimators minimize the total sum of squared residuals.
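The minimization above can be sketched numerically (hypothetical simulated data): the OLS coefficients solve the normal equations Z′Z b = Z′Y, and the resulting residuals are orthogonal to every regressor, including the constant:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
X1 = rng.normal(size=n)
X2 = rng.normal(size=n)
Y = 0.5 + 1.5 * X1 - 1.0 * X2 + rng.normal(size=n)

Z = np.column_stack([np.ones(n), X1, X2])
# First-order conditions of the squared-gap minimization: Z'Z b = Z'Y
b = np.linalg.solve(Z.T @ Z, Z.T @ Y)
u_hat = Y - Z @ b                     # estimated error terms (residuals)

# Residuals sum to (numerically) zero and are orthogonal to each regressor
print(Z.T @ u_hat)
```

The orthogonality conditions are exactly the derivatives of the sum of squared gaps set to zero, one per coefficient.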
Least Squares Assumptions in multiple regression
The least squares assumptions in multiple regression:
1. Your data is pure random sampling (i.i.d.) data.
2. The conditional mean of u_i is zero: E[u_i | T_i, X_{1,i}, …, X_{K,i}] = 0.
3. For any T and T′, E[u_i | T, X_1, …, X_K] = E[u_i | T′, X_1, …, X_K].
4. There is no perfect multicollinearity.
• If these least squares assumptions hold, the OLS estimators β̂ are
– unbiased,
– normally distributed under a large sample size.
Interpretation: E[u_i | T, X_1, …, X_K] = E[u_i | T′, X_1, …, X_K]
• The condition E[u_i | T, X_1, …, X_K] = E[u_i | T′, X_1, …, X_K] is called conditional mean independence; it means that there are no omitted covariates other than X_1, …, X_K.
• If such omitted covariates exist ⇒ E[u_i | T, X_1, …, X_K] ≠ E[u_i | T′, X_1, …, X_K].
⇒ If all covariates are included in the data, conditional independence holds.
e.g. The impacts of education on income
• e.g.) Suppose that income depends only on education, gender, and luck.
– Education and gender are correlated.
– Luck is not correlated with education and gender.
• If you estimate Income_i = β_0 + β_1 × Education_i + u_i,
where u_i = β_2 × Gender_i + Luck_i,
⇒ the OLS estimator has bias, because u includes the effects of gender.
• If you estimate Income_i = β_0 + β_1 × Education_i + β_2 × Gender_i + u_i,
where u_i = Luck_i,
⇒ the OLS estimator is unbiased, because u includes only the effects of luck.
e.g. The impacts of education on income
• Why can we relax the assumption about the error term?
• If you do not control for gender ⇒ you cannot distinguish the impact of education from the impact of gender, because education and gender co-move.
• If you control for gender ⇒ you can estimate the impact of education holding gender constant ⇒ you can distinguish the impact of education from the impact of gender.
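The gender example can be checked with a small simulation (all coefficient values hypothetical): omitting the correlated covariate biases the education coefficient upward, while controlling for it recovers the true effect:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20_000
gender = rng.integers(0, 2, n).astype(float)        # 0/1 indicator
educ = 10 + 2.0 * gender + rng.normal(size=n)       # education correlated with gender
luck = rng.normal(size=n)                           # uncorrelated with both
income = 5 + 1.0 * educ + 3.0 * gender + luck       # true education effect = 1.0

def ols(y, *regressors):
    """OLS with an intercept; returns the coefficient vector."""
    Z = np.column_stack([np.ones(len(y)), *regressors])
    return np.linalg.lstsq(Z, y, rcond=None)[0]

b_short = ols(income, educ)[1]            # gender omitted  -> biased upward
b_long = ols(income, educ, gender)[1]     # gender controlled -> unbiased
print(b_short, b_long)
```

The short regression loads part of the gender effect onto education; the long regression holds gender fixed and isolates the education effect.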
Multicollinearity
• If there exist perfect linear relationships between the explanatory variables, such as
X_{1,i} = a_0 + a_1 × X_{2,i}, where a_0 and a_1 are any values,
⇒ we cannot get the OLS estimators!
E.g.) The effects of birth year:
You would like to estimate the effect of birth year on income.
• Current age is correlated with birth year and may have an impact on income.
⇒ We should control for it, and then our population model is
Income_i = β_0 + β_1 × BirthYear_i + β_2 × Age_i + u_i.
• If you can use only cross-section data, we cannot get the OLS estimators
← in a single cross section, Age_i = SurveyYear − BirthYear_i
⇒ there exists a perfect linear relationship!
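The breakdown can be seen numerically (hypothetical data): in a single cross section, age is an exact linear function of birth year, so the design matrix loses full rank and the normal equations have no unique solution:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
birth_year = rng.integers(1950, 1990, n).astype(float)
age = 2015.0 - birth_year            # single cross section: exact linear relation

Z = np.column_stack([np.ones(n), birth_year, age])
rank = np.linalg.matrix_rank(Z)
print(rank)                          # 2, not 3: perfect multicollinearity
# Z'Z is singular here, so a unique OLS solution does not exist
```

With repeated cross sections, the survey year varies, the exact relation breaks, and the rank deficiency disappears.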
Multicollinearity in Practice
• If there exists multicollinearity between X_1 and X_2, you cannot distinguish the effect of X_1 from the effect of X_2.
• You must drop X_1 (or X_2), and the coefficient of the remaining variable must be interpreted as their combined effect.
e.g.) Birth year vs. age:
If you drop age, the estimated coefficient of birth year captures the effect of an increase in birth year and a decrease in age.
Multicollinearity in Practice
• If there exists multicollinearity between control variables ⇒ no serious problem.
← If you hold one variable constant by controlling for it, the other variable is automatically held constant as well.
• If there exists multicollinearity between explanatory variables ⇒ serious problem. You should change the set of regressors or the research design.
The variance of OLS estimators
• The variance of the OLS estimator β̂_1 is large if
– the sample size is small,
– the variation of X_1 holding the other regressors constant is small (if there is no such variation, as under perfect multicollinearity, the variance is infinite),
– the number of regressors is large.
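A Monte Carlo sketch of the second point (setup entirely hypothetical): the higher the correlation between two regressors, the less independent variation each has, and the larger the sampling variance of the coefficient:

```python
import numpy as np

rng = np.random.default_rng(4)

def beta1_sd(rho, n=200, reps=500):
    """Monte Carlo standard deviation of the OLS slope on X1 when corr(X1, X2) = rho."""
    estimates = []
    for _ in range(reps):
        X1 = rng.normal(size=n)
        X2 = rho * X1 + np.sqrt(1 - rho**2) * rng.normal(size=n)
        Y = 1.0 + X1 + X2 + rng.normal(size=n)
        Z = np.column_stack([np.ones(n), X1, X2])
        estimates.append(np.linalg.lstsq(Z, Y, rcond=None)[0][1])
    return np.std(estimates)

s_low, s_high = beta1_sd(0.0), beta1_sd(0.95)
print(s_low, s_high)   # the sd is several times larger when rho = 0.95
```

At rho = 0.95, only a small share of X1's variation is independent of X2, so the estimator becomes much noisier even though it remains unbiased.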
Question
• True/False questions. Suppose pure random sampling data.
1. If there exists correlation between the treatment and the control variables, we cannot get unbiased OLS estimators.
2. If all covariates can be controlled, you can get unbiased OLS estimators.
3. If you can use repeated cross-section data (e.g., the 1997 Bangladesh Household Survey), you can distinguish the effect of age from the effect of birth year.
4. To get an estimator of the causal effect of gender on income, you should control for years of education.
Bad control
• If a variable is determined by the explanatory variables, you should not control for it.
• The effect of T_i that works through changes in M_i should be interpreted as part of the causal effect.
• If you control for M_i, your OLS estimator is not equal to the causal effect.
T_i (↑) ⇒ M_i (↑ or ↓) ⇒ Y_i (↑); other factors also affect Y_i.
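A simulation of the chain T → M → Y (all numbers hypothetical): regressing Y on T alone recovers the total causal effect, while also controlling the intermediate variable M removes the part that works through M:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 20_000
T = rng.integers(0, 2, n).astype(float)        # treatment
M = 1.0 * T + rng.normal(size=n)               # M is determined by T (a "bad control")
Y = 2.0 * T + 1.0 * M + rng.normal(size=n)     # total causal effect of T: 2 + 1 = 3

def ols(y, *regressors):
    """OLS with an intercept; returns the coefficient vector."""
    Z = np.column_stack([np.ones(len(y)), *regressors])
    return np.linalg.lstsq(Z, y, rcond=None)[0]

b_total = ols(Y, T)[1]       # ≈ 3: the full causal effect
b_bad = ols(Y, T, M)[1]      # ≈ 2: controlling M strips out the indirect channel
print(b_total, b_bad)
```

Holding M fixed blocks the channel T → M → Y, so the coefficient on T no longer measures the total causal effect.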
Guideline: Control variable
• Which variables should we control?
⇒ Variables that have impacts on the explained variable.
• Which variables should we not control?
⇒ Variables that are determined by the explanatory variables.
• Actually, there exists a trade-off:
– The benefit of control variables ⇒ reducing omitted variable bias.
– The cost of control variables ⇒ increasing the variance of the estimators.
• If your sample size is large, you should control for many covariates.
• If your sample size is not large, you should not control variables which have only small impacts.
The condition of good data
• Including the explained and explanatory variables of interest.
• Having a large sample size.
• Including all important covariates.
Conclusion
• Controlling for covariates is a strong tool for getting unbiased estimators.
• A large number of regressors reduces the efficiency of estimation. ⇒ If the sample size is not large, you can control for only a few variables.
• You should pay attention to the multicollinearity problem.
Future issues
• If you cannot control all covariates ⇒ omitted variable bias still remains.
• What can we do? ⇒ This is the main issue of the last part of this class.