Econometrics I 2016 TA session


TA session note #3

Shouto Yonekura

May 6, 2016

1 Statistical Inference

In this section, we briefly explain what statistical inference is. First, we define a random sample.

Def3.1 Random sample

Let $X_1, \cdots, X_n$ be random variables. If $X_1, \cdots, X_n$ are independently and identically distributed (iid) with some probability distribution $P^X$, then we say that $X := (X_1, \cdots, X_n)$ is a random sample from the population $P^X$.

One example of $P^X$ is the normal distribution, whose density is

$$\frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right).$$

Quantities such as $\mu$ and $\sigma^2$, which characterize the distribution, are called parameters and denoted by $\theta$. Since understanding the parameter means understanding the population $P^X$, we want to learn it from the data, and this is the motivation for statistical inference. The set of all possible values that $\theta$ can take is called the parameter space and denoted by $\Theta$. In the case of the normal distribution, $\theta = (\mu, \sigma^2)$ and $\Theta = \mathbb{R} \times (0, \infty)$. The family of distributions $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$ is called the statistical model.

Let $\theta_0$ be the true value of $\theta$. Then the relation between $P^X$ and $\mathcal{P}$ is described as follows:

$$P^X = P_{\theta_0} \in \{P_\theta : \theta \in \Theta\},$$

that is, we assume that $P^X$ is contained in the statistical model $\mathcal{P}$. We usually call such a model a parametric model. Different values of $\theta$ yield different distributions. Thus the aim of statistical inference is to find, by moving $\theta$ over $\Theta$, the true value $\theta_0$ for which $P_{\theta_0} = P^X$.


A measurable function of the random variables is called a statistic, and in the context of a statistical inference problem we call it an estimator. We denote it by $\hat\theta$. For example, when inferring the mean, $T(x_1, \cdots, x_n) := n^{-1} \sum_i x_i$ is one estimator. OLS is one method of constructing estimators.
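To make the idea of an estimator as a function of the sample concrete, the following is a minimal sketch. The names `sample_mean`, `mu0` and `sigma0` and the numerical values are hypothetical choices for illustration, not part of the note; the population is assumed to be normal.

```python
import random


def sample_mean(xs):
    """Estimator T(x_1, ..., x_n) = n^{-1} * sum_i x_i."""
    return sum(xs) / len(xs)


random.seed(0)                    # fixed seed so the run is reproducible
mu0, sigma0 = 2.0, 1.5            # hypothetical true parameter values
n = 10_000                        # sample size

# A random sample: n iid draws from the population N(mu0, sigma0^2).
sample = [random.gauss(mu0, sigma0) for _ in range(n)]

# The estimate is the realized value of the estimator on this sample.
mu_hat = sample_mean(sample)
print(f"true mean = {mu0}, estimate = {mu_hat:.4f}")
```

The estimate differs from `mu0` from sample to sample, which is why the later sections study the distribution of an estimator rather than a single value.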

2 Quick review of matrix differentiation

Prop3.2

Let $a'\beta$ be a scalar. Then,

$$\frac{\partial a'\beta}{\partial \beta} = a.$$

Proof

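We only consider the two-dimensional case. Define $a = \begin{pmatrix} a \\ b \end{pmatrix}$ and $\beta = \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix}$. Then $a'\beta = a\beta_1 + b\beta_2$. Thus,

$$\frac{\partial a'\beta}{\partial \beta} = \begin{pmatrix} \dfrac{\partial a'\beta}{\partial \beta_1} \\[6pt] \dfrac{\partial a'\beta}{\partial \beta_2} \end{pmatrix} = \begin{pmatrix} \dfrac{\partial (a\beta_1 + b\beta_2)}{\partial \beta_1} \\[6pt] \dfrac{\partial (a\beta_1 + b\beta_2)}{\partial \beta_2} \end{pmatrix} = \begin{pmatrix} a \\ b \end{pmatrix} = a, \quad \text{Q.E.D.}$$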

Prop3.3

Let $\beta' a$ be a scalar. Then,

$$\frac{\partial \beta' a}{\partial \beta} = a.$$

Proof

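We only consider the two-dimensional case. Define $a = \begin{pmatrix} a \\ b \end{pmatrix}$ and $\beta = \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix}$. Then $\beta' a = \beta_1 a + \beta_2 b$. Thus,

$$\frac{\partial \beta' a}{\partial \beta} = \begin{pmatrix} \dfrac{\partial \beta' a}{\partial \beta_1} \\[6pt] \dfrac{\partial \beta' a}{\partial \beta_2} \end{pmatrix} = \begin{pmatrix} \dfrac{\partial (\beta_1 a + \beta_2 b)}{\partial \beta_1} \\[6pt] \dfrac{\partial (\beta_1 a + \beta_2 b)}{\partial \beta_2} \end{pmatrix} = \begin{pmatrix} a \\ b \end{pmatrix} = a, \quad \text{Q.E.D.}$$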


Prop3.4

Let $A$ be a matrix and let $\beta' A \beta$ be a scalar. Then,

$$\frac{\partial \beta' A \beta}{\partial \beta} = (A + A')\beta.$$

Proof

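We only consider the case where $\beta$ is two-dimensional and $A$ is a $2 \times 2$ matrix. Define $A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$ and $\beta = \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix}$. Then

$$\beta' A \beta = \begin{pmatrix} \beta_1 & \beta_2 \end{pmatrix} \begin{pmatrix} a & b \\ c & d \end{pmatrix} \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix} = a\beta_1^2 + b\beta_1\beta_2 + c\beta_1\beta_2 + d\beta_2^2.$$

Thus,

$$\begin{aligned}
\frac{\partial \beta' A \beta}{\partial \beta}
&= \begin{pmatrix} \dfrac{\partial \beta' A \beta}{\partial \beta_1} \\[6pt] \dfrac{\partial \beta' A \beta}{\partial \beta_2} \end{pmatrix}
= \begin{pmatrix} \dfrac{\partial (a\beta_1^2 + b\beta_1\beta_2 + c\beta_1\beta_2 + d\beta_2^2)}{\partial \beta_1} \\[6pt] \dfrac{\partial (a\beta_1^2 + b\beta_1\beta_2 + c\beta_1\beta_2 + d\beta_2^2)}{\partial \beta_2} \end{pmatrix} \\
&= \begin{pmatrix} 2a\beta_1 + b\beta_2 + c\beta_2 \\ b\beta_1 + c\beta_1 + 2d\beta_2 \end{pmatrix}
= \begin{pmatrix} 2a & b + c \\ b + c & 2d \end{pmatrix} \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix} \\
&= \left( \begin{pmatrix} a & b \\ c & d \end{pmatrix} + \begin{pmatrix} a & c \\ b & d \end{pmatrix} \right) \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix}
= (A + A')\beta, \quad \text{Q.E.D.}
\end{aligned}$$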
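The formula $\partial \beta' A \beta / \partial \beta = (A + A')\beta$ can be checked numerically. The sketch below compares the analytic gradient with a central finite-difference approximation; the matrix `A`, the point `beta` and the helper names are hypothetical choices for illustration.

```python
def quad_form(A, b):
    """Return the scalar b'Ab for a square matrix A (list of rows) and vector b."""
    n = len(b)
    return sum(b[i] * A[i][j] * b[j] for i in range(n) for j in range(n))


def numerical_gradient(f, b, h=1e-6):
    """Central finite-difference approximation of the gradient of f at b."""
    grad = []
    for i in range(len(b)):
        up = b[:i] + [b[i] + h] + b[i + 1:]
        down = b[:i] + [b[i] - h] + b[i + 1:]
        grad.append((f(up) - f(down)) / (2 * h))
    return grad


A = [[1.0, 2.0], [-3.0, 4.0]]   # arbitrary non-symmetric example matrix
beta = [0.5, -1.5]              # arbitrary evaluation point

# Analytic gradient: (A + A') beta
analytic = [sum((A[i][j] + A[j][i]) * beta[j] for j in range(2)) for i in range(2)]
numeric = numerical_gradient(lambda b: quad_form(A, b), beta)

print("analytic:", analytic)
print("numeric: ", numeric)
```

A non-symmetric `A` is used on purpose: with a symmetric matrix the check could not distinguish $(A + A')\beta$ from $2A\beta$.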

Prop3.5

Let $\beta' A \beta$ be a scalar where $A$ is a symmetric matrix (i.e., $A = A'$). Then,

$$\frac{\partial \beta' A \beta}{\partial \beta} = 2A\beta.$$

Proof

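This is obvious from the above result, since $A + A' = 2A$ when $A = A'$. Q.E.D.

Prop3.6

Let $a'\beta$ and $c'\beta$ be scalars. Then,

$$\frac{\partial (a'\beta + c'\beta)}{\partial \beta} = \frac{\partial a'\beta}{\partial \beta} + \frac{\partial c'\beta}{\partial \beta}.$$

Proof is omitted.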


3 OLS

Next, we explain Ordinary Least Squares (OLS). The OLS estimator (OLSE) is the most basic and crucial estimation procedure in statistics and econometrics. Thus, if you do not understand OLS well, you will not be able to follow the rest of this lecture. In the model, the variable of interest is called the dependent variable and denoted by $y \in \mathbb{R}^n$. The dependent variable may be related to several other variables, which are called the explanatory variables and denoted by $x_i \in \mathbb{R}^k$. Thus we can assume the following “linear” model with an error term:

$$y = \begin{pmatrix} x_1' \\ x_2' \\ \vdots \\ x_n' \end{pmatrix} \beta + u \iff y = X\beta + u,$$

where $\beta \in \mathbb{R}^k$ is the parameter and $u \in \mathbb{R}^n$ is the error term (a random variable). The model is a set of restrictions on the joint distribution of the dependent and explanatory variables. Linearity means that $y$ is linear in $\beta$, not necessarily in $x_i$, and that $u$ is additive. Thus the following model is also valid:

ln y = ln(X)β + u.

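The OLSE is defined as follows:

Def3.7 OLS estimator.

The OLS estimator of $\beta$, say $\hat\beta$, is given by:

$$\hat\beta := \arg\min_{\beta \in \Theta \subset \mathbb{R}^k} \|u\|^2 = \arg\min_{\beta \in \Theta \subset \mathbb{R}^k} \|y - X\beta\|^2,$$

where $\|\cdot\|$ is the Euclidean norm.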


Prop3.8

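Consider the following linear regression model:

$$\underset{(n \times 1)}{y} = \underset{(n \times k)}{X}\ \underset{(k \times 1)}{\beta} + \underset{(n \times 1)}{u}, \qquad u \overset{\text{iid}}{\sim} F(u)$$

$$\iff \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} x_1' \\ x_2' \\ \vdots \\ x_n' \end{pmatrix} \begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{pmatrix} + \begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{pmatrix}.$$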

Assume that the following conditions hold:

A1 All $x_i$ are constant.

A2 $\operatorname{rank} X = k$.

A3 $E[u_i] = 0 \ \ \forall i$.

A4 $E[u_i^2] = \sigma^2 < \infty \ \ \forall i$.

A5 $E[u_i u_j] = 0 \ \ \forall i \neq j$.

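Then, under these assumptions, the OLS estimator $\hat\beta$ of $\beta$ is given by:

$$\hat\beta = (X'X)^{-1}X'y$$

$$\iff \begin{pmatrix} \hat\beta_1 \\ \hat\beta_2 \\ \vdots \\ \hat\beta_k \end{pmatrix} = \left[ \begin{pmatrix} x_1 & x_2 & \cdots & x_n \end{pmatrix} \begin{pmatrix} x_1' \\ x_2' \\ \vdots \\ x_n' \end{pmatrix} \right]^{-1} \begin{pmatrix} x_1 & x_2 & \cdots & x_n \end{pmatrix} \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}.$$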

Remarks

A1 states that all $x_i$ are constant, that is, they are not random variables. However, in modern econometrics and statistics, we usually treat them as random variables. In that case, all of the theory discussed below is based on conditional expectations given $X$.

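A2 states that the $k$ columns of the data matrix (or design matrix) $X$ are linearly independent. This condition ensures that the inverse of $X'X$ exists. To see this, for an $m \times n$ matrix $A$ let $\mathrm{Im}(T_A) := \{Ax \,;\, x \in \mathbb{R}^n\}$ (the image of $T_A$) and $\mathrm{Ker}(T_A) := \{x \in \mathbb{R}^n \,;\, Ax = 0\}$ (the kernel of $T_A$). For $x \in \mathbb{R}^m$, since $A'x = 0 \Longrightarrow AA'x = 0$ and $AA'x = 0 \Longrightarrow x'AA'x = \|A'x\|^2 = 0 \iff A'x = 0$, we get $A'x = 0 \iff AA'x = 0$. This implies that $\mathrm{Ker}(T_{A'}) = \mathrm{Ker}(T_{AA'})$ and therefore $\mathrm{Im}(T_A) = \mathrm{Im}(T_{AA'})$ (the image of a matrix is the orthogonal complement of the kernel of its transpose, and $AA'$ is symmetric). Thus $\mathrm{rank}(A) = \dim(\mathrm{Im}(T_A)) = \dim(\mathrm{Im}(T_{AA'})) = \mathrm{rank}(AA')$. Taking $A = X'$ gives $\mathrm{rank}(X'X) = \mathrm{rank}(X) = k$, so the $k \times k$ matrix $X'X$ is invertible.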

A3 means the mean of the error term is zero.

A4 states that the variance is constant and finite for every observation. We call this assumption homoskedasticity. If homoskedasticity does not hold, we say that the errors are heteroskedastic, and this is a very tricky problem.

A5 requires that there is no correlation between the error terms of any two different observations. In the context of time-series models, A5 states that there is no serial correlation in the error term.


From A3 to A5, we can write $V[u] = \sigma^2 I_n$, where $I_n$ is the $n$-dimensional identity matrix.

Proof

By definition, $\|u\|^2 = u'u = \sum_{i=1}^n u_i^2$. Substituting $u = y - X\beta$ into this, we get

$$\begin{aligned}
\|u\|^2 &= \|y - X\beta\|^2 \\
&= (y - X\beta)'(y - X\beta) \\
&= y'y + \beta'X'X\beta - \beta'X'y - y'X\beta \\
&= y'y + \beta'X'X\beta - 2\beta'X'y \qquad ((y'X\beta)' = \beta'X'y).
\end{aligned}$$

Thus,

$$\begin{aligned}
\frac{\partial u'u}{\partial \beta} &= 0 + 2X'X\beta - 2X'y = 0 \qquad (X'X = (X'X)') \\
&\iff X'X\beta = X'y \qquad \text{(we call this equation the “normal equation”)} \\
&\iff \hat\beta = (X'X)^{-1}X'y \qquad \text{(A2)}, \quad \text{Q.E.D.}
\end{aligned}$$

Moreover, for any $\beta$, $\|y - X\beta\|^2 = \|y - X\hat\beta\|^2 + \|X\hat\beta - X\beta\|^2$ holds, and clearly $\beta = \hat\beta$ minimizes the L.H.S.
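As an illustration of the normal equation, the following sketch solves $X'X\beta = X'y$ by hand for $k = 2$ and checks the decomposition $\|y - X\beta\|^2 = \|y - X\hat\beta\|^2 + \|X\hat\beta - X\beta\|^2$ at one other value of $\beta$. The data and the function names are hypothetical, and the closed-form $2 \times 2$ inverse is used only to keep the example free of external libraries.

```python
def ols_two_regressors(X, y):
    """Solve the normal equation X'X b = X'y for an n x 2 design matrix X."""
    # Entries of the 2 x 2 matrix X'X and of the 2-vector X'y.
    s11 = sum(r[0] * r[0] for r in X)
    s12 = sum(r[0] * r[1] for r in X)
    s22 = sum(r[1] * r[1] for r in X)
    t1 = sum(r[0] * yi for r, yi in zip(X, y))
    t2 = sum(r[1] * yi for r, yi in zip(X, y))
    det = s11 * s22 - s12 * s12      # nonzero exactly when rank X = 2 (A2)
    if det == 0:
        raise ValueError("X'X is singular: A2 (rank X = k) fails")
    return [(s22 * t1 - s12 * t2) / det, (s11 * t2 - s12 * t1) / det]


def ssr(X, y, b):
    """Sum of squared residuals ||y - Xb||^2."""
    return sum((yi - r[0] * b[0] - r[1] * b[1]) ** 2 for r, yi in zip(X, y))


# Hypothetical data: an intercept column plus one regressor.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 4.0, 8.0]

beta_hat = ols_two_regressors(X, y)

# Check the decomposition at some other candidate value of beta.
other = [0.0, 2.0]
lhs = ssr(X, y, other)
fit_gap = sum(
    (r[0] * (beta_hat[0] - other[0]) + r[1] * (beta_hat[1] - other[1])) ** 2 for r in X
)
rhs = ssr(X, y, beta_hat) + fit_gap

print("beta_hat =", beta_hat)
print("lhs =", lhs, "rhs =", rhs)
```

Because `fit_gap` is nonnegative, any `other` gives a sum of squared residuals at least as large as the one at `beta_hat`.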

Prop 3.9

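$E[\hat\beta] = \beta$ and $V[\hat\beta] = \sigma^2 (X'X)^{-1}$.

Proof

By the definition of the OLSE, we get

$$\begin{aligned}
\hat\beta &= (X'X)^{-1}X'y \\
&= (X'X)^{-1}X'(X\beta + u) \\
&= \beta + (X'X)^{-1}X'u.
\end{aligned}$$

Taking the expectation, we can show that

$$\begin{aligned}
E[\hat\beta] = E[\beta + (X'X)^{-1}X'u] &= \beta + (X'X)^{-1}X'E[u] \qquad \text{(A1)} \\
&= \beta. \qquad \text{(A3)}
\end{aligned}$$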


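Next we derive the variance as follows:

$$\begin{aligned}
V[\hat\beta] &= E\left[(\hat\beta - E[\hat\beta])(\hat\beta - E[\hat\beta])'\right] \\
&= E\left[(\hat\beta - \beta)(\hat\beta - \beta)'\right] \\
&= E\left[((X'X)^{-1}X'u)((X'X)^{-1}X'u)'\right] \\
&= E\left[(X'X)^{-1}X'uu'X(X'X)^{-1}\right] \\
&= (X'X)^{-1}X'E[uu']X(X'X)^{-1} \qquad \text{(A1)} \\
&= (X'X)^{-1}X'(\sigma^2 I_n)X(X'X)^{-1} \qquad \text{(A3} \sim \text{A5)} \\
&= \sigma^2 (X'X)^{-1}, \quad \text{Q.E.D.}
\end{aligned}$$

Since the diagonal elements of $V[\hat\beta]$ are $V[\hat\beta_1], V[\hat\beta_2], \cdots, V[\hat\beta_k]$, their square roots $\sqrt{V[\hat\beta_j]}$ give the standard deviations of the components of $\hat\beta$, and we usually call them standard errors in the context of regression analysis.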
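Prop 3.9 can be illustrated by simulation. The sketch below uses the simplest possible setting, a single regressor without intercept ($k = 1$), so that $X'X = \sum_i x_i^2$ is a scalar; the regressor values, `beta`, `sigma` and the normal errors are hypothetical choices, and the Monte Carlo averages only approximate the exact moments.

```python
import random
from statistics import fmean, pvariance

random.seed(42)
x = [1.0, 2.0, 3.0, 4.0, 5.0]     # fixed regressor values (A1), hypothetical
beta, sigma = 2.0, 1.0            # hypothetical true slope and error s.d.
sxx = sum(v * v for v in x)       # X'X in the scalar case


def ols_slope(y):
    """beta_hat = (X'X)^{-1} X'y for a single regressor without intercept."""
    return sum(v * yi for v, yi in zip(x, y)) / sxx


estimates = []
for _ in range(20_000):
    u = [random.gauss(0.0, sigma) for _ in x]       # errors satisfying A3-A5
    y = [beta * v + ui for v, ui in zip(x, u)]
    estimates.append(ols_slope(y))

mean_hat = fmean(estimates)          # Monte Carlo estimate of E[beta_hat]
var_hat = pvariance(estimates)       # Monte Carlo estimate of V[beta_hat]
theory_var = sigma ** 2 / sxx        # sigma^2 (X'X)^{-1}
standard_error = theory_var ** 0.5   # square root of the variance

print(mean_hat, var_hat, theory_var, standard_error)
```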

Prop 3.10

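In addition to A1-A5, assume that the following condition holds:

A6 $u \sim N_n(0, \sigma^2 I_n)$.

Then, $\hat\beta \sim N_k(\beta, \sigma^2 (X'X)^{-1})$.

Proof

As we studied in the last class, the m.g.f. of the $n$-dimensional normal distribution $N_n(\mu, \Sigma)$ is given by $M_X(t) = \exp(t'\mu + \frac{1}{2} t'\Sigma t)$, where $t = (t_1, \cdots, t_n)'$. On the other hand, the m.g.f. of $\hat\beta$ can be calculated as follows, where now $t \in \mathbb{R}^k$:

$$\begin{aligned}
M_{\hat\beta}(t) &= E[\exp(t'\hat\beta)] \\
&= E\left[\exp(t'\beta + t'(X'X)^{-1}X'u)\right] \\
&= \exp(t'\beta)\, E[\exp(t'(X'X)^{-1}X'u)] \\
&= \exp(t'\beta)\, M_u(X(X'X)^{-1}t) \\
&= \exp(t'\beta) \exp\left(\frac{\sigma^2}{2}\, t'(X'X)^{-1}X'X(X'X)^{-1}t\right) \\
&= \exp(t'\beta) \exp\left(\frac{\sigma^2}{2}\, t'(X'X)^{-1}t\right) \\
&= \exp\left(t'\beta + \frac{\sigma^2}{2}\, t'(X'X)^{-1}t\right).
\end{aligned}$$

Thus, $\hat\beta \sim N_k(\beta, \sigma^2 (X'X)^{-1})$. Q.E.D.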

4 Unbiasedness

We now consider what kind of estimators are good for estimating unknown parameters. Since the meaning of “good” is still vague, we first have to define what is “desirable”. The Mean Squared Error (MSE) is one such criterion; it measures the “distance” from an estimator to the parameter and is defined as follows:


Def3.11 Mean Squared Error (MSE)

$$MSE(\hat\theta, \theta) := E[\,|\hat\theta - \theta|^2\,],$$

where $\theta$ is an $n$-dimensional parameter and $\hat\theta$ is an $n$-dimensional estimator of it.

For simplicity, let $\hat\theta : \Omega \to \mathbb{R}$ and $\theta \in \Theta \subset \mathbb{R}$. In order to understand what the MSE means, we rewrite it as follows:

$$\begin{aligned}
E[(\hat\theta - \theta)^2] &= E[(\hat\theta - E[\hat\theta] + E[\hat\theta] - \theta)^2] \\
&= E[(\hat\theta - E[\hat\theta])^2 + 2(\hat\theta - E[\hat\theta])(E[\hat\theta] - \theta) + (E[\hat\theta] - \theta)^2] \\
&= V[\hat\theta] + 2(E[\hat\theta] - E[\hat\theta])(E[\hat\theta] - \theta) + (E[\hat\theta] - \theta)^2 \\
&= V[\hat\theta] + (E[\hat\theta] - \theta)^2 \\
&= V[\hat\theta] + b(\theta)^2,
\end{aligned}$$

where we define $b(\theta) := E[\hat\theta] - \theta$ and call it the bias. This result means that we can decompose the MSE into two terms, the variance of $\hat\theta$ and the squared bias. Obviously, minimizing the MSE would be a natural criterion for a desirable estimator; however, there is no estimator which minimizes the MSE uniformly over all parameter values. That is, only God can minimize both $V[\hat\theta]$ and the bias. Therefore, instead of seeking the best estimator among arbitrary estimators, we have to restrict our estimators to some class which has certain properties. Unbiasedness is one of them and is defined as follows:

Def3.12 Unbiasedness

If for any $\theta \in \Theta$

$$E[\hat\theta] = \theta$$

holds, then we say that $\hat\theta$ is an unbiased estimator of $\theta$.

Notice that if $\hat\theta$ is unbiased, then $b(\theta) = 0$ and therefore $MSE(\hat\theta, \theta) = V[\hat\theta]$.

Example

Let $\{X_n\}$ be iid random variables with mean $\mu$ and variance $\sigma^2$. Then the sample mean $\bar X := n^{-1} \sum_{i=1}^n X_i$ is an unbiased estimator of the population mean $\mu$. We can check this as follows:

$$E\left[\frac{1}{n}\sum_i X_i\right] = \frac{1}{n}\sum_i E[X_i] = \frac{n\mu}{n} = \mu.$$
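In contrast to the unbiased sample mean, a shrunken mean $c\bar X$ with $c \neq 1$ is biased, which makes it a convenient case for checking the decomposition MSE = variance + bias$^2$ numerically. The sketch below is a simulation under assumed values: `theta`, `c`, the normal population and the number of replications are all hypothetical. For the simulated draws the decomposition holds exactly (up to rounding) when the population-style variance `pvariance` is used.

```python
import random
from statistics import fmean, pvariance

random.seed(1)
theta = 3.0            # hypothetical true parameter (a population mean)
n, reps = 20, 5000     # sample size and number of simulated samples
c = 0.8                # hypothetical shrinkage factor; makes the estimator biased

estimates = []
for _ in range(reps):
    sample = [random.gauss(theta, 2.0) for _ in range(n)]
    estimates.append(c * fmean(sample))     # biased estimator c * sample mean

mse = fmean([(t - theta) ** 2 for t in estimates])   # simulated MSE
variance = pvariance(estimates)                      # simulated variance
bias = fmean(estimates) - theta                      # simulated bias, about (c - 1) * theta

print(mse, variance + bias ** 2)
```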

Example

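Let $\{X_n\}$ be iid random variables with mean $\mu$ and variance $\sigma^2$. Then the sample variance $T_n := \frac{1}{n}\sum_i (X_i - \bar X_n)^2$ is not an unbiased estimator of the population variance $\sigma^2$. We can check this as follows:

$$\begin{aligned}
E\left[\frac{1}{n}\sum_i (X_i - \bar X_n)^2\right]
&= \frac{1}{n} E\left[\sum_i \{(X_i - E[\bar X_n]) - (\bar X_n - E[\bar X_n])\}^2\right] \\
&= \frac{1}{n}\left\{\sum_i E[(X_i - \mu)^2] - E\left[2(\bar X_n - E[\bar X_n])\sum_i (X_i - E[\bar X_n])\right] + \sum_i E[(\bar X_n - E[\bar X_n])^2]\right\} \\
&= \frac{1}{n}\left\{\sum_i V[X_i] - E\left[2n(\bar X_n - E[\bar X_n])\frac{1}{n}\sum_i (X_i - E[\bar X_n])\right] + \sum_i V[\bar X_n]\right\} \\
&= \frac{1}{n}\left(n\sigma^2 - nV[\bar X_n]\right) \\
&= \frac{\sigma^2 (n - 1)}{n} \neq \sigma^2.
\end{aligned}$$

Here the middle term equals $2nV[\bar X_n]$, and the last step uses $V[\bar X_n] = \sigma^2 / n$.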

From this result, we can see that an unbiased estimator of the population variance $\sigma^2$ is $s^2 := \frac{1}{n-1}\sum_i (X_i - \bar X_n)^2$, and we call it the unbiased sample variance to emphasize its unbiasedness.
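The two expectations can be verified exactly in a small discrete case. The sketch below assumes a fair coin ($X_i \in \{0, 1\}$ with probability $1/2$ each, so $\sigma^2 = 1/4$) and $n = 3$, enumerates all $2^3$ equally likely samples, and averages both estimators using exact fractions. The distribution and the function names are hypothetical choices for illustration.

```python
from fractions import Fraction
from itertools import product


def biased_var(xs):
    """T_n = (1/n) * sum_i (x_i - xbar)^2."""
    n = len(xs)
    xbar = Fraction(sum(xs), n)
    return sum((x - xbar) ** 2 for x in xs) / n


def unbiased_var(xs):
    """s^2 = (1/(n-1)) * sum_i (x_i - xbar)^2."""
    n = len(xs)
    xbar = Fraction(sum(xs), n)
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)


support = (0, 1)                  # fair coin: mean 1/2, variance 1/4
n = 3
sigma2 = Fraction(1, 4)
samples = list(product(support, repeat=n))   # all 2^3 equally likely samples

expected_T = sum(biased_var(s) for s in samples) / len(samples)
expected_s2 = sum(unbiased_var(s) for s in samples) / len(samples)

print(expected_T, sigma2 * (n - 1) / n)   # both equal sigma^2 (n - 1) / n
print(expected_s2, sigma2)                # both equal sigma^2
```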

Next, we consider the OLSE case.

Prop3.13

The OLSE $\hat\beta$ is unbiased, that is:

$$E[\hat\beta] = \beta.$$

Proof

By the above result, the OLSE is given by:

$$\hat\beta = (X'X)^{-1}X'y = (X'X)^{-1}X'(X\beta + u).$$

Taking the expectation, as seen above, we get

$$\begin{aligned}
E[\hat\beta] &= E[(X'X)^{-1}X'(X\beta + u)] \\
&= (X'X)^{-1}X'X\beta + (X'X)^{-1}X'E[u] \\
&= \beta, \quad \text{Q.E.D.}
\end{aligned}$$

Prop3.14

Let $e := y - X\hat\beta$. Then

$$s^2 := \frac{e'e}{n - k}$$

is an unbiased estimator of $\sigma^2$. That is, $E[s^2] = \sigma^2$.

Proof

By definition, we can rewrite $e$ as $e = y - X\hat\beta = y - X(X'X)^{-1}X'y = [I_n - X(X'X)^{-1}X']y$. Next we define:

$$M_X := I_n - X(X'X)^{-1}X'.$$


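This matrix $M_X$ has the following properties:

$M_X = M_X'$ (symmetric)

$M_X M_X = M_X$ (idempotent)

$M_X X = 0$ (the $n \times k$ zero matrix).

Using this matrix, we can rewrite $e$ as $e = M_X y = M_X (X\beta + u) = M_X u$. Thus we can also rewrite $\frac{e'e}{n-k}$ as follows:

$$s^2 = \frac{e'e}{n - k} = \frac{u' M_X' M_X u}{n - k} = \frac{u' M_X u}{n - k}.$$

Taking the expectation of $u' M_X u$, we get

$$\begin{aligned}
E[u' M_X u] &= \sum_i \sum_j m_{ij} E[u_i u_j] \\
&= \sum_i m_{ii} \sigma^2 \qquad (E[u_i u_j] = 0 \ \ \forall i \neq j) \\
&= \sigma^2 \sum_i m_{ii} \\
&= \sigma^2 \operatorname{trace} M_X.
\end{aligned}$$

$\operatorname{trace} M_X$ can be calculated as follows:

$$\begin{aligned}
\operatorname{trace} M_X &= \operatorname{trace}(I_n - X(X'X)^{-1}X') \\
&= n - \operatorname{trace}(X(X'X)^{-1}X') \\
&= n - \operatorname{trace}((X'X)^{-1}X'X) \qquad (\operatorname{trace}(AB) = \operatorname{trace}(BA)) \\
&= n - \operatorname{trace}(I_k) \\
&= n - k.
\end{aligned}$$

Therefore,

$$E[s^2] = E\left[\frac{u' M_X u}{n - k}\right] = \frac{\sigma^2 (n - k)}{n - k} = \sigma^2, \quad \text{Q.E.D.}$$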
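The properties of $M_X$ used in this proof can be checked exactly on a small design matrix. The sketch below assumes a hypothetical $4 \times 2$ matrix `X` (an intercept and one regressor) and uses exact fractions with hand-written matrix helpers, so no external library is needed; the helper names are illustrative only.

```python
from fractions import Fraction


def matmul(A, B):
    """Matrix product of A and B, both given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def inverse_2x2(S):
    """Inverse of a 2 x 2 matrix via the adjugate formula."""
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    return [[S[1][1] / det, -S[0][1] / det], [-S[1][0] / det, S[0][0] / det]]


# Hypothetical design matrix: intercept column plus one regressor (n = 4, k = 2).
X = [[Fraction(1), Fraction(v)] for v in (0, 1, 2, 3)]
n, k = len(X), len(X[0])
Xt = transpose(X)

P = matmul(matmul(X, inverse_2x2(matmul(Xt, X))), Xt)      # X (X'X)^{-1} X'
M = [[(1 if i == j else 0) - P[i][j] for j in range(n)] for i in range(n)]

trace_M = sum(M[i][i] for i in range(n))
print("trace M_X =", trace_M, " n - k =", n - k)
```

The same `M` can be used to confirm symmetry (`M == transpose(M)`), idempotency (`matmul(M, M) == M`) and $M_X X = 0$.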

5 Some unpleasant properties of the unbiased estimator

Although unbiasedness is one of the “nice” properties seen above, modern statistics often does not regard it as a necessary property of an estimator. To understand this, we present two unpleasant properties of unbiased estimators.

1

Let $\hat\theta$ be an unbiased estimator of $\theta$ and $g$ be some function. The question is whether $g(\hat\theta)$ is always an unbiased estimator of $g(\theta)$. The answer is “NO”. Suppose that $s^2$ is the unbiased sample variance, that is, $s^2 = \sum_i (X_i - \bar X)^2 / (n - 1)$. Now we consider an unbiased estimator of $\sigma$ (the standard deviation). If $g(\hat\theta)$ were always an unbiased estimator of $g(\theta)$, you might think that $s$ is an unbiased estimator of $\sigma$. However, we can show that

$$\sigma = \sqrt{\sigma^2} = \sqrt{E[s^2]} > E\left[\sqrt{s^2}\right] = E[s].$$

In the middle step, we use Jensen’s inequality for concave functions. This result means that $\sigma > E[s]$, and of course $s$ is not an unbiased estimator of $\sigma$.
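The inequality $\sigma > E[s]$ can be seen exactly in a small discrete case. The sketch below again assumes a fair coin ($\sigma = 1/2$) and $n = 3$, and averages $s$ over all $2^3$ equally likely samples; the distribution and the names are hypothetical choices for illustration.

```python
from itertools import product
from math import sqrt


def unbiased_sd(xs):
    """s = sqrt( sum_i (x_i - xbar)^2 / (n - 1) )."""
    n = len(xs)
    xbar = sum(xs) / n
    return sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))


support = (0, 1)       # fair coin, so sigma^2 = 1/4 and sigma = 1/2
n = 3
sigma = 0.5
samples = list(product(support, repeat=n))    # all 2^3 equally likely samples

# Exact expectation of s: average over the equally likely samples.
expected_s = sum(unbiased_sd(s) for s in samples) / len(samples)

print("E[s] =", expected_s, " sigma =", sigma)
```

Here $s$ takes the value $0$ on the two constant samples and $\sqrt{1/3}$ on the other six, so $E[s] = \frac{3}{4}\sqrt{1/3}$, which is strictly below $\sigma = 1/2$.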

2

If we insist on using unbiased estimators for statistical inference problems, we sometimes have to use a “ridiculous” estimator. Suppose that the random variables $\{X_n\}$ are normally distributed with mean $\mu$ and variance $\sigma^2$. Then the following estimator is an unbiased estimator of $\sigma$ (proof is omitted):

$$\sigma' := \sqrt{\frac{n}{2}}\ \frac{\Gamma\left(\frac{n-1}{2}\right)}{\Gamma\left(\frac{n}{2}\right)} \sqrt{\frac{1}{n}\sum_i (x_i - \bar x)^2},$$

where $\Gamma(s) := \int_0^\infty x^{s-1} e^{-x}\,dx$ and $\bar x$ is the sample mean. Therefore, if we insist on using unbiased estimators for statistical inference, we use $s^2$ for $\sigma^2$ and $\sigma'$ for $\sigma$. However, this is clearly unnatural, since $\sigma'$ is not the square root of $s^2$. Hence, in this case, it is unreasonable to strictly adopt unbiasedness as the criterion of inference.
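The estimator $\sigma'$ can be computed with the standard-library gamma function. The sketch below compares the simulated means of $\sigma'$ and of $s = \sqrt{s^2}$ under normal sampling; `mu`, `sigma`, `n`, the seed and the number of replications are hypothetical choices, and the Monte Carlo averages only approximate the exact expectations.

```python
import random
from math import gamma, sqrt
from statistics import fmean


def sigma_prime(xs):
    """sqrt(n/2) * Gamma((n-1)/2) / Gamma(n/2) * sqrt((1/n) * sum_i (x_i - xbar)^2)."""
    n = len(xs)
    xbar = fmean(xs)
    correction = sqrt(n / 2) * gamma((n - 1) / 2) / gamma(n / 2)
    return correction * sqrt(sum((x - xbar) ** 2 for x in xs) / n)


def s(xs):
    """Square root of the unbiased sample variance s^2."""
    n = len(xs)
    xbar = fmean(xs)
    return sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))


random.seed(7)
mu, sigma, n = 0.0, 2.0, 5        # hypothetical normal population and sample size
draws = [[random.gauss(mu, sigma) for _ in range(n)] for _ in range(20_000)]

mean_sigma_prime = fmean([sigma_prime(d) for d in draws])   # close to sigma
mean_s = fmean([s(d) for d in draws])                       # noticeably below sigma

print(mean_sigma_prime, mean_s, sigma)
```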

6 Gauss-Markov theorem

Thm 3.15

Under assumptions A1-A5, the OLSE $\hat\beta$ of $\beta$ is efficient in the class of linear unbiased estimators. That is, for any linear unbiased estimator $b$ of $\beta$,

$$V[b] \geq V[\hat\beta]$$

holds in the matrix sense, that is, $V[b] - V[\hat\beta]$ is positive semidefinite.

Proof

Since $b$ is linear in $y$ by assumption, we can write $b = Cy$ using some $k \times n$ matrix $C$. To prove $V[b] \geq V[\hat\beta]$, we have to show that

$$a'V[b]a \geq a'V[\hat\beta]a$$

for any $a \in \mathbb{R}^k$. First we fix an arbitrary $a \in \mathbb{R}^k$. Then we get

$$\begin{aligned}
a'V[b]a &= a'V[Cy]a \\
&= (a'C)V[y](C'a) \\
&= (a'C)\sigma^2 I_n (C'a) \\
&= \sigma^2 a'CC'a.
\end{aligned}$$

On the other hand, $a'V[\hat\beta]a$ is given by:

$$a'V[\hat\beta]a = a'\sigma^2 (X'X)^{-1} a = \sigma^2 a'(X'X)^{-1}a.$$

Since $b$ is unbiased, we get

$$\beta = E[b] = E[Cy] = CE[y] = (CX)\beta,$$


and, since this must hold for every $\beta$, this implies that $CX = I_k$. From these results, we can see that

$$\begin{aligned}
a'CC'a - a'(X'X)^{-1}a &= a'CC'a - a'CX(X'X)^{-1}X'C'a \\
&= a'C(I_n - X(X'X)^{-1}X')C'a.
\end{aligned}$$

Next, we define $M_X := I_n - X(X'X)^{-1}X'$. As we have seen above, this matrix is symmetric and idempotent. Since all eigenvalues of a symmetric and idempotent matrix are 0 or 1 (see Appendix), $M_X$ is positive semidefinite. Thus, we can show that

$$\begin{aligned}
a'C(I_n - X(X'X)^{-1}X')C'a &= a'CM_XC'a \\
&= (C'a)'M_X(C'a) \geq 0,
\end{aligned}$$

and this implies that $a'CC'a - a'(X'X)^{-1}a \geq 0$. From these results, we finally get $a'V[b]a \geq a'V[\hat\beta]a$. Q.E.D.
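The theorem can be illustrated exactly in the scalar case $k = 1$ (a single regressor, no intercept), where $V[Cy] = \sigma^2 CC'$ is a number. The sketch below compares the OLS weights $C = (X'X)^{-1}X'$ with one alternative linear unbiased estimator, $b = \sum_i y_i / \sum_i x_i$; the regressor values, `sigma2` and the choice of alternative are hypothetical, and this is an example of the inequality, not a proof.

```python
from fractions import Fraction

x = [Fraction(v) for v in (1, 2, 3, 4, 5)]   # hypothetical fixed regressor (A1), n = 5, k = 1
sigma2 = Fraction(1)                         # hypothetical error variance

sxx = sum(v * v for v in x)                  # X'X
sx = sum(x)

C_ols = [v / sxx for v in x]                 # OLS weights: (X'X)^{-1} X'
C_alt = [Fraction(1) / sx for _ in x]        # alternative: b = sum(y) / sum(x)


def times_X(C):
    """CX; unbiasedness of b = Cy requires CX = I_k, which is 1 when k = 1."""
    return sum(c * v for c, v in zip(C, x))


def variance(C):
    """V[Cy] = sigma^2 C C' under A3-A5."""
    return sigma2 * sum(c * c for c in C)


var_ols = variance(C_ols)    # sigma^2 / sum(x_i^2)
var_alt = variance(C_alt)    # sigma^2 * n / (sum(x_i))^2

print("CX:", times_X(C_ols), times_X(C_alt))
print("V[OLS] =", var_ols, " V[alternative] =", var_alt)
```

Both weight vectors satisfy $CX = 1$, so both estimators are linear and unbiased, and the OLS variance is the smaller of the two.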

Remarks

1 The Gauss-Markov theorem says that the OLSE is efficient in the sense that its variance matrix $\sigma^2 (X'X)^{-1}$ is the smallest among linear unbiased estimators, as we have seen above. For this reason the OLSE is called the Best Linear Unbiased Estimator (BLUE).

2 It should be emphasized that assumption A3 is critical for proving this theorem and that we do not need A6 to hold.

3 In addition to assumptions A1-A5, if we assume that A6 also holds, then the variance matrix of the OLSE is the smallest among all unbiased estimators, since the OLSE is a function of the complete sufficient statistic (see Yong and Smith (2010)). This result follows from the Lehmann-Scheffé theorem.

4 As we have seen above, the MSE can be decomposed into a variance term and a squared bias term. Thus, under assumptions A1-A5, the OLSE attains the smallest MSE among linear unbiased estimators.

Appendix

Prop3.16

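A symmetric matrix $A$ is idempotent if and only if all eigenvalues of $A$ are 0 or 1.

Proof

($\Longrightarrow$) If $Ap_i = \lambda_i p_i$, then $A^2 p_i = \lambda_i A p_i = \lambda_i^2 p_i$, where $p_i \in \mathbb{R}^m$ is an eigenvector and $\lambda_i$ is the corresponding eigenvalue. Since $A$ is idempotent, $\lambda_i^2 p_i = A^2 p_i = A p_i = \lambda_i p_i$. Since $p_i \neq 0$, this gives $\lambda_i^2 = \lambda_i$, which implies that $\lambda_i$ is 1 or 0 since $\lambda_i$ is a real number.

($\Longleftarrow$) Let $\lambda_1 = \cdots = \lambda_k = 1$ and $\lambda_{k+1} = \cdots = \lambda_m = 0$. Then, by the spectral decomposition of the symmetric matrix $A$, $A = p_1 p_1' + \cdots + p_k p_k'$, where $p_i$ is the eigenvector which corresponds to $\lambda_i$ and $\|p_i\| = 1$. Since $p_i' p_i = 1$ and $p_i' p_j = 0$ $(i \neq j)$, we get

$$A^2 = (p_1 p_1' + \cdots + p_k p_k')(p_1 p_1' + \cdots + p_k p_k') = p_1 p_1' + \cdots + p_k p_k' = A. \quad \text{Q.E.D.}$$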
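As a small numerical illustration of Prop 3.16, the sketch below builds the orthogonal projection onto a single vector, which is symmetric and idempotent, and computes its two eigenvalues from the characteristic polynomial of a $2 \times 2$ matrix. The vector `v` is a hypothetical choice, and floating-point results agree with 0 and 1 only up to rounding error.

```python
from math import sqrt

v = (3.0, 4.0)                         # hypothetical direction vector
vv = v[0] ** 2 + v[1] ** 2             # v'v

# P = v v' / (v'v): the orthogonal projection onto span{v}.
P = [[v[i] * v[j] / vv for j in range(2)] for i in range(2)]

# P P, to be compared with P (idempotency).
PP = [[sum(P[i][l] * P[l][j] for l in range(2)) for j in range(2)] for i in range(2)]

# Eigenvalues of a 2 x 2 matrix from lambda^2 - trace * lambda + det = 0.
trace = P[0][0] + P[1][1]
det = P[0][0] * P[1][1] - P[0][1] * P[1][0]
disc = sqrt(max(trace ** 2 - 4 * det, 0.0))   # guard against tiny negative rounding
eigenvalues = sorted([(trace - disc) / 2, (trace + disc) / 2])

print("P =", P)
print("eigenvalues =", eigenvalues)
```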
