
Econometrics I 2016 TA session


TA session note #3

Shouto Yonekura

May 6, 2016

Contents

1 Statistical Inference

2 Quick review of matrix differentiation

3 OLS

4 Unbiasedness

5 Some unpleasant properties of the unbiased estimator

6 Gauss-Markov theorem

1 Statistical Inference

In this section, we briefly explain what statistical inference is. First, we define a random sample.

Def3.1 Random sample

Let $X_1, \dots, X_n$ be random variables. If $X_1, \dots, X_n$ are independently and identically distributed (iid) with some probability distribution $P_X$, then we say that $X := (X_1, \dots, X_n)$ is a random sample from the population $P_X$.

One example of $P_X$ is the normal distribution, whose density is

\[ f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right). \]

Like $\mu$ and $\sigma^2$, a quantity that characterizes the distribution is called a parameter, and we denote it by $\theta$. Since understanding the parameter means understanding the population $P_X$, we want to learn it from the data; this is the motivation for statistical inference. The set of all possible values that $\theta$ can take is called the parameter space, denoted $\Theta$. In the case of the normal distribution, $\Theta = \mathbb{R} \times (0, \infty)$. The family of distributions $\mathcal{P} = \{P_\theta : \theta \in \Theta\}$ is called the statistical model.

Let $\theta_0$ be the true value of $\theta$. Then the relation between $P_X$ and $\mathcal{P}$ is

\[ P_X = P_{\theta_0} \in \{P_\theta : \theta \in \Theta\}, \]

that is, we assume that $P_X$ is contained in the statistical model $\mathcal{P}$. We usually call such a model a parametric model. Different values of $\theta$ yield different distributions. Thus the aim of statistical inference is to find, by moving $\theta$ over $\Theta$, the true value $\theta_0$ for which $P_{\theta_0} = P_X$.


A measurable function of the random variables is called a statistic, and in a statistical inference problem we call it an estimator, denoted $\hat\theta$. For example, when estimating the mean, $T(x_1, \dots, x_n) := n^{-1}\sum_i x_i$ is one estimator. OLS is one method for constructing estimators.

2 Quick review of matrix differentiation

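Prop3.2

Let $a'\beta$ be a scalar, where $a$ and $\beta$ are column vectors of the same dimension. Then,

\[ \frac{\partial a'\beta}{\partial \beta} = a. \]

Proof

We only consider the two-dimensional case. Define $a = (a_1, a_2)'$ and $\beta = (\beta_1, \beta_2)'$. Then $a'\beta = a_1\beta_1 + a_2\beta_2$. Thus,

\[ \frac{\partial a'\beta}{\partial \beta}
= \begin{pmatrix} \dfrac{\partial a'\beta}{\partial \beta_1} \\[2mm] \dfrac{\partial a'\beta}{\partial \beta_2} \end{pmatrix}
= \begin{pmatrix} \dfrac{\partial (a_1\beta_1 + a_2\beta_2)}{\partial \beta_1} \\[2mm] \dfrac{\partial (a_1\beta_1 + a_2\beta_2)}{\partial \beta_2} \end{pmatrix}
= \begin{pmatrix} a_1 \\ a_2 \end{pmatrix}
= a. \]

Q.E.D.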

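Prop3.3

Let $\beta'a$ be a scalar. Then,

\[ \frac{\partial \beta'a}{\partial \beta} = a. \]

Proof

We only consider the two-dimensional case. Define $a = (a_1, a_2)'$ and $\beta = (\beta_1, \beta_2)'$. Then $\beta'a = \beta_1 a_1 + \beta_2 a_2$. Thus,

\[ \frac{\partial \beta'a}{\partial \beta}
= \begin{pmatrix} \dfrac{\partial \beta'a}{\partial \beta_1} \\[2mm] \dfrac{\partial \beta'a}{\partial \beta_2} \end{pmatrix}
= \begin{pmatrix} \dfrac{\partial (\beta_1 a_1 + \beta_2 a_2)}{\partial \beta_1} \\[2mm] \dfrac{\partial (\beta_1 a_1 + \beta_2 a_2)}{\partial \beta_2} \end{pmatrix}
= \begin{pmatrix} a_1 \\ a_2 \end{pmatrix}
= a. \]

Q.E.D.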


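Prop3.4

Let $A$ be a square matrix and let $\beta'A\beta$ be a scalar. Then,

\[ \frac{\partial \beta'A\beta}{\partial \beta} = (A + A')\beta. \]

Proof

We only consider the case where $\beta$ is two-dimensional and $A$ is a $2 \times 2$ matrix. Define

\[ A = \begin{pmatrix} a & b \\ c & d \end{pmatrix}, \qquad \beta = \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix}. \]

Then,

\[ \beta'A\beta = (\beta_1 \ \ \beta_2) \begin{pmatrix} a & b \\ c & d \end{pmatrix} \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix} = a\beta_1^2 + b\beta_1\beta_2 + c\beta_1\beta_2 + d\beta_2^2. \]

Thus,

\[ \begin{aligned}
\frac{\partial \beta'A\beta}{\partial \beta}
&= \begin{pmatrix} \dfrac{\partial \beta'A\beta}{\partial \beta_1} \\[2mm] \dfrac{\partial \beta'A\beta}{\partial \beta_2} \end{pmatrix}
= \begin{pmatrix} \dfrac{\partial (a\beta_1^2 + b\beta_1\beta_2 + c\beta_1\beta_2 + d\beta_2^2)}{\partial \beta_1} \\[2mm] \dfrac{\partial (a\beta_1^2 + b\beta_1\beta_2 + c\beta_1\beta_2 + d\beta_2^2)}{\partial \beta_2} \end{pmatrix} \\
&= \begin{pmatrix} 2a\beta_1 + b\beta_2 + c\beta_2 \\ b\beta_1 + c\beta_1 + 2d\beta_2 \end{pmatrix}
= \begin{pmatrix} 2a & b + c \\ b + c & 2d \end{pmatrix} \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix} \\
&= \left\{ \begin{pmatrix} a & b \\ c & d \end{pmatrix} + \begin{pmatrix} a & c \\ b & d \end{pmatrix} \right\} \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix}
= (A + A')\beta.
\end{aligned} \]

Q.E.D.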
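As a quick numerical sanity check of Prop3.4, the sketch below compares the formula $(A + A')\beta$ with central finite differences of $\beta'A\beta$. The matrix `A` and the point `beta` are arbitrary illustrative values, not taken from the notes, and the helper names are hypothetical.

```python
# Finite-difference check of d(b'Ab)/db = (A + A')b for a 2x2 example.
# A and beta are arbitrary illustrative values (assumptions, not from the notes).

def quad_form(A, b):
    """Return the scalar b'Ab for a square matrix A stored as a list of rows."""
    n = len(b)
    return sum(b[i] * A[i][j] * b[j] for i in range(n) for j in range(n))

def analytic_gradient(A, b):
    """Return (A + A')b, the gradient stated in Prop3.4."""
    n = len(b)
    return [sum((A[i][j] + A[j][i]) * b[j] for j in range(n)) for i in range(n)]

def numeric_gradient(A, b, h=1e-6):
    """Approximate the gradient of b'Ab by central differences, one coordinate at a time."""
    grad = []
    for i in range(len(b)):
        up = list(b)
        up[i] += h
        down = list(b)
        down[i] -= h
        grad.append((quad_form(A, up) - quad_form(A, down)) / (2 * h))
    return grad

A = [[1.0, 2.0], [3.0, 4.0]]  # deliberately non-symmetric
beta = [0.5, -1.5]

grad_exact = analytic_gradient(A, beta)
grad_fd = numeric_gradient(A, beta)
```

Because `A` is not symmetric here, the check distinguishes $(A + A')\beta$ from $2A\beta$, which only coincide in the symmetric case.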

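Prop3.5

Let $\beta'A\beta$ be a scalar where $A$ is a symmetric matrix (i.e., $A = A'$). Then,

\[ \frac{\partial \beta'A\beta}{\partial \beta} = 2A\beta. \]

Proof

This follows immediately from Prop3.4, since $A + A' = 2A$. Q.E.D.

Prop3.6

Let $a'\beta$ and $c'\beta$ be scalars. Then,

\[ \frac{\partial (a'\beta + c'\beta)}{\partial \beta} = \frac{\partial a'\beta}{\partial \beta} + \frac{\partial c'\beta}{\partial \beta}. \]

Proof is omitted.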


3 OLS

Next, we explain Ordinary Least Squares (OLS). The OLS estimator (OLSE) is the most basic and important estimation procedure in statistics and econometrics, so the rest of this lecture will be hard to follow without a good understanding of it. In the model, the variable of interest is called the dependent variable and is denoted $y \in \mathbb{R}^n$. The dependent variable may be related to several other variables, which are called the explanatory variables; for observation $i$ they are collected in $x_i \in \mathbb{R}^k$. We thus assume the following “linear” model with an error term:

\[ y = \begin{pmatrix} x_1' \\ x_2' \\ \vdots \\ x_n' \end{pmatrix} \beta + u \iff y = X\beta + u, \]

where $\beta \in \mathbb{R}^k$ is the parameter and $u \in \mathbb{R}^n$ is the error term (a random variable). The model is a set of restrictions on the joint distribution of the dependent and explanatory variables. Linearity means that $y$ is linear in $\beta$, not necessarily in $x_i$, and that $u$ is additive. Thus the following model is also valid:

\[ \ln y = \ln(X)\beta + u. \]

The OLSE is defined as follows.

Def3.7 OLS estimator

The OLS estimator of $\beta$, say $\hat\beta$, is given by

\[ \hat\beta := \arg\min_{\beta \in \Theta \subset \mathbb{R}^k} \|u\|^2 = \arg\min_{\beta \in \Theta \subset \mathbb{R}^k} \|y - X\beta\|^2, \]

where $\|\cdot\|$ is the Euclidean norm.


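Prop3.8

Consider the following linear regression model:

\[ \underset{(n \times 1)}{y} = \underset{(n \times k)}{X}\ \underset{(k \times 1)}{\beta} + \underset{(n \times 1)}{u}, \qquad u_i \sim \text{iid } F(u) \]

\[ \iff \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} x_1' \\ x_2' \\ \vdots \\ x_n' \end{pmatrix} \begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{pmatrix} + \begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_n \end{pmatrix}. \]

Assume that the following conditions hold:

A1 All $x_i$ are constant.

A2 $\mathrm{rank}\, X = k$.

A3 $E[u_i] = 0 \ \forall i$.

A4 $E[u_i^2] = \sigma^2 < \infty \ \forall i$.

A5 $E[u_i u_j] = 0 \ \forall i \neq j$.

Then, under these assumptions, the OLS estimator $\hat\beta$ of $\beta$ is given by

\[ \hat\beta = (X'X)^{-1}X'y \]

\[ \iff \begin{pmatrix} \hat\beta_1 \\ \hat\beta_2 \\ \vdots \\ \hat\beta_k \end{pmatrix} = \left[ \begin{pmatrix} x_1 & x_2 & \cdots & x_n \end{pmatrix} \begin{pmatrix} x_1' \\ x_2' \\ \vdots \\ x_n' \end{pmatrix} \right]^{-1} \begin{pmatrix} x_1 & x_2 & \cdots & x_n \end{pmatrix} \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}. \]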

Remarks

A1 states that all $x_i$ are constant, that is, they are not random variables. In modern econometrics and statistics, however, we usually treat them as random variables. In that case, all of the theory discussed below is based on conditional expectations given $X$.

A2 states that the columns of the data matrix (or design matrix) $X$ are linearly independent. This condition ensures that the inverse of $X'X$ exists. To see this, for an $n \times k$ matrix $A$ let $\mathrm{Im}(T_A) := \{Ax \,;\, x \in \mathbb{R}^k\}$ (the image of $T_A$) and $\mathrm{Ker}(T_A) := \{x \in \mathbb{R}^k \,;\, Ax = 0\}$ (the kernel of $T_A$). Since $Ax = 0 \Rightarrow A'Ax = 0$ and $A'Ax = 0 \Rightarrow x'A'Ax = \|Ax\|^2 = 0 \iff Ax = 0$, we get $Ax = 0 \iff A'Ax = 0$. This implies that $\mathrm{Ker}(T_A) = \mathrm{Ker}(T_{A'A})$ and therefore, by the rank-nullity theorem, $\dim \mathrm{Im}(T_A) = \dim \mathrm{Im}(T_{A'A})$. Thus $\mathrm{rank}(A) = \dim(\mathrm{Im}(T_A)) = \dim(\mathrm{Im}(T_{A'A})) = \mathrm{rank}(A'A)$. Applying this with $A = X$, the $k \times k$ matrix $X'X$ has rank $k$ and is therefore invertible.

A3 means that the mean of the error term is zero.

A4 states that the variance of the error term is constant and finite across observations. We call this assumption homoskedasticity. If homoskedasticity does not hold, we say that the errors are heteroskedastic, which is a considerably harder problem.

A5 requires that there is no correlation between the errors of any two observations. In the context of time-series models, A5 states that there is no serial correlation in the error term.


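From A3 to A5, we can write $V[u] = \sigma^2 I_n$, where $I_n$ is the $n$-dimensional identity matrix.

Proof

By definition, $\|u\|^2 = u'u = \sum_{i=1}^n u_i^2$. Substituting $u = y - X\beta$, we get

\[ \begin{aligned}
\|u\|^2 &= \|y - X\beta\|^2 \\
&= (y - X\beta)'(y - X\beta) \\
&= y'y + \beta'X'X\beta - \beta'X'y - y'X\beta \\
&= y'y + \beta'X'X\beta - 2\beta'X'y,
\end{aligned} \]

where the last step uses that the scalar $y'X\beta$ equals its transpose, $(y'X\beta)' = \beta'X'y$. Thus, setting the derivative with respect to $\beta$ to zero,

\[ \begin{aligned}
\frac{\partial u'u}{\partial \beta} &= 0 + 2X'X\beta - 2X'y = 0 \qquad (X'X = (X'X)') \\
&\iff X'X\beta = X'y \qquad (\text{we call this equation the ``normal equation''}) \\
&\iff \hat\beta = (X'X)^{-1}X'y. \qquad (\text{A2})
\end{aligned} \]

Q.E.D.

Moreover, for any $\beta$, $\|y - X\beta\|^2 = \|y - X\hat\beta\|^2 + \|X\hat\beta - X\beta\|^2$ holds, so $\beta = \hat\beta$ clearly minimizes the left-hand side.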
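The normal equation can be solved by hand for a small data set. The following is a minimal sketch, assuming a made-up sample with an intercept and one regressor ($n = 4$, $k = 2$); the data and all helper names are illustrative and not part of the notes. Exact fractions are used so that the decomposition $\|y - X\beta\|^2 = \|y - X\hat\beta\|^2 + \|X\hat\beta - X\beta\|^2$ can be checked without rounding error.

```python
from fractions import Fraction

def transpose(M):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def inverse_2x2(M):
    """Invert a 2x2 matrix; a zero determinant means A2 (rank X = k) fails."""
    (a, b), (c, d) = M
    det = a * d - b * c
    if det == 0:
        raise ValueError("X'X is singular, so A2 does not hold")
    return [[d / det, -b / det], [-c / det, a / det]]

def ols(X, y):
    """Solve the normal equation X'X b = X'y for a two-column X."""
    Xt = transpose(X)
    XtX = matmul(Xt, X)
    Xty = matmul(Xt, [[v] for v in y])
    return [row[0] for row in matmul(inverse_2x2(XtX), Xty)]

def fitted(X, beta):
    """Return the vector X beta."""
    return [sum(x * b for x, b in zip(row, beta)) for row in X]

def sq_norm(v):
    """Return the squared Euclidean norm of a vector."""
    return sum(x * x for x in v)

# Made-up data: an intercept column and one regressor.
X = [[Fraction(1), Fraction(x)] for x in (0, 1, 2, 3)]
y = [Fraction(v) for v in (1, 3, 2, 5)]

beta_hat = ols(X, y)
ssr_hat = sq_norm([yi - fi for yi, fi in zip(y, fitted(X, beta_hat))])

# Check the decomposition at some other candidate beta.
beta_other = [Fraction(0), Fraction(1)]
ssr_other = sq_norm([yi - fi for yi, fi in zip(y, fitted(X, beta_other))])
gap = sq_norm([f1 - f2 for f1, f2 in zip(fitted(X, beta_hat), fitted(X, beta_other))])
```

For this data `beta_hat` is $(11/10, 11/10)$, and `ssr_other` equals `ssr_hat + gap`, so any other candidate has a sum of squared residuals at least as large as that of $\hat\beta$.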

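Prop 3.9

\[ E[\hat\beta] = \beta, \qquad V[\hat\beta] = \sigma^2 (X'X)^{-1}. \]

Proof

By the definition of the OLSE, we get

\[ \begin{aligned}
\hat\beta &= (X'X)^{-1}X'y \\
&= (X'X)^{-1}X'(X\beta + u) \\
&= \beta + (X'X)^{-1}X'u.
\end{aligned} \]

Taking the expectation, we can show that

\[ \begin{aligned}
E[\beta + (X'X)^{-1}X'u] &= \beta + (X'X)^{-1}X'E[u] \qquad (\text{A1}) \\
&= \beta. \qquad (\text{A3})
\end{aligned} \]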


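Next we derive the variance as follows:

\[ \begin{aligned}
V[\hat\beta] &= E\left[(\hat\beta - E[\hat\beta])(\hat\beta - E[\hat\beta])'\right] \\
&= E\left[(\hat\beta - \beta)(\hat\beta - \beta)'\right] \\
&= E\left[((X'X)^{-1}X'u)((X'X)^{-1}X'u)'\right] \\
&= E\left[(X'X)^{-1}X'uu'X(X'X)^{-1}\right] \\
&= (X'X)^{-1}X'E[uu']X(X'X)^{-1} \qquad (\text{A1}) \\
&= (X'X)^{-1}X'(\sigma^2 I_n)X(X'X)^{-1} \qquad (\text{A3} \sim \text{A5}) \\
&= \sigma^2 (X'X)^{-1}.
\end{aligned} \]

Q.E.D.

Since the diagonal elements of $V[\hat\beta]$ are $V[\hat\beta_1], V[\hat\beta_2], \dots, V[\hat\beta_k]$, the square root $\sqrt{V[\hat\beta_j]}$ is the standard deviation of $\hat\beta_j$, and we usually call it the standard error in the context of regression analysis.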
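Both claims of Prop 3.9 can be verified exactly in a toy setting. The sketch below assumes, purely for illustration, that each $u_i$ takes the values $-1$ and $+1$ with probability $1/2$ independently, so that A3 to A5 hold with $\sigma^2 = 1$. It enumerates all $2^n$ error vectors for a fixed made-up design $X$ and a hypothetical true $\beta$, and computes the mean and covariance of $\hat\beta$ over that enumeration.

```python
from fractions import Fraction
from itertools import product

# Fixed (non-random) design as in A1: an intercept and one regressor. Made-up values.
X = [[Fraction(1), Fraction(x)] for x in (0, 1, 2, 3)]
beta_true = [Fraction(2), Fraction(-1)]  # hypothetical true parameter

def ols_two_columns(X, y):
    """OLS for a two-column X via the normal equation X'X b = X'y."""
    s11 = sum(r[0] * r[0] for r in X)
    s12 = sum(r[0] * r[1] for r in X)
    s22 = sum(r[1] * r[1] for r in X)
    t1 = sum(r[0] * yi for r, yi in zip(X, y))
    t2 = sum(r[1] * yi for r, yi in zip(X, y))
    det = s11 * s22 - s12 * s12
    return [(s22 * t1 - s12 * t2) / det, (s11 * t2 - s12 * t1) / det]

# Assumed error distribution: u_i = -1 or +1 with equal probability, independent.
# Then E[u_i] = 0, E[u_i^2] = 1 and E[u_i u_j] = 0 for i != j.
draws = list(product([-1, 1], repeat=len(X)))
prob = Fraction(1, len(draws))

estimates = []
for u in draws:
    y = [sum(x * b for x, b in zip(row, beta_true)) + ui for row, ui in zip(X, u)]
    estimates.append(ols_two_columns(X, y))

mean = [sum(prob * est[j] for est in estimates) for j in range(2)]
cov = [[sum(prob * (est[i] - mean[i]) * (est[j] - mean[j]) for est in estimates)
        for j in range(2)] for i in range(2)]
```

Here `mean` equals `beta_true`, and `cov` equals $(X'X)^{-1}$, which for this design is the matrix with rows $(7/10, -3/10)$ and $(-3/10, 1/5)$. Note that no normality is needed for either result.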

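Prop 3.10

In addition to A1-A5, assume that the following condition holds:

A6 $u \sim N_n(0, \sigma^2 I_n)$.

Then, $\hat\beta \sim N_k(\beta, \sigma^2 (X'X)^{-1})$.

Proof

As we studied in the last class, the m.g.f. of the $n$-dimensional normal distribution $N(\mu, \Sigma)$ is given by $M_X(t) = \exp(t'\mu + \frac{1}{2} t'\Sigma t)$, where $t = (t_1, \dots, t_n)'$. On the other hand, the m.g.f. of $\hat\beta$ can be calculated, for $t \in \mathbb{R}^k$, as follows:

\[ \begin{aligned}
M_{\hat\beta}(t) &= E[\exp(t'\hat\beta)] \\
&= E\left[\exp(t'\beta + t'(X'X)^{-1}X'u)\right] \\
&= \exp(t'\beta)\, E\left[\exp(t'(X'X)^{-1}X'u)\right] \\
&= \exp(t'\beta)\, M_u\!\left(X(X'X)^{-1}t\right) \\
&= \exp(t'\beta) \exp\left(\frac{\sigma^2}{2}\, t'(X'X)^{-1}X'X(X'X)^{-1}t\right) \\
&= \exp\left(t'\beta + \frac{\sigma^2}{2}\, t'(X'X)^{-1}t\right).
\end{aligned} \]

This is the m.g.f. of $N_k(\beta, \sigma^2 (X'X)^{-1})$. Thus, $\hat\beta \sim N_k(\beta, \sigma^2 (X'X)^{-1})$. Q.E.D.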

4 Unbiasedness

We now consider what kind of estimators are good for estimating unknown parameters. Since “good” is so far undefined, we first have to define what is “desirable”. The Mean Squared Error (MSE) is one such criterion; it measures the “distance” from an estimator to the parameter and is defined below:


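Def3.11 Mean Squared Error (MSE)

\[ MSE(\hat\theta, \theta) := E[\,|\hat\theta - \theta|^2\,], \]

where $\theta$ is an $n$-dimensional parameter and $\hat\theta$ is an $n$-dimensional estimator of it.

For simplicity, let $\hat\theta : \Omega \to \mathbb{R}$ and $\theta \in \Theta \subset \mathbb{R}$. In order to understand what the MSE means, we rewrite it as follows:

\[ \begin{aligned}
E[(\hat\theta - \theta)^2] &= E[(\hat\theta - E[\hat\theta] + E[\hat\theta] - \theta)^2] \\
&= E[(\hat\theta - E[\hat\theta])^2 + 2(\hat\theta - E[\hat\theta])(E[\hat\theta] - \theta) + (E[\hat\theta] - \theta)^2] \\
&= V[\hat\theta] + 2(E[\hat\theta] - E[\hat\theta])(E[\hat\theta] - \theta) + (E[\hat\theta] - \theta)^2 \\
&= V[\hat\theta] + (E[\hat\theta] - \theta)^2 \\
&= V[\hat\theta] + b(\theta)^2,
\end{aligned} \]

where we define $b(\theta) := E[\hat\theta] - \theta$ and call it the bias. This result means that the MSE decomposes into two terms: the variance of $\hat\theta$ and the squared bias. An estimator that minimizes the MSE would be a natural candidate for the desirable estimator; however, there is no estimator that uniformly minimizes the MSE for every parameter value. For example, the constant estimator $\hat\theta \equiv \theta_1$ has zero MSE at $\theta = \theta_1$ and cannot be beaten there, although it performs badly elsewhere. That is, only God can minimize both $V[\hat\theta]$ and the bias. Therefore, instead of seeking the best estimator among arbitrary estimators, we have to restrict our estimators to some class with certain properties. Unbiasedness is one such property and is defined below: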

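Def3.12 Unbiasedness

If, for any $\theta \in \Theta$,

\[ E[\hat\theta] = \theta \]

holds, then we say $\hat\theta$ is an unbiased estimator of $\theta$.

Notice that if $\hat\theta$ is unbiased, then $b(\theta) = 0$ and therefore $MSE(\hat\theta, \theta) = V[\hat\theta]$.

Example

Let $\{X_n\}$ be iid random variables with mean $\mu$ and variance $\sigma^2$. Then the sample mean $\bar X := n^{-1}\sum_{i=1}^n X_i$ is an unbiased estimator of the population mean $\mu$. We can check it as follows:

\[ E\Big[\frac{1}{n}\sum_i X_i\Big] = \frac{1}{n}\sum_i E[X_i] = \frac{1}{n}\, n\mu = \mu. \]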
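The decomposition of the MSE into variance and squared bias (Def3.11) can also be checked exactly on a tiny example. The sketch below assumes a fair 0/1 draw with mean $\mu = 1/2$ and a deliberately biased, hypothetical estimator (half of the sample mean); neither appears in the notes. All expectations are computed by enumerating every possible sample.

```python
from fractions import Fraction
from itertools import product

def moments_of_estimator(estimator, support, n, theta):
    """Exact MSE, variance and bias of an estimator for n iid draws uniform on `support`."""
    values = [Fraction(estimator(s)) for s in product(support, repeat=n)]
    prob = Fraction(1, len(values))
    mean = sum(prob * v for v in values)
    variance = sum(prob * (v - mean) ** 2 for v in values)
    mse = sum(prob * (v - theta) ** 2 for v in values)
    return mse, variance, mean - theta

def shrunk_mean(sample):
    """Hypothetical biased estimator of the mean: half of the sample mean."""
    return Fraction(sum(sample), 2 * len(sample))

mu = Fraction(1, 2)  # true mean of a fair 0/1 draw (assumed population)
mse, variance, bias = moments_of_estimator(shrunk_mean, (0, 1), 2, mu)
```

With $n = 2$ the enumeration gives variance $1/32$, bias $-1/4$ and MSE $3/32 = 1/32 + (1/4)^2$, in agreement with the decomposition.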

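Example

Let $\{X_n\}$ be iid random variables with mean $\mu$ and variance $\sigma^2$. Then the sample variance $T_n := \frac{1}{n}\sum_i (X_i - \bar X_n)^2$ is not an unbiased estimator of the population variance $\sigma^2$. We can check it as follows, using $E[\bar X_n] = \mu$ and $V[\bar X_n] = \sigma^2/n$:

\[ \begin{aligned}
E\Big[\frac{1}{n}\sum_i (X_i - \bar X_n)^2\Big]
&= \frac{1}{n} E\Big[\sum_i \{(X_i - E[\bar X_n]) - (\bar X_n - E[\bar X_n])\}^2\Big] \\
&= \frac{1}{n}\Big\{\sum_i E[(X_i - \mu)^2] - E\Big[2(\bar X_n - E[\bar X_n])\sum_i (X_i - E[\bar X_n])\Big] + \sum_i E[(\bar X_n - E[\bar X_n])^2]\Big\} \\
&= \frac{1}{n}\Big\{\sum_i V[X_i] - E\Big[2n(\bar X_n - E[\bar X_n])\,\frac{1}{n}\sum_i (X_i - E[\bar X_n])\Big] + \sum_i V[\bar X_n]\Big\} \\
&= \frac{1}{n}\big\{n\sigma^2 - 2nV[\bar X_n] + nV[\bar X_n]\big\} \\
&= \frac{1}{n}\big(n\sigma^2 - nV[\bar X_n]\big) \\
&= \sigma^2\,\frac{n-1}{n} \neq \sigma^2.
\end{aligned} \]

From this result, we can see that an unbiased estimator of the population variance $\sigma^2$ is $s^2 := \frac{1}{n-1}\sum_i (X_i - \bar X_n)^2$, which we call the unbiased sample variance to emphasize its unbiasedness.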
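The factor $(n-1)/n$ can be confirmed exactly by enumeration. The sketch below assumes a fair 0/1 population (so $\sigma^2 = 1/4$) and $n = 3$; these choices and the function names are illustrative only. It computes the exact expectation of the two variance estimators over all $2^3$ equally likely samples.

```python
from fractions import Fraction
from itertools import product

def sum_of_squares(sample):
    """Return sum_i (x_i - xbar)^2 as an exact fraction."""
    xbar = Fraction(sum(sample), len(sample))
    return sum((x - xbar) ** 2 for x in sample)

def variance_divide_by_n(sample):
    """T_n: divides the sum of squares by n (biased)."""
    return sum_of_squares(sample) / len(sample)

def variance_divide_by_n_minus_1(sample):
    """s^2: divides the sum of squares by n - 1 (unbiased)."""
    return sum_of_squares(sample) / (len(sample) - 1)

def expected_value(estimator, support, n):
    """Exact expectation over n iid draws that are uniform on `support`."""
    samples = list(product(support, repeat=n))
    return sum(estimator(s) for s in samples) / len(samples)

n = 3
sigma2 = Fraction(1, 4)  # variance of a fair 0/1 draw (assumed population)

expected_T = expected_value(variance_divide_by_n, (0, 1), n)
expected_s2 = expected_value(variance_divide_by_n_minus_1, (0, 1), n)
```

In this setting `expected_T` is $1/6 = \sigma^2 (n-1)/n$, while `expected_s2` is exactly $1/4 = \sigma^2$.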

Next, we consider the OLSE case.

Prop3.13

The OLSE $\hat\beta$ is unbiased, that is:

\[ E[\hat\beta] = \beta. \]

Proof

By the above result, the OLSE is given by

\[ \hat\beta = (X'X)^{-1}X'y = (X'X)^{-1}X'(X\beta + u). \]

Taking the expectation, as seen above, we get

\[ \begin{aligned}
E[\hat\beta] &= E[(X'X)^{-1}X'(X\beta + u)] \\
&= (X'X)^{-1}X'X\beta + (X'X)^{-1}X'E[u] \\
&= \beta.
\end{aligned} \]

Q.E.D.

Prop3.14

Let $e := y - X\hat\beta$. Then

\[ s^2 := \frac{e'e}{n-k} \]

is an unbiased estimator of $\sigma^2$. That is, $E[s^2] = \sigma^2$.

Proof

By definition, we can rewrite $e$ as $e = y - X\hat\beta = y - X(X'X)^{-1}X'y = [I_n - X(X'X)^{-1}X']y$. Next we define:

\[ M_X := I_n - X(X'X)^{-1}X'. \]


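This matrix $M_X$ has the following properties:

\[ M_X' = M_X \ (\text{symmetric}), \qquad M_X M_X = M_X \ (\text{idempotent}), \qquad M_X X = 0. \]

Using this matrix we can rewrite $e$ as $e = M_X y = M_X(X\beta + u) = M_X u$. Thus we can also rewrite $\frac{e'e}{n-k}$ as follows:

\[ s^2 = \frac{e'e}{n-k} = \frac{u'M_X'M_X u}{n-k} = \frac{u'M_X u}{n-k}. \]

Taking the expectation of $u'M_X u$, where $m_{ij}$ denotes the $(i,j)$ element of $M_X$, we get

\[ \begin{aligned}
E[u'M_X u] &= \sum_i \sum_j m_{ij} E[u_i u_j] \\
&= \sum_i m_{ii}\sigma^2 \qquad (E[u_i u_j] = 0 \ \forall i \neq j) \\
&= \sigma^2 \sum_i m_{ii} \\
&= \sigma^2\, \mathrm{trace}\, M_X.
\end{aligned} \]

$\mathrm{trace}\, M_X$ can be calculated as follows:

\[ \begin{aligned}
\mathrm{trace}\, M_X &= \mathrm{trace}(I_n - X(X'X)^{-1}X') \\
&= n - \mathrm{trace}(X(X'X)^{-1}X') \\
&= n - \mathrm{trace}((X'X)^{-1}X'X) \qquad (\mathrm{trace}(AB) = \mathrm{trace}(BA)) \\
&= n - \mathrm{trace}(I_k) \\
&= n - k.
\end{aligned} \]

Therefore,

\[ E[s^2] = \frac{E[u'M_X u]}{n-k} = \frac{\sigma^2 (n-k)}{n-k} = \sigma^2. \]

Q.E.D.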
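The three properties of $M_X$ and the identity $\mathrm{trace}\, M_X = n - k$ are easy to confirm for a concrete design. The following sketch assumes a made-up $4 \times 2$ matrix $X$ (an intercept and one regressor) and uses exact fractions; the helper names are hypothetical.

```python
from fractions import Fraction

def transpose(M):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def inverse_2x2(M):
    """Invert a 2x2 matrix with a nonzero determinant."""
    (a, b), (c, d) = M
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]

# Made-up design: intercept plus one regressor, so n = 4 and k = 2.
X = [[Fraction(1), Fraction(x)] for x in (0, 1, 2, 3)]
n, k = len(X), len(X[0])

Xt = transpose(X)
P = matmul(matmul(X, inverse_2x2(matmul(Xt, X))), Xt)  # X (X'X)^{-1} X'
M = [[Fraction(int(i == j)) - P[i][j] for j in range(n)] for i in range(n)]  # I_n - P

is_symmetric = M == transpose(M)
is_idempotent = matmul(M, M) == M
annihilates_X = all(entry == 0 for row in matmul(M, X) for entry in row)
trace_M = sum(M[i][i] for i in range(n))
```

For this design all three checks are true and `trace_M` equals $n - k = 2$, which is why dividing $e'e$ by $n - k$ rather than $n$ yields an unbiased estimator.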

5 Some unpleasant properties of the unbiased estimator

Although unbiasedness is one of the “nice” properties seen above, modern statistics often does not regard it as a necessary property of an estimator. To understand this, we present two unpleasant properties of unbiased estimators.

1

Let $\hat\theta$ be an unbiased estimator of $\theta$ and let $g$ be some function. The question is whether $g(\hat\theta)$ is always an unbiased estimator of $g(\theta)$. The answer is “NO”. Suppose that $s^2$ is the unbiased sample variance, that is, $s^2 = \sum_i (X_i - \bar X)^2/(n-1)$. Now we consider an unbiased estimator of $\sigma$ (the standard deviation). If $g(\hat\theta)$ were always an unbiased estimator of $g(\theta)$, you might think that $s$ is an unbiased estimator of $\sigma$. However, we can show that

\[ \sigma = \sqrt{\sigma^2} = \sqrt{E[s^2]} > E\big[\sqrt{s^2}\big] = E[s]. \]

In the middle step, we use Jensen’s inequality for concave functions. This result means that $\sigma > E[s]$, so $s$ is not an unbiased estimator of $\sigma$.

2

If we insist on using unbiased estimators for statistical inference problems, we sometimes have to use a “ridiculous” estimator. Suppose that the random variables $\{X_n\}$ are normally distributed with mean $\mu$ and variance $\sigma^2$. Then the following estimator is an unbiased estimator of $\sigma$ (proof is omitted):

\[ \hat\sigma := \sqrt{\frac{n}{2}}\; \frac{\Gamma\!\left(\frac{n-1}{2}\right)}{\Gamma\!\left(\frac{n}{2}\right)} \sqrt{\frac{1}{n}\sum_i (x_i - \bar x)^2}, \]

where $\Gamma(s) := \int_0^\infty x^{s-1}e^{-x}\,dx$ and $\bar x$ is the sample mean. Therefore, if we insist on unbiased estimators for statistical inference, we use $s^2$ for $\sigma^2$ and $\hat\sigma$ for $\sigma$, even though $\hat\sigma^2 \neq s^2$. This is clearly unnatural. Hence, in this case, it is unreasonable to strictly adopt unbiasedness as the criterion of inference.
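To see how large the correction is, the estimator above can be rewritten as $\hat\sigma = c(n)\, s$ with $c(n) = \sqrt{(n-1)/2}\;\Gamma((n-1)/2)/\Gamma(n/2)$, which follows by pulling $\sqrt{1/n}$ out of the second square root. The sketch below evaluates $c(n)$ with `math.lgamma` to avoid overflow for larger $n$; the function name `unbiasing_factor` is a label introduced here for illustration.

```python
import math

def unbiasing_factor(n):
    """Return c(n) such that c(n) * s is unbiased for sigma under normality.

    c(n) = sqrt((n - 1) / 2) * Gamma((n - 1) / 2) / Gamma(n / 2), computed on the
    log scale so that large n does not overflow.
    """
    if n < 2:
        raise ValueError("need at least two observations")
    log_ratio = math.lgamma((n - 1) / 2) - math.lgamma(n / 2)
    return math.sqrt((n - 1) / 2) * math.exp(log_ratio)

# Correction factors for a few sample sizes.
factors = {n: unbiasing_factor(n) for n in (2, 5, 10, 100, 1000)}
```

The factor is always larger than 1, which matches $E[s] < \sigma$ from Jensen's inequality, and it shrinks towards 1 as $n$ grows. For $n = 2$ it equals $\sqrt{\pi/2}$, so $s$ understates $\sigma$ by about 20 percent on average in that case.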

6 Gauss-Markov theorem

Thm 3.15

Under assumptions A1-A5, the OLSE $\hat\beta$ of $\beta$ is efficient in the class of linear unbiased estimators. That is, for any linear unbiased estimator $b$ of $\beta$,

\[ V[b] \geq V[\hat\beta] \]

holds in the matrix sense, i.e., $V[b] - V[\hat\beta]$ is positive semidefinite.

Proof

Since $b$ is linear in $y$ by assumption, we can write $b = Cy$ using some $k \times n$ matrix $C$. To prove $V[b] \geq V[\hat\beta]$, we have to show that

\[ a'V[b]a \geq a'V[\hat\beta]a \]

for any $a \in \mathbb{R}^k$. First we fix an arbitrary $a \in \mathbb{R}^k$. Then we get

\[ \begin{aligned}
a'V[b]a &= a'V[Cy]a \\
&= (a'C)V[y](C'a) \\
&= (a'C)\sigma^2 I_n (C'a) \\
&= \sigma^2 a'CC'a.
\end{aligned} \]

On the other hand, $a'V[\hat\beta]a$ is given by

\[ a'V[\hat\beta]a = a'\sigma^2 (X'X)^{-1}a = \sigma^2 a'(X'X)^{-1}a. \]

Since $b$ is unbiased, we get

\[ \beta = E[b] = E[Cy] = CE[y] = (CX)\beta, \]


and, since this must hold for every $\beta$, it implies that $CX = I_k$. From these results, we can see that

\[ \begin{aligned}
a'CC'a - a'(X'X)^{-1}a &= a'CC'a - a'CX(X'X)^{-1}X'C'a \\
&= a'C(I_n - X(X'X)^{-1}X')C'a.
\end{aligned} \]

Next, we define $M_X := I_n - X(X'X)^{-1}X'$. As we saw above, this matrix is symmetric and idempotent. Since all eigenvalues of a symmetric and idempotent matrix are 0 or 1 (see Appendix), $M_X$ is positive semidefinite. Thus, we can show that

\[ \begin{aligned}
a'C(I_n - X(X'X)^{-1}X')C'a &= a'CM_XC'a \\
&= (C'a)'M_X(C'a) \geq 0,
\end{aligned} \]

and this implies that $a'CC'a - a'(X'X)^{-1}a \geq 0$. Multiplying by $\sigma^2 > 0$, we finally get $a'V[b]a \geq a'V[\hat\beta]a$. Q.E.D.
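A one-parameter special case makes the theorem concrete. Assume, for illustration only, the model $y_i = \mu + u_i$, so that $X$ is a column of ones and $k = 1$. A linear estimator $b = \sum_i w_i y_i$ is unbiased exactly when $\sum_i w_i = 1$ (this is $CX = I_k$), its variance is $\sigma^2 \sum_i w_i^2$, and OLS is the sample mean with $w_i = 1/n$. The sketch below compares OLS with an arbitrary alternative weighting; the weights are made up.

```python
from fractions import Fraction

def variance_factor(weights):
    """Return sum(w_i^2), so that V[b] = sigma^2 * sum(w_i^2) under A3-A5."""
    if sum(weights) != 1:
        raise ValueError("weights must sum to 1 for unbiasedness (CX = I_k)")
    return sum(w * w for w in weights)

n = 4
ols_weights = [Fraction(1, n)] * n  # sample mean = OLS in this model
alt_weights = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 8), Fraction(1, 8)]  # made up

ols_factor = variance_factor(ols_weights)
alt_factor = variance_factor(alt_weights)

# In this model (C'a)'M_X(C'a) reduces to sum((w_i - 1/n)^2) for a = 1.
quadratic_form = sum((w - Fraction(1, n)) ** 2 for w in alt_weights)
```

Here `ols_factor` is $1/4$ and `alt_factor` is $11/32$, and their difference $3/32$ equals `quadratic_form`, the nonnegative term $(C'a)'M_X(C'a)$ from the proof.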

Remarks

1 The Gauss-Markov theorem says that the OLSE is efficient in the sense that its variance matrix $\sigma^2 (X'X)^{-1}$ is the smallest among linear unbiased estimators, as we saw above. For this reason the OLSE is called the Best Linear Unbiased Estimator (BLUE).

2 It should be emphasized that assumption A3 is critical for proving this theorem, and that we do not need A6 to hold.

3 If, in addition to assumptions A1-A5, we assume that A6 also holds, then the variance matrix of the OLSE is the smallest among all unbiased estimators, not only linear ones, since the OLSE is a function of the complete sufficient statistic (see Yong and Smith (2010)). This result is called the Lehmann–Scheffé theorem.

4 As we saw above, the MSE can be decomposed into a variance term and a squared bias term. Thus, under assumptions A1-A5, the OLSE attains the smallest MSE among linear unbiased estimators.

Appendix

Prop3.16

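A symmetric matrix $A$ is idempotent if and only if all eigenvalues of $A$ are 0 or 1.

Proof

($\Longrightarrow$) If $Ap_i = \lambda_i p_i$, then $A^2 p_i = \lambda_i A p_i = \lambda_i^2 p_i$, where $p_i \in \mathbb{R}^m$ is an eigenvector and $\lambda_i$ is the corresponding eigenvalue. Since $A$ is idempotent, $\lambda_i^2 p_i = A^2 p_i = A p_i = \lambda_i p_i$. Since $p_i \neq 0$ and $\lambda_i$ is a real number, this implies $\lambda_i^2 = \lambda_i$, so $\lambda_i$ is 1 or 0.

($\Longleftarrow$) Let $\lambda_1 = \cdots = \lambda_k = 1$ and $\lambda_{k+1} = \cdots = \lambda_m = 0$. Then, by the spectral decomposition of a symmetric matrix, $A = p_1p_1' + \cdots + p_kp_k'$, where $p_i$ is the eigenvector corresponding to $\lambda_i$ and $\|p_i\| = 1$. Since $p_i'p_i = 1$ and $p_i'p_j = 0$ ($i \neq j$), we get

\[ A^2 = (p_1p_1' + \cdots + p_kp_k')(p_1p_1' + \cdots + p_kp_k') = p_1p_1' + \cdots + p_kp_k' = A. \]

Q.E.D.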
