TA session note#9
Shouto Yonekura
June 20, 2016
Abstract: The TA session on 5th July will be canceled.
1 MLE
Suppose that the data are the observed value of a random variable $X$ from some parametric family of densities or mass functions, $X \sim f(x; \theta)$, where in general $\theta \in \Theta \subseteq \mathbb{R}^k$. Let $X := (X_1, X_2, \dots, X_n)$ and $x := (x_1, x_2, \dots, x_n)$. After observing $x$, the likelihood function is defined by
$$L(\theta) := f(\theta; x),$$
viewed as a function of $\theta$. If $X_1, \dots, X_n \overset{iid}{\sim} f(x; \theta)$, then $L(\theta) = \prod_i f(\theta; x_i)$. Usually we work with the log-likelihood function $l(\theta) := \ln L(\theta)$.
Example1
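Let $X$ be a single observation taking values in $\{0, 1, 2\}$ according to $f(x; \theta)$, where $\theta = \theta_0$ or $\theta_1$, and the values of $f(x; \theta_j)(\{i\})$ are given below:
$$\begin{array}{c|ccc} & x = 0 & x = 1 & x = 2 \\ \hline \theta = \theta_0 & 0.8 & 0.1 & 0.1 \\ \theta = \theta_1 & 0.2 & 0.3 & 0.5 \end{array}$$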
If $X = 0$ is observed, it is more plausible that it came from $f(x; \theta_0)$, since $f(x; \theta_0)(\{0\})$ is much larger than $f(x; \theta_1)(\{0\})$. We then estimate $\theta$ by $\theta_0$. On the other hand, if $X = 1$ or $2$, it is more plausible that it came from $f(x; \theta_1)$. This suggests the following estimator of $\theta$:
$$T(X) = \begin{cases} \theta_0 & \text{if } X = 0, \\ \theta_1 & \text{if } X \neq 0. \end{cases}$$
This leads to the following natural definition.
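Def 9.1 The Maximum Likelihood Estimator (MLE)
Suppose that $X \sim f(x; \theta)$, $\theta \in \Theta \subseteq \mathbb{R}^k$. Let $L(\theta)$ be the likelihood function. Then the MLE $\hat\theta$ is defined by
$$\hat\theta := \arg\max_{\theta \in \Theta} L(\theta),$$
or equivalently $\hat\theta := \arg\min_{\theta \in \Theta} \{-L(\theta)\}$.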
In most cases $l$ is differentiable and $\hat\theta$ is obtained by solving the likelihood equation $l'(\theta) = 0$.
Since $\ln x$ is a strictly increasing function and $L(\theta)$ can be assumed to be positive without loss of generality, $\hat\theta$ is an MLE if and only if it maximizes the log-likelihood function $l(\theta)$.
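As a small illustration of Def 9.1, the sketch below applies the definition to the table of Example 1, where $\Theta = \{\theta_0, \theta_1\}$ is finite and the maximization reduces to a direct comparison. The names `PMF` and `mle` are hypothetical and introduced only for this illustration.

```python
import math

# Probability table from Example 1: PMF[theta][x] = f(x; theta).
PMF = {
    "theta0": {0: 0.8, 1: 0.1, 2: 0.1},
    "theta1": {0: 0.2, 1: 0.3, 2: 0.5},
}


def mle(x, use_log=False):
    """Return the parameter label that maximizes the (log-)likelihood at x."""
    def objective(theta):
        value = PMF[theta][x]
        return math.log(value) if use_log else value
    return max(PMF, key=objective)


for x in (0, 1, 2):
    # Maximizing L and maximizing ln L select the same parameter.
    print(x, mle(x), mle(x, use_log=True))
```

Because the logarithm is strictly increasing, both calls return the same label for every observed value, which is the equivalence stated in the remark above.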
Example2
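Let $X_1, \dots, X_n \overset{iid}{\sim} N(\mu, \sigma^2)$ and $\theta = (\mu, \sigma^2)$. Then the log-likelihood function is
$$l(\theta) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln(\sigma^2) - \frac{1}{2\sigma^2}\sum_i (x_i - \mu)^2.$$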
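The likelihood equations become
$$\partial_\mu l(\theta) = \frac{1}{\sigma^2}\sum_i (x_i - \mu) = 0,$$
$$\partial_{\sigma^2} l(\theta) = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_i (x_i - \mu)^2 = 0.$$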
Solving these equations gives
$$\hat\mu = n^{-1}\sum_i x_i, \qquad \hat\sigma^2 = n^{-1}\sum_i (x_i - \hat\mu)^2.$$
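The closed-form solution of Example 2 can be checked numerically. The following is a minimal sketch, assuming a simulated sample with arbitrarily chosen true values $\mu = 1$, $\sigma = 2$ and a fixed seed; the function names `normal_mle` and `score` are hypothetical. It evaluates the two likelihood equations at $(\hat\mu, \hat\sigma^2)$ and should return values that are zero up to rounding error.

```python
import random


def normal_mle(xs):
    """Closed-form MLE (mu_hat, sigma2_hat) for an iid N(mu, sigma^2) sample."""
    n = len(xs)
    mu_hat = sum(xs) / n
    sigma2_hat = sum((x - mu_hat) ** 2 for x in xs) / n  # divisor n, not n - 1
    return mu_hat, sigma2_hat


def score(xs, mu, sigma2):
    """Partial derivatives of the log-likelihood with respect to mu and sigma^2."""
    n = len(xs)
    ss = sum((x - mu) ** 2 for x in xs)
    d_mu = sum(x - mu for x in xs) / sigma2
    d_sigma2 = -n / (2 * sigma2) + ss / (2 * sigma2 ** 2)
    return d_mu, d_sigma2


rng = random.Random(0)  # fixed seed, chosen arbitrarily
sample = [rng.gauss(1.0, 2.0) for _ in range(500)]
mu_hat, sigma2_hat = normal_mle(sample)
print(mu_hat, sigma2_hat, score(sample, mu_hat, sigma2_hat))
```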
Example3
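Consider the following regression model:
$$y = X\beta + u, \qquad u \sim N_n(0, \sigma^2 I),$$
where $y$ is $n \times 1$ and $X$ is $n \times k$. First, the density function of $u$ is
$$f_u = (2\pi\sigma^2)^{-n/2}\exp\left(-\frac{u'u}{2\sigma^2}\right).$$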
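By a change of variables, and since the Jacobian $\left|\partial u / \partial y\right|$ equals 1, we get
$$f_y = f_u(y - X\beta)\left|\frac{\partial u}{\partial y}\right| = (2\pi\sigma^2)^{-n/2}\exp\left(-\frac{(y - X\beta)'(y - X\beta)}{2\sigma^2}\right).$$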
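Therefore the log-likelihood function of $y$ is
$$l(\theta) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln(\sigma^2) - \frac{(y - X\beta)'(y - X\beta)}{2\sigma^2}.$$
The likelihood equations become
$$\partial_\beta l(\theta) = \frac{X'y - X'X\beta}{\sigma^2} = 0,$$
$$\partial_{\sigma^2} l(\theta) = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}(y - X\beta)'(y - X\beta) = 0.$$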
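Solving these equations gives
$$\hat\beta = (X'X)^{-1}X'y, \qquad \hat\sigma^2 = \frac{(y - X\hat\beta)'(y - X\hat\beta)}{n} = \frac{e'e}{n},$$
where $e := y - X\hat\beta$.
Note that in the case of OLS, $\hat\sigma^2 = e'e/(n - k)$, and this is an unbiased estimator. In the case of the MLE, however,
$$E[\hat\sigma^2] = E\left[\frac{e'e}{n}\right] = \frac{(n - k)\sigma^2}{n} < \sigma^2,$$
so the MLE of $\sigma^2$ is not unbiased. This is called small-sample bias.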
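The small-sample bias can be seen in a simulation. The sketch below is only an illustration under assumed settings that do not come from the notes: a regression on a constant and one regressor ($k = 2$), a fixed design $x_i = 0, \dots, n-1$ with $n = 10$, true $\sigma^2 = 4$, and hypothetical function names `ols_residual_ss` and `simulate`. It averages $e'e/n$ and $e'e/(n-k)$ over many simulated samples.

```python
import random


def ols_residual_ss(x, y):
    """Residual sum of squares e'e from regressing y on a constant and x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))


def simulate(n=10, sigma2=4.0, reps=20000, seed=0):
    """Average the MLE e'e/n and the OLS estimator e'e/(n-k) over many samples."""
    rng = random.Random(seed)
    k = 2  # intercept and slope
    x = [float(i) for i in range(n)]  # fixed design
    sigma = sigma2 ** 0.5
    total = 0.0
    for _ in range(reps):
        y = [1.0 + 0.5 * xi + rng.gauss(0.0, sigma) for xi in x]
        total += ols_residual_ss(x, y)
    mean_ss = total / reps
    return mean_ss / n, mean_ss / (n - k)


mle_mean, ols_mean = simulate()
# Theory: E[e'e/n] = (n-k)/n * sigma2 = 3.2 and E[e'e/(n-k)] = sigma2 = 4.0.
print(mle_mean, ols_mean)
```

With these settings the first average should sit near $(n-k)\sigma^2/n$ and the second near $\sigma^2$, up to Monte Carlo error.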
2 The Fisher information matrix
Prop 9.2
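B1 $X \sim \{f(\theta; x) : \theta \in \Theta \subseteq \mathbb{R}^k\}$ and $f(\theta; x)$ is twice differentiable with respect to $\theta$.
B2 There exists an integrable function $\phi(x)$ such that $|\partial_{\theta_i} f(\theta; x)| < \phi(x)$ for all $i$ and $x$.
Under these conditions,
$$E[\partial_{\theta_i} \ln f(\theta; X)] = 0 \quad \forall i$$
holds.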
Proof
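Without loss of generality, let $k = 1$. Then
$$\begin{aligned}
E[\partial_\theta \ln f(\theta; X)] &= \int \partial_\theta \ln f(\theta; x)\, f(\theta; x)\,dx \\
&= \int \frac{\partial_\theta f(\theta; x)}{f(\theta; x)}\, f(\theta; x)\,dx \\
&= \int \partial_\theta f(\theta; x)\,dx \\
&= \partial_\theta \int f(\theta; x)\,dx \quad \text{(by B2)} \\
&= \partial_\theta 1 \\
&= 0.
\end{aligned}$$
Q.E.D.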
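Prop 9.2 can be verified exactly for a simple discrete family, where the integral becomes a finite sum. The sketch below assumes a Bernoulli($p$) model, which is not used in the notes and is chosen only because its support has two points; the function names are hypothetical.

```python
def bernoulli_pmf(x, p):
    """f(p; x) for x in {0, 1}."""
    return p if x == 1 else 1.0 - p


def bernoulli_score(x, p):
    """d/dp ln f(p; x)."""
    return 1.0 / p if x == 1 else -1.0 / (1.0 - p)


def expected_score(p):
    """E[d/dp ln f(p; X)], computed as an exact sum over the support."""
    return sum(bernoulli_score(x, p) * bernoulli_pmf(x, p) for x in (0, 1))


for p in (0.1, 0.5, 0.9):
    print(p, expected_score(p))
```

Each printed expectation should be zero up to floating-point rounding, mirroring the chain of equalities in the proof.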
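Def 9.3 The Fisher information matrix
Let $\{X_n\} \sim \{f(\theta; x) : \theta \in \Theta \subseteq \mathbb{R}^k\}$ and $V[X_n] < \infty$ for all $n$. Then the Fisher information matrix $I(\theta)$ is defined as
$$I(\theta) := E[(\partial_\theta \ln f(\theta; X))(\partial_\theta \ln f(\theta; X))'],$$
where the $(i, j)$ component of $I(\theta)$ is
$$I(\theta)_{ij} = E[\partial_{\theta_i} \ln f(\theta; X)\,\partial_{\theta_j} \ln f(\theta; X)], \quad i, j = 1, 2, \dots, k.$$
If $k = 1$, then
$$\begin{aligned}
E[(\partial_\theta \ln f(\theta; X))^2] &= E\left[\left(\frac{\partial_\theta f(\theta; X)}{f(\theta; X)}\right)^2\right] \\
&= \int \left(\frac{\partial_\theta f(\theta; x)}{f(\theta; x)}\right)^2 f(\theta; x)\,dx \\
&= \int \frac{(\partial_\theta f(\theta; x))^2}{f(\theta; x)}\,dx.
\end{aligned}$$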
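Prop 9.4
B3 B1 and B2 hold.
B4 There exists an integrable function $\phi(x)$ such that $\left|\frac{\partial^2}{\partial\theta_i^2} f(\theta; x)\right| < \phi(x)$ for all $i$ and $x$.
Under these conditions,
$$I(\theta)_{ii} = -E\left[\frac{\partial^2}{\partial\theta_i^2} \ln f(\theta; X)\right]$$
holds.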
Proof
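Without loss of generality, let $k = 1$, and write $l(\theta; x) := \ln f(\theta; x)$ and $\partial_\theta^2 := \partial^2/\partial\theta^2$. Then
$$l''(\theta; x) = \partial_\theta\left(\frac{\partial_\theta f(\theta; x)}{f(\theta; x)}\right) = \frac{\partial_\theta^2 f(\theta; x)}{f(\theta; x)} - \left(\frac{\partial_\theta f(\theta; x)}{f(\theta; x)}\right)^2$$
holds. Multiplying both sides by $f(\theta; x)$ and integrating with respect to $x$, we get
$$\begin{aligned}
E[l''(\theta; X)] &= \int \frac{\partial_\theta^2 f(\theta; x)}{f(\theta; x)}\, f(\theta; x)\,dx - \int \left(\frac{\partial_\theta f(\theta; x)}{f(\theta; x)}\right)^2 f(\theta; x)\,dx \\
&= \int \partial_\theta^2 f(\theta; x)\,dx - I(\theta) \\
&= \partial_\theta^2 \int f(\theta; x)\,dx - I(\theta) \quad \text{(by B4)} \\
&= -I(\theta).
\end{aligned}$$
Therefore $I(\theta) = -E[\partial_\theta^2 \ln f(\theta; X)]$ holds. Q.E.D.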
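The two expressions for the Fisher information can be compared numerically. The sketch below assumes a Poisson($\lambda$) model, which does not appear in the notes; for one observation the score is $x/\lambda - 1$ and the second derivative of the log mass function is $-x/\lambda^2$. The function names and the truncation point `x_max` are hypothetical choices for this illustration.

```python
import math


def poisson_pmf(x, lam):
    """Poisson probability mass function, evaluated on the log scale."""
    return math.exp(-lam + x * math.log(lam) - math.lgamma(x + 1))


def information_two_ways(lam, x_max=200):
    """Return (E[score^2], -E[second derivative]) for one Poisson draw.

    The infinite sum over the support is truncated at x_max, which is an
    approximation; the neglected tail mass is negligible for moderate lam.
    """
    outer = 0.0
    hessian = 0.0
    for x in range(x_max + 1):
        w = poisson_pmf(x, lam)
        outer += (x / lam - 1.0) ** 2 * w   # (d/dlam ln f)^2
        hessian += (-x / lam ** 2) * w      # d^2/dlam^2 ln f
    return outer, -hessian


# Both entries should be close to 1 / lam.
print(information_two_ways(2.5))
```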
Example4
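Let $X_1, \dots, X_n \overset{iid}{\sim} N(\mu, \sigma^2)$. Then $I(\theta)$ can be calculated as follows:
$$\frac{\partial^2 l(\theta)}{\partial\mu\,\partial\sigma^2} = -\frac{1}{\sigma^4}\sum_i (x_i - \mu),$$
$$\frac{\partial^2 l(\theta)}{\partial\mu^2} = -\frac{n}{\sigma^2},$$
$$\frac{\partial^2 l(\theta)}{\partial(\sigma^2)^2} = \frac{n}{2\sigma^4} - \frac{1}{\sigma^6}\sum_i (x_i - \mu)^2,$$
$$I(\theta) = -E\begin{bmatrix} -\dfrac{n}{\sigma^2} & -\dfrac{1}{\sigma^4}\sum_i (x_i - \mu) \\ -\dfrac{1}{\sigma^4}\sum_i (x_i - \mu) & \dfrac{n}{2\sigma^4} - \dfrac{1}{\sigma^6}\sum_i (x_i - \mu)^2 \end{bmatrix} = \begin{bmatrix} \dfrac{n}{\sigma^2} & 0 \\ 0 & \dfrac{n}{2\sigma^4} \end{bmatrix}.$$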
Example5
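Consider the following regression model:
$$y = X\beta + u, \qquad u \sim N_n(0, \sigma^2 I).$$
Then $I(\theta)$ can be calculated as follows:
$$\frac{\partial^2 l(\theta)}{\partial\beta\,\partial\sigma^2} = \frac{X'X\beta - X'y}{\sigma^4},$$
$$\frac{\partial^2 l(\theta)}{\partial\beta\,\partial\beta'} = -\frac{X'X}{\sigma^2},$$
$$\frac{\partial^2 l(\theta)}{\partial(\sigma^2)^2} = \frac{n}{2\sigma^4} - \frac{1}{\sigma^6}(y - X\beta)'(y - X\beta).$$
Using $E[X'y] = X'X\beta$ and $E[(y - X\beta)'(y - X\beta)] = n\sigma^2$,
$$\begin{aligned}
I(\theta) &= -E\begin{bmatrix} -\dfrac{X'X}{\sigma^2} & \dfrac{X'X\beta - X'y}{\sigma^4} \\ \dfrac{(X'X\beta - X'y)'}{\sigma^4} & \dfrac{n}{2\sigma^4} - \dfrac{1}{\sigma^6}(y - X\beta)'(y - X\beta) \end{bmatrix} \\
&= -\begin{bmatrix} -\dfrac{X'X}{\sigma^2} & \dfrac{X'X\beta - X'X\beta}{\sigma^4} \\ \dfrac{(X'X\beta - X'X\beta)'}{\sigma^4} & \dfrac{n}{2\sigma^4} - \dfrac{n\sigma^2}{\sigma^6} \end{bmatrix} \\
&= \begin{bmatrix} \dfrac{X'X}{\sigma^2} & 0 \\ 0 & \dfrac{n}{2\sigma^4} \end{bmatrix}.
\end{aligned}$$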
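To make the block structure of Example 5 concrete, the sketch below assembles $I(\theta)$ for a small design matrix. The design (a constant and one regressor, $n = 4$, $k = 2$), the value $\sigma^2 = 2$, and the function names `xtx` and `fisher_information` are all assumptions made for this illustration.

```python
def xtx(X):
    """Return X'X for a design matrix given as a list of rows."""
    k = len(X[0])
    return [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]


def fisher_information(X, sigma2):
    """Block-diagonal I(theta) for theta = (beta, sigma^2), as in Example 5."""
    n, k = len(X), len(X[0])
    info = [[0.0] * (k + 1) for _ in range(k + 1)]
    for i, row in enumerate(xtx(X)):
        for j, value in enumerate(row):
            info[i][j] = value / sigma2       # beta block: X'X / sigma^2
    info[k][k] = n / (2.0 * sigma2 ** 2)      # sigma^2 block: n / (2 sigma^4)
    return info


# Hypothetical design: a constant and one regressor, n = 4, k = 2.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
for row in fisher_information(X, sigma2=2.0):
    print(row)
```

The off-diagonal blocks are zero, so information about $\beta$ and about $\sigma^2$ separates in this model.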
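Prop 9.5
B5 B1 and B2 hold.
B6 $\{X_n\}$ are iid.
Under these conditions,
$$n I_1(\theta) = I(\theta)$$
holds, where $I_1(\theta)$ is the Fisher information matrix of $X_1$ and $I(\theta)$ is the Fisher information matrix of $(X_1, \dots, X_n)$.
Proof
Without loss of generality, let $k = 1$. Since $E[l'(\theta)] = 0$, $I(\theta)$ can be rewritten as follows:
$$I(\theta) = E[l'(\theta)^2] = \int l'(\theta)^2 f(\theta; x)\,dx = V[l'(\theta)].$$
By assumption, $\{X_n\}$ are iid, which implies $l'(\theta) = \sum_{i=1}^n l_1'(\theta; x_i)$, where $l_1$ denotes the log-likelihood of a single observation. Therefore
$$I(\theta) = V[l'(\theta)] = V\left[\sum_{i=1}^n l_1'(\theta; X_i)\right] = n V[l_1'(\theta; X_1)] = n I_1(\theta).$$
Q.E.D.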
Example6
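Let $X_1, \dots, X_n \overset{iid}{\sim} N(\mu, \sigma^2)$. Then $I_1(\theta)$ can be calculated as follows, where $l(\theta)$ here denotes the log-likelihood of the single observation $x_1$:
$$\frac{\partial^2 l(\theta)}{\partial\mu\,\partial\sigma^2} = -\frac{1}{\sigma^4}(x_1 - \mu),$$
$$\frac{\partial^2 l(\theta)}{\partial\mu^2} = -\frac{1}{\sigma^2},$$
$$\frac{\partial^2 l(\theta)}{\partial(\sigma^2)^2} = \frac{1}{2\sigma^4} - \frac{1}{\sigma^6}(x_1 - \mu)^2,$$
$$I_1(\theta) = -E\begin{bmatrix} -\dfrac{1}{\sigma^2} & -\dfrac{1}{\sigma^4}(x_1 - \mu) \\ -\dfrac{1}{\sigma^4}(x_1 - \mu) & \dfrac{1}{2\sigma^4} - \dfrac{1}{\sigma^6}(x_1 - \mu)^2 \end{bmatrix} = \begin{bmatrix} \dfrac{1}{\sigma^2} & 0 \\ 0 & \dfrac{1}{2\sigma^4} \end{bmatrix} = \frac{1}{n} I(\theta).$$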
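The single-observation information matrix can also be approximated from its definition $I_1(\theta) = E[(\partial_\theta \ln f)(\partial_\theta \ln f)']$ by averaging outer products of simulated scores. The sketch below is a Monte Carlo illustration under assumed values $\mu = 0$, $\sigma^2 = 2$, $n = 50$ and a fixed seed; the function names are hypothetical.

```python
import random


def score_one(x, mu, sigma2):
    """Score of a single N(mu, sigma^2) observation with respect to (mu, sigma^2)."""
    d_mu = (x - mu) / sigma2
    d_sigma2 = -1.0 / (2.0 * sigma2) + (x - mu) ** 2 / (2.0 * sigma2 ** 2)
    return d_mu, d_sigma2


def estimate_i1(mu, sigma2, reps=100000, seed=0):
    """Monte Carlo estimate of I_1(theta) = E[score score'] (a 2x2 matrix)."""
    rng = random.Random(seed)
    sigma = sigma2 ** 0.5
    acc = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(reps):
        s = score_one(rng.gauss(mu, sigma), mu, sigma2)
        for i in range(2):
            for j in range(2):
                acc[i][j] += s[i] * s[j]
    return [[v / reps for v in row] for row in acc]


i1 = estimate_i1(mu=0.0, sigma2=2.0)
n = 50
print(i1)                                    # compare with diag(1/sigma2, 1/(2 sigma2^2))
print([[n * v for v in row] for row in i1])  # Prop 9.5: I(theta) = n * I_1(theta)
```

Up to simulation noise, the first matrix should be close to $\mathrm{diag}(1/\sigma^2, 1/(2\sigma^4))$, and multiplying it by $n$ gives the matrix of Example 4.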
3 The Cramer-Rao Lower Bound
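Prop 9.6 The Cramer-Rao Lower Bound (CRLB)
Let $\{X_n\} \overset{iid}{\sim} \{f(x; \theta) : \theta \in \Theta \subseteq \mathbb{R}^k\}$ and $X := (X_1, X_2, \dots, X_n)$. Moreover, let $f_X(x; \theta)$ be the joint pdf of $X$ and let $T_i(X)$ be an unbiased estimator of $\theta_i$ for each $i$.
C1 $V[T_i(X)] < \infty$ for all $i$.
C2 $f(x; \theta)$ is differentiable on $\Theta$.
C3 $E\left[\frac{\partial^2}{\partial\theta_i^2} \ln f_X(X; \theta)\right] < \infty$ for all $i$.
C4 $E[\partial_{\theta_i} \ln f_X(X; \theta)] = 0$ for all $i$.
C5 $E[T_i(X)\,\partial_{\theta_i} \ln f_X(X; \theta)] = 1$ for all $i$.
Under these conditions,
$$V[T_i(X)] \ge [I(\theta)^{-1}]_{ii} \quad \forall i$$
holds.
Proof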
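Without loss of generality, let $k = 1$. Since $T(X)$ is unbiased, $E[T(X)] = \theta$, and differentiating both sides with respect to $\theta$ gives
$$\begin{aligned}
1 = \partial_\theta E[T(X)] &= \partial_\theta \int T(x) f_X(x; \theta)\,dx \\
&= \int T(x)\,\partial_\theta f_X(x; \theta)\,dx \\
&= \int T(x)\,\partial_\theta \ln f_X(x; \theta)\, f_X(x; \theta)\,dx \\
&= E[T(X) l'(\theta)].
\end{aligned}$$
Since $E[l'(\theta)] = 0$ (Prop 9.2), this can be rewritten as follows:
$$E[T(X) l'(\theta)] = E[(T(X) - \theta) l'(\theta)] = \mathrm{Cov}(T(X), l'(\theta)).$$
By the Cauchy-Schwarz inequality, that is, $-1 \le \dfrac{\mathrm{Cov}(X, Y)}{\sqrt{V[X]}\sqrt{V[Y]}} \le 1$, this leads to
$$1 = \mathrm{Cov}(T(X), l'(\theta))^2 \le V[T(X)]\,V[l'(\theta)] = V[T(X)]\,I(\theta).$$
Therefore $V[T(X)] \ge 1/I(\theta)$ holds. Q.E.D.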
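The two key steps of the proof, $\mathrm{Cov}(T(X), l'(\theta)) = 1$ and $V[T(X)]\,I(\theta) \ge 1$, can be checked without simulation error in a small discrete model. The sketch below assumes $n$ iid Bernoulli($p$) draws with $T(X)$ the sample mean, a model that is not part of the notes; it enumerates all $2^n$ outcomes, so sums replace the integrals. The function name `crlb_check` is hypothetical.

```python
from itertools import product


def crlb_check(p, n):
    """Exact moments for T = sample mean of n iid Bernoulli(p) draws.

    Returns (Cov(T, l'), V[T], I(p)), each computed by enumerating all
    2^n outcomes, so no simulation error is involved.
    """
    e_t = e_s = e_ts = e_tt = e_ss = 0.0
    for xs in product((0, 1), repeat=n):
        k = sum(xs)
        prob = p ** k * (1.0 - p) ** (n - k)
        t = k / n                              # estimator T(x)
        s = k / p - (n - k) / (1.0 - p)        # score l'(p) of the joint pmf
        e_t += t * prob
        e_s += s * prob
        e_ts += t * s * prob
        e_tt += t * t * prob
        e_ss += s * s * prob
    cov = e_ts - e_t * e_s
    var_t = e_tt - e_t ** 2
    info = e_ss - e_s ** 2                     # V[l'] = I(p) because E[l'] = 0
    return cov, var_t, info


cov, var_t, info = crlb_check(p=0.3, n=4)
print(cov, var_t, 1.0 / info)
```

In this model the sample mean attains the bound, so the last two printed numbers should agree up to rounding.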
Def 9.7 Uniformly Minimum Variance Unbiased Estimator (UMVUE)
Let $T(X)$ be an unbiased estimator of $\theta$. If
$$V[T'(X)] \ge V[T(X)] \quad \text{for any unbiased estimator } T'(X) \text{ of } \theta$$
holds for all $\theta \in \Theta$, then $T(X)$ is said to be the Uniformly Minimum Variance Unbiased Estimator (UMVUE).
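Thm 9.8
Let $T(X)$ be an unbiased estimator of $\theta$. If
$$V[T(X)] = I(\theta)^{-1} \quad \forall\theta$$
holds, then $T(X)$ is the UMVUE.
Proof
Obvious from Prop 9.6: no unbiased estimator can have variance below $I(\theta)^{-1}$. Q.E.D.
Example7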
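Let $\{X_n\} \overset{iid}{\sim} N(\mu, \sigma^2)$ with $\sigma^2 < \infty$. Then $\bar X := n^{-1}\sum_i X_i$ is the UMVUE of $\mu$.
Proof
First we have to check assumptions C1 to C5, with $T(X) = \bar X$ and $\theta = \mu$.
C1 $V[\bar X] = n^{-1}\sigma^2 < \infty$.
C2 $f(x; \mu)$ is differentiable in $\mu$ on $\Theta$.
C3 $E\left[\dfrac{\partial^2 l(\theta)}{\partial\mu^2}\right] = -\dfrac{n}{\sigma^2} < \infty$.
C4 $E[\partial_\mu l(\theta)] = \dfrac{1}{\sigma^2} E\left[\sum_i (X_i - \mu)\right] = 0$.
C5 $E[T\,\partial_\mu \ln f_X(X; \theta)] = E\left[\bar X\,\dfrac{n}{\sigma^2}(\bar X - \mu)\right] = \dfrac{n}{\sigma^2} E[\bar X^2 - \mu \bar X] = \dfrac{n}{\sigma^2} V[\bar X] = 1$.
Thus the CRLB is given by $1/I(\theta) = \sigma^2/n = V[\bar X]$. Therefore $\bar X := n^{-1}\sum_i X_i$ is the UMVUE by Thm 9.8. Q.E.D.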
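As a numerical companion to Example 7, the sketch below compares the sample mean with another unbiased estimator of $\mu$, the sample median (unbiased here by the symmetry of the normal density). The settings $n = 25$, $\sigma = 1$, the seed, and the function names are assumptions for this illustration, and the results are Monte Carlo estimates rather than exact values.

```python
import random
import statistics


def variance(values):
    """Population variance of a list of numbers."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def estimator_variances(n=25, sigma=1.0, reps=20000, seed=0):
    """Monte Carlo variances of the sample mean and median under N(0, sigma^2)."""
    rng = random.Random(seed)
    means, medians = [], []
    for _ in range(reps):
        xs = [rng.gauss(0.0, sigma) for _ in range(n)]
        means.append(sum(xs) / n)
        medians.append(statistics.median(xs))
    return variance(means), variance(medians)


var_mean, var_median = estimator_variances()
crlb = 1.0 ** 2 / 25  # sigma^2 / n
print(var_mean, var_median, crlb)
```

The simulated variance of $\bar X$ should be close to the bound $\sigma^2/n$, while the median's variance should be visibly larger, in line with $\bar X$ being the UMVUE.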