
4. Asymptotic Normality of MLE:

Let $\tilde{\theta}$ be the MLE of $\theta$.

As $n$ goes to infinity, we have the following result:

$$\sqrt{n}\,(\tilde{\theta} - \theta) \longrightarrow N\left(0, \; \lim_{n\to\infty}\left(\frac{I(\theta)}{n}\right)^{-1}\right),$$

where it is assumed that $\lim_{n\to\infty} \left(\frac{I(\theta)}{n}\right)$ converges.

That is, when $n$ is large, $\tilde{\theta}$ is approximately distributed as follows:

$$\tilde{\theta} \sim N\left(\theta, \; (I(\theta))^{-1}\right).$$

Suppose that $s(X) = \tilde{\theta}$.

When $n$ is large, $V(s(X))$ is approximately equal to $(I(\theta))^{-1}$.

5. Optimization:

MLE of θ results in the following maximization problem:

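$$\max_{\theta} \; \log L(\theta; x).$$

Often the solution for $\theta$ cannot be derived in closed form.

$\Longrightarrow$ Optimization procedure

Expand the first-order condition to first order around a given value $\theta^*$:

$$0 = \frac{\partial \log L(\theta; x)}{\partial\theta} \approx \frac{\partial \log L(\theta^*; x)}{\partial\theta} + \frac{\partial^2 \log L(\theta^*; x)}{\partial\theta\,\partial\theta'}\,(\theta - \theta^*).$$

Solving the above equation with respect to $\theta$, we obtain the following: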

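$$\theta = \theta^* - \left(\frac{\partial^2 \log L(\theta^*; x)}{\partial\theta\,\partial\theta'}\right)^{-1} \frac{\partial \log L(\theta^*; x)}{\partial\theta}.$$

Replace the variables as follows:

$$\theta \longrightarrow \theta^{(i+1)}, \qquad \theta^* \longrightarrow \theta^{(i)}.$$

Then, we have:

$$\theta^{(i+1)} = \theta^{(i)} - \left(\frac{\partial^2 \log L(\theta^{(i)}; x)}{\partial\theta\,\partial\theta'}\right)^{-1} \frac{\partial \log L(\theta^{(i)}; x)}{\partial\theta}.$$

$\Longrightarrow$ Newton-Raphson method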

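Replacing $\frac{\partial^2 \log L(\theta^{(i)}; x)}{\partial\theta\,\partial\theta'}$ by its expectation $E\left(\frac{\partial^2 \log L(\theta^{(i)}; x)}{\partial\theta\,\partial\theta'}\right)$, we obtain the following optimization algorithm:

$$\begin{aligned}
\theta^{(i+1)} &= \theta^{(i)} - \left(E\left(\frac{\partial^2 \log L(\theta^{(i)}; x)}{\partial\theta\,\partial\theta'}\right)\right)^{-1} \frac{\partial \log L(\theta^{(i)}; x)}{\partial\theta} \\
&= \theta^{(i)} + \left(I(\theta^{(i)})\right)^{-1} \frac{\partial \log L(\theta^{(i)}; x)}{\partial\theta}
\end{aligned}$$

$\Longrightarrow$ Method of Scoring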
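The two updates can be sketched on a toy model. The model is an assumption chosen for illustration and is not part of these notes: $x_1, \cdots, x_n$ are iid exponential with rate $\lambda$, so $\log L(\lambda) = n \log\lambda - \lambda \sum x_i$, the score is $n/\lambda - \sum x_i$, the second derivative is $-n/\lambda^2$, and $I(\lambda) = n/\lambda^2$. The function names and the data are hypothetical. The closed-form MLE $1/\bar{x}$ is used only to check the iterations.

```python
def score(lam, data):
    """d log L / d lam for log L(lam) = n*log(lam) - lam*sum(data)."""
    return len(data) / lam - sum(data)


def hessian(lam, data):
    """d^2 log L / d lam^2 (it does not depend on the data values here)."""
    return -len(data) / lam ** 2


def information(lam, data):
    """Fisher information I(lam) = -E[hessian] = n / lam^2."""
    return len(data) / lam ** 2


def newton_step(lam, data):
    # theta(i+1) = theta(i) - (Hessian)^{-1} * score
    return lam - score(lam, data) / hessian(lam, data)


def scoring_step(lam, data):
    # theta(i+1) = theta(i) + I(theta(i))^{-1} * score
    return lam + score(lam, data) / information(lam, data)


def iterate(step, data, start, tol=1e-12, max_iter=100):
    """Repeat theta(i+1) = step(theta(i)) until the change is below tol."""
    lam = start
    for _ in range(max_iter):
        new = step(lam, data)
        if abs(new - lam) < tol:
            return new
        lam = new
    return lam


data = [0.8, 1.3, 0.4, 2.1, 0.9, 1.5]   # made-up observations
lam_nr = iterate(newton_step, data, start=0.5)
lam_sc = iterate(scoring_step, data, start=0.5)
closed_form = len(data) / sum(data)      # known MLE: 1 / sample mean
print(lam_nr, lam_sc, closed_form)
```

In this particular model the Hessian does not depend on the data, so the Newton-Raphson and scoring steps coincide; in general they differ. The starting value matters: here the iteration is $\lambda^{(i+1)} = \lambda^{(i)}(2 - \lambda^{(i)}\bar{x})$, which converges only from $0 < \lambda^{(0)} < 2/\bar{x}$.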


9.1 MLE: The Case of Single Regression Model

The regression model:

$$y_i = \beta_1 + \beta_2 x_i + u_i,$$

1. $u_i \sim N(0, \sigma^2)$ is assumed.

2. The density function of $u_i$ is:

$$f(u_i) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2} u_i^2\right).$$

Because $u_1, u_2, \cdots, u_n$ are mutually independently distributed, the joint density function of $u_1, u_2, \cdots, u_n$ is written as:

$$f(u_1, u_2, \cdots, u_n) = f(u_1) f(u_2) \cdots f(u_n) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^n u_i^2\right).$$

3. Using the transformation of variable ($u_i = y_i - \beta_1 - \beta_2 x_i$), the joint density function of $y_1, y_2, \cdots, y_n$ is given by:

$$f(y_1, y_2, \cdots, y_n) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \beta_1 - \beta_2 x_i)^2\right) \equiv L(\beta_1, \beta_2, \sigma^2 \mid y_1, y_2, \cdots, y_n).$$

$L(\beta_1, \beta_2, \sigma^2 \mid y_1, y_2, \cdots, y_n)$ is called the likelihood function.

$\log L(\beta_1, \beta_2, \sigma^2 \mid y_1, y_2, \cdots, y_n)$ is called the log-likelihood function.

$$\log L(\beta_1, \beta_2, \sigma^2 \mid y_1, y_2, \cdots, y_n) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \beta_1 - \beta_2 x_i)^2$$

4. Transformation of Variable:

Suppose that the density function of a random variable $X$ is $f_x(x)$.

Defining $X = g(Y)$, the density function of $Y$, $f_y(y)$, is given by:

$$f_y(y) = f_x(g(y)) \left|\frac{dg(y)}{dy}\right|.$$

In the case where $X$ and $g(Y)$ are $n \times 1$ vectors, $\left|\frac{dg(y)}{dy}\right|$ should be replaced by $\left|\frac{\partial g(y)}{\partial y'}\right|$, which is the absolute value of the determinant of the matrix $\frac{\partial g(y)}{\partial y'}$.

Example: When $X \sim U(0, 1)$, derive the density function of $Y = -\log(X)$.

The density function of $X$ is $f_x(x) = 1$ for $0 < x < 1$.

$X = \exp(-Y)$ is obtained, i.e., $g(y) = \exp(-y)$.

Therefore, the density function of $Y$, $f_y(y)$, is given by:

$$f_y(y) = \left|\frac{dx}{dy}\right| f_x(g(y)) = |-\exp(-y)| = \exp(-y), \qquad y > 0.$$
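The example can be checked numerically. The sketch below is illustrative only: it draws $X \sim U(0,1)$, forms $Y = -\log(X)$, and compares the empirical distribution function of $Y$ with the integral of the density obtained from the change-of-variable formula. The seed, sample size, and function names are arbitrary choices, not part of the notes.

```python
import math
import random


def density_y(y):
    """f_y(y) = f_x(g(y)) * |dg(y)/dy| with g(y) = exp(-y) and f_x = 1 on (0, 1)."""
    f_x = 1.0                          # U(0,1) density evaluated at g(y), valid for y > 0
    jacobian = abs(-math.exp(-y))      # |dg(y)/dy|
    return f_x * jacobian


def implied_cdf(y, steps=10_000):
    """Midpoint-rule integral of density_y over [0, y]."""
    h = y / steps
    return sum(density_y((k + 0.5) * h) for k in range(steps)) * h


rng = random.Random(0)
# 1 - random() lies in (0, 1], which avoids log(0)
draws = [-math.log(1.0 - rng.random()) for _ in range(100_000)]


def empirical_cdf(y):
    """Share of simulated Y values that are <= y."""
    return sum(d <= y for d in draws) / len(draws)


for y in (0.5, 1.0, 2.0):
    print(y, round(empirical_cdf(y), 4), round(implied_cdf(y), 4))
```

The two columns should agree up to simulation noise, which is what the formula $f_y(y) = \exp(-y)$ predicts.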


5. Given the observed data $y_1, y_2, \cdots, y_n$, the likelihood function $L(\beta_1, \beta_2, \sigma^2 \mid y_1, y_2, \cdots, y_n)$, or the log-likelihood function $\log L(\beta_1, \beta_2, \sigma^2 \mid y_1, y_2, \cdots, y_n)$, is maximized with respect to $(\beta_1, \beta_2, \sigma^2)$.

Solve the following three simultaneous equations:

$$\frac{\partial \log L(\beta_1, \beta_2, \sigma^2 \mid y_1, y_2, \cdots, y_n)}{\partial\beta_1} = \frac{1}{\sigma^2} \sum_{i=1}^n (y_i - \beta_1 - \beta_2 x_i) = 0,$$

$$\frac{\partial \log L(\beta_1, \beta_2, \sigma^2 \mid y_1, y_2, \cdots, y_n)}{\partial\beta_2} = \frac{1}{\sigma^2} \sum_{i=1}^n (y_i - \beta_1 - \beta_2 x_i)\,x_i = 0,$$

$$\frac{\partial \log L(\beta_1, \beta_2, \sigma^2 \mid y_1, y_2, \cdots, y_n)}{\partial\sigma^2} = -\frac{n}{2}\frac{1}{\sigma^2} + \frac{1}{2\sigma^4} \sum_{i=1}^n (y_i - \beta_1 - \beta_2 x_i)^2 = 0.$$

The solutions of $(\beta_1, \beta_2, \sigma^2)$ are called the maximum likelihood estimates, denoted by $(\tilde{\beta}_1, \tilde{\beta}_2, \tilde{\sigma}^2)$.

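The maximum likelihood estimates are:

$$\tilde{\beta}_2 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}, \qquad \tilde{\beta}_1 = \bar{y} - \tilde{\beta}_2 \bar{x}, \qquad \tilde{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n (y_i - \tilde{\beta}_1 - \tilde{\beta}_2 x_i)^2.$$

The MLE of $\sigma^2$ divides the residual sum of squares by $n$, not $n - 2$.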
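The closed-form estimates above can be computed directly. The following is a minimal sketch with made-up data; the function name `simple_regression_mle` is hypothetical. It also evaluates the first two first-order conditions at the estimates to confirm that they are zero.

```python
def simple_regression_mle(x, y):
    """Return (beta1, beta2, sigma2) maximizing the normal log-likelihood."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    beta2 = sxy / sxx
    beta1 = y_bar - beta2 * x_bar
    resid = [yi - beta1 - beta2 * xi for xi, yi in zip(x, y)]
    sigma2 = sum(e * e for e in resid) / n     # divided by n, not n - 2
    return beta1, beta2, sigma2


x = [1.0, 2.0, 3.0, 4.0, 5.0]                  # made-up data
y = [2.1, 3.9, 6.2, 7.8, 10.1]
beta1, beta2, sigma2 = simple_regression_mle(x, y)

# First-order conditions with respect to beta1 and beta2 (up to the factor 1/sigma2)
resid = [yi - beta1 - beta2 * xi for xi, yi in zip(x, y)]
foc_beta1 = sum(resid)
foc_beta2 = sum(e * xi for e, xi in zip(resid, x))

# The usual unbiased estimator rescales the MLE by n / (n - 2)
n = len(x)
s2_unbiased = sigma2 * n / (n - 2)
print(beta1, beta2, sigma2, s2_unbiased)
```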


9.2 MLE: The Case of Multiple Regression Model I

1. Multivariate Normal Distribution: $X$ is $n \times 1$ and $X \sim N(\mu, \Sigma)$

The density function of $X$ is:

$$f(x) = (2\pi)^{-n/2} |\Sigma|^{-1/2} \exp\left(-\frac{1}{2}(x - \mu)' \Sigma^{-1} (x - \mu)\right).$$

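2. Regression model: $y = X\beta + u$, $u \sim N(0, \sigma^2 I_n)$

Transformation of Variables from $u$ to $y$:

$$f_u(u) = (2\pi\sigma^2)^{-n/2} \exp\left(-\frac{1}{2\sigma^2} u'u\right),$$

$$\begin{aligned}
f_y(y) &= f_u(y - X\beta) \left|\frac{\partial u}{\partial y'}\right| \\
&= (2\pi\sigma^2)^{-n/2} \exp\left(-\frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta)\right) \\
&= L(\theta; y, X),
\end{aligned}$$

where $\theta = (\beta, \sigma^2)$, because of $\frac{\partial u}{\partial y'} = I_n$.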

Therefore, the log-likelihood function is:

$$\log L(\theta; y, X) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta).$$

Note that $|\Sigma|^{-1/2} = |\sigma^2 I_n|^{-1/2} = (\sigma^2)^{-n/2}$.

3. $\max_{\theta} \; \log L(\theta; y, X)$

(FOC) $\frac{\partial \log L(\theta; y, X)}{\partial\theta} = 0$

(SOC) $\frac{\partial^2 \log L(\theta; y, X)}{\partial\theta\,\partial\theta'}$ is a negative definite matrix.

We obtain the MLE of $\beta$ and $\sigma^2$:

$$\tilde{\beta} = (X'X)^{-1} X'y, \qquad \tilde{\sigma}^2 = \frac{(y - X\tilde{\beta})'(y - X\tilde{\beta})}{n},$$

where $\tilde{\sigma}^2$ is divided by $n$, not $n - k$.
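As an illustration of these formulas, the sketch below computes $\tilde{\beta}$ and $\tilde{\sigma}^2$ on simulated data. It assumes NumPy is available; the design matrix, the true coefficients, and the name `regression_mle` are made up for the example.

```python
import numpy as np


def regression_mle(X, y):
    """MLE under u ~ N(0, sigma^2 I_n): beta = (X'X)^{-1} X'y, sigma2 = e'e / n."""
    n = X.shape[0]
    # Solve the normal equations X'X beta = X'y rather than forming the inverse
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ beta
    sigma2 = resid @ resid / n                 # divided by n, not n - k
    return beta, sigma2


rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.7, size=n)

beta_hat, sigma2_hat = regression_mle(X, y)
k = X.shape[1]
s2_unbiased = sigma2_hat * n / (n - k)         # the usual OLS variance estimator
print(beta_hat, sigma2_hat, s2_unbiased)
```

The coefficient estimate coincides with OLS; only the variance estimate differs, by the factor $n/(n-k)$.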

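4. Fisher's information matrix is:

$$I(\theta) = -E\left(\frac{\partial^2 \log L(\theta; y, X)}{\partial\theta\,\partial\theta'}\right)$$

The inverse of the information matrix, $I(\theta)^{-1}$, provides a lower bound of the variance-covariance matrix for unbiased estimators of $\theta$.

$$I(\theta)^{-1} = \begin{pmatrix} \sigma^2 (X'X)^{-1} & 0 \\ 0 & \dfrac{2\sigma^4}{n} \end{pmatrix}$$

For large $n$, we approximately obtain:

$$\begin{pmatrix} \tilde{\beta} \\ \tilde{\sigma}^2 \end{pmatrix} \sim N\left( \begin{pmatrix} \beta \\ \sigma^2 \end{pmatrix}, \; \begin{pmatrix} \sigma^2 (X'X)^{-1} & 0 \\ 0 & \dfrac{2\sigma^4}{n} \end{pmatrix} \right).$$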
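The form of $I(\theta)^{-1}$ can be verified from the log-likelihood of this section. The following worked step is a sketch that uses only $u = y - X\beta$, $E(u) = 0$ and $E(u'u) = n\sigma^2$. The second derivatives are:

$$\frac{\partial^2 \log L}{\partial\beta\,\partial\beta'} = -\frac{1}{\sigma^2} X'X, \qquad \frac{\partial^2 \log L}{\partial\beta\,\partial\sigma^2} = -\frac{1}{\sigma^4} X'(y - X\beta), \qquad \frac{\partial^2 \log L}{\partial(\sigma^2)^2} = \frac{n}{2\sigma^4} - \frac{1}{\sigma^6}(y - X\beta)'(y - X\beta).$$

Taking expectations, the cross term vanishes because $E(u) = 0$, and the last term becomes $\frac{n}{2\sigma^4} - \frac{n\sigma^2}{\sigma^6} = -\frac{n}{2\sigma^4}$. Therefore:

$$I(\theta) = \begin{pmatrix} \dfrac{1}{\sigma^2} X'X & 0 \\ 0 & \dfrac{n}{2\sigma^4} \end{pmatrix}, \qquad I(\theta)^{-1} = \begin{pmatrix} \sigma^2 (X'X)^{-1} & 0 \\ 0 & \dfrac{2\sigma^4}{n} \end{pmatrix}.$$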


9.3 MLE: The Case of Multiple Regression Model II

1. Regression model: $y = X\beta + u$, $u \sim N(0, \sigma^2 \Omega)$

Transformation of Variables from $u$ to $y$:

$$f_u(u) = (2\pi\sigma^2)^{-n/2} |\Omega|^{-1/2} \exp\left(-\frac{1}{2\sigma^2} u'\Omega^{-1}u\right),$$

$$\begin{aligned}
f_y(y) &= f_u(y - X\beta) \left|\frac{\partial u}{\partial y'}\right| \\
&= (2\pi\sigma^2)^{-n/2} |\Omega|^{-1/2} \exp\left(-\frac{1}{2\sigma^2}(y - X\beta)'\Omega^{-1}(y - X\beta)\right) \\
&= L(\theta; y, X),
\end{aligned}$$

where $\theta = (\beta, \sigma^2)$, because of $\frac{\partial u}{\partial y'} = I_n$.

The log-likelihood function is:

$$\log L(\theta; y, X) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2}\log|\Omega| - \frac{1}{2\sigma^2}(y - X\beta)'\Omega^{-1}(y - X\beta).$$

2. $\max_{\theta} \; \log L(\theta; y, X)$

(FOC) $\frac{\partial \log L(\theta; y, X)}{\partial\theta} = 0$

(SOC) $\frac{\partial^2 \log L(\theta; y, X)}{\partial\theta\,\partial\theta'}$ is a negative definite matrix.

Then, we obtain the MLE of $\beta$ and $\sigma^2$:

$$\tilde{\beta} = (X'\Omega^{-1}X)^{-1} X'\Omega^{-1}y, \qquad \tilde{\sigma}^2 = \frac{(y - X\tilde{\beta})'\Omega^{-1}(y - X\tilde{\beta})}{n}.$$

3. Fisher's information matrix is defined as:

$$I(\theta) = -E\left(\frac{\partial^2 \log L(\theta; y, X)}{\partial\theta\,\partial\theta'}\right)$$

The inverse of the information matrix, $I(\theta)^{-1}$, provides a lower bound of the variance-covariance matrix for unbiased estimators of $\theta$, which is given by:

$$I(\theta)^{-1} = \begin{pmatrix} \sigma^2 (X'\Omega^{-1}X)^{-1} & 0 \\ 0 & \dfrac{2\sigma^4}{n} \end{pmatrix}$$

9.4 MLE: AR(1) Model

The $p$th-order Autoregressive Model, i.e., AR($p$) Model:

$$y_t = \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_p y_{t-p} + u_t$$

AR(1) Model: for $t = 2, 3, \cdots, n$,

$$y_t = \phi_1 y_{t-1} + u_t, \qquad u_t \sim N(0, \sigma^2),$$

where $|\phi_1| < 1$ is assumed for now.

To obtain the joint density function of $y_1, y_2, \cdots, y_n$, $f(y_n, y_{n-1}, \cdots, y_1)$ is decomposed as follows:

$$f(y_n, y_{n-1}, \cdots, y_1) = f(y_1) \prod_{t=2}^n f(y_t \mid y_{t-1}, \cdots, y_1).$$

From $y_t = \phi_1 y_{t-1} + u_t$, we can obtain:

$$E(y_t \mid y_{t-1}, \cdots, y_1) = \phi_1 y_{t-1}, \quad \text{and} \quad V(y_t \mid y_{t-1}, \cdots, y_1) = \sigma^2.$$

Therefore, the conditional distribution $f(y_t \mid y_{t-1}, \cdots, y_1)$ is:

$$f(y_t \mid y_{t-1}, \cdots, y_1) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}(y_t - \phi_1 y_{t-1})^2\right).$$

To obtain the unconditional distribution $f(y_t)$, $y_t$ is rewritten as follows:

$$\begin{aligned}
y_t &= \phi_1 y_{t-1} + u_t \\
&= \phi_1^2 y_{t-2} + u_t + \phi_1 u_{t-1} \\
&\;\;\vdots \\
&= \phi_1^j y_{t-j} + u_t + \phi_1 u_{t-1} + \cdots + \phi_1^{j-1} u_{t-j+1} \\
&\;\;\vdots \\
&= u_t + \phi_1 u_{t-1} + \phi_1^2 u_{t-2} + \cdots, \quad \text{when } j \text{ goes to infinity.}
\end{aligned}$$

The unconditional expectation and variance of $y_t$ are:

$$E(y_t) = 0, \quad \text{and} \quad V(y_t) = \sigma^2 (1 + \phi_1^2 + \phi_1^4 + \cdots) = \frac{\sigma^2}{1 - \phi_1^2}.$$

Therefore, the unconditional distribution of $y_t$ is given by:

$$f(y_t) = \frac{1}{\sqrt{2\pi\sigma^2/(1 - \phi_1^2)}} \exp\left(-\frac{1}{2\sigma^2/(1 - \phi_1^2)}\, y_t^2\right).$$

Finally, the joint distribution of $y_1, y_2, \cdots, y_n$ is given by:

$$\begin{aligned}
f(y_n, y_{n-1}, \cdots, y_1) &= f(y_1) \prod_{t=2}^n f(y_t \mid y_{t-1}, \cdots, y_1) \\
&= \frac{1}{\sqrt{2\pi\sigma^2/(1 - \phi_1^2)}} \exp\left(-\frac{1}{2\sigma^2/(1 - \phi_1^2)}\, y_1^2\right) \times \prod_{t=2}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1}{2\sigma^2}(y_t - \phi_1 y_{t-1})^2\right)
\end{aligned}$$

The log-likelihood function is:

$$\begin{aligned}
\log L(\phi_1, \sigma^2; y_n, y_{n-1}, \cdots, y_1) = &-\frac{1}{2}\log\left(2\pi\sigma^2/(1 - \phi_1^2)\right) - \frac{1}{2\sigma^2/(1 - \phi_1^2)}\, y_1^2 \\
&- \frac{n-1}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{t=2}^n (y_t - \phi_1 y_{t-1})^2.
\end{aligned}$$

Maximize $\log L$ with respect to $\phi_1$ and $\sigma^2$.

Maximization Procedure:

• Newton-Raphson Method, or Method of Scoring

• Simple Grid Search (search for the maximum within the range $-1 < \phi_1 < 1$, changing the value of $\phi_1$ by 0.01)
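The grid search can be sketched as follows. One step here is an addition to the notes: for a given $\phi_1$, setting the derivative of the log-likelihood above with respect to $\sigma^2$ to zero gives $\tilde{\sigma}^2(\phi_1) = \frac{1}{n}\left((1 - \phi_1^2) y_1^2 + \sum_{t=2}^n (y_t - \phi_1 y_{t-1})^2\right)$, so only $\phi_1$ needs to be searched. The data are simulated and all names and parameter values are hypothetical.

```python
import math
import random


def ar1_loglik(phi, sigma2, y):
    """Exact AR(1) log-likelihood: unconditional term for y[0] plus conditional terms."""
    n = len(y)
    var1 = sigma2 / (1.0 - phi ** 2)           # unconditional variance of y_1
    ll = -0.5 * math.log(2 * math.pi * var1) - y[0] ** 2 / (2 * var1)
    ll -= 0.5 * (n - 1) * math.log(2 * math.pi * sigma2)
    ll -= sum((y[t] - phi * y[t - 1]) ** 2 for t in range(1, n)) / (2 * sigma2)
    return ll


def sigma2_given_phi(phi, y):
    """sigma^2 maximizing the log-likelihood for a fixed phi (derived, see lead-in)."""
    n = len(y)
    ss = (1.0 - phi ** 2) * y[0] ** 2
    ss += sum((y[t] - phi * y[t - 1]) ** 2 for t in range(1, n))
    return ss / n


def grid_search_phi(y, step=0.01):
    """Evaluate the log-likelihood on -1 < phi < 1 in steps of `step`."""
    m = round(1 / step)
    best = None
    for k in range(-m + 1, m):                 # phi = -0.99, ..., 0.99 for step 0.01
        phi = k * step
        s2 = sigma2_given_phi(phi, y)
        ll = ar1_loglik(phi, s2, y)
        if best is None or ll > best[0]:
            best = (ll, phi, s2)
    return best


# Simulated AR(1) data with phi_1 = 0.6 and sigma = 1 (illustrative values)
rng = random.Random(1)
true_phi = 0.6
y = [rng.gauss(0.0, 1.0 / math.sqrt(1.0 - true_phi ** 2))]
for _ in range(499):
    y.append(true_phi * y[-1] + rng.gauss(0.0, 1.0))

loglik, phi_hat, sigma2_hat = grid_search_phi(y)
print(phi_hat, sigma2_hat, loglik)
```

The grid resolution bounds the precision of $\tilde{\phi}_1$; a finer grid around the maximizer, or a few Newton-Raphson steps started from it, would refine the estimate.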


9.5 MLE: Regression Model with AR(1) Error

When the error term is autocorrelated, the regression model is written as:

$$y_t = x_t \beta + u_t, \qquad u_t = \rho u_{t-1} + \epsilon_t, \qquad \epsilon_t \sim \text{iid } N(0, \sigma^2).$$

The joint distribution of $u_n, u_{n-1}, \cdots, u_1$ is:

$$\begin{aligned}
f_u(u_n, u_{n-1}, \cdots, u_1; \rho, \sigma^2) &= f_u(u_1; \rho, \sigma^2) \prod_{t=2}^n f_u(u_t \mid u_{t-1}, \cdots, u_1; \rho, \sigma^2) \\
&= \left(2\pi\sigma^2/(1 - \rho^2)\right)^{-1/2} \exp\left(-\frac{1}{2\sigma^2/(1 - \rho^2)}\, u_1^2\right) \\
&\quad \times (2\pi\sigma^2)^{-(n-1)/2} \exp\left(-\frac{1}{2\sigma^2} \sum_{t=2}^n (u_t - \rho u_{t-1})^2\right).
\end{aligned}$$

By transformation of variables from $u_n, u_{n-1}, \cdots, u_1$ to $y_n, y_{n-1}, \cdots, y_1$, the joint distribution of $y_n, y_{n-1}, \cdots, y_1$ is:

$$\begin{aligned}
f_y(y_n, y_{n-1}, \cdots, y_1; \rho, \sigma^2, \beta)
&= f_u(y_n - x_n\beta, y_{n-1} - x_{n-1}\beta, \cdots, y_1 - x_1\beta; \rho, \sigma^2) \left|\frac{\partial u}{\partial y'}\right| \\
&= \left(2\pi\sigma^2/(1 - \rho^2)\right)^{-1/2} \exp\left(-\frac{1}{2\sigma^2/(1 - \rho^2)} (y_1 - x_1\beta)^2\right) \\
&\quad \times (2\pi\sigma^2)^{-(n-1)/2} \exp\left(-\frac{1}{2\sigma^2} \sum_{t=2}^n \left((y_t - \rho y_{t-1}) - (x_t - \rho x_{t-1})\beta\right)^2\right) \\
&= (2\pi\sigma^2)^{-1/2} (1 - \rho^2)^{1/2} \exp\left(-\frac{1}{2\sigma^2} \left(\sqrt{1 - \rho^2}\, y_1 - \sqrt{1 - \rho^2}\, x_1\beta\right)^2\right) \\
&\quad \times (2\pi\sigma^2)^{-(n-1)/2} \exp\left(-\frac{1}{2\sigma^2} \sum_{t=2}^n \left((y_t - \rho y_{t-1}) - (x_t - \rho x_{t-1})\beta\right)^2\right) \\
&= (2\pi\sigma^2)^{-n/2} (1 - \rho^2)^{1/2} \exp\left(-\frac{1}{2\sigma^2} (y_1^* - x_1^*\beta)^2\right) \times \exp\left(-\frac{1}{2\sigma^2} \sum_{t=2}^n (y_t^* - x_t^*\beta)^2\right) \\
&= (2\pi)^{-n/2} (\sigma^2)^{-n/2} (1 - \rho^2)^{1/2} \exp\left(-\frac{1}{2\sigma^2} \sum_{t=1}^n (y_t^* - x_t^*\beta)^2\right) \\
&= L(\rho, \sigma^2, \beta; y_n, y_{n-1}, \cdots, y_1),
\end{aligned}$$

where $y_t^*$ and $x_t^*$ are given by:

$$y_t^* = \begin{cases} \sqrt{1 - \rho^2}\, y_t, & \text{for } t = 1, \\ y_t - \rho y_{t-1}, & \text{for } t = 2, 3, \cdots, n, \end{cases} \qquad x_t^* = \begin{cases} \sqrt{1 - \rho^2}\, x_t, & \text{for } t = 1, \\ x_t - \rho x_{t-1}, & \text{for } t = 2, 3, \cdots, n. \end{cases}$$

◎ For maximization, the first derivative of $L(\rho, \sigma^2, \beta; y_n, y_{n-1}, \cdots, y_1)$ with respect to $\beta$ should be zero.

$$\tilde{\beta} = \left(\sum_{t=1}^n x_t^{*\prime} x_t^*\right)^{-1} \left(\sum_{t=1}^n x_t^{*\prime} y_t^*\right) = (X^{*\prime} X^*)^{-1} X^{*\prime} y^*$$

$\Longrightarrow$ This is equivalent to OLS from the regression model: $y^* = X^*\beta + \epsilon^*$ and $\epsilon^* \sim N(0, \sigma^2 I_n)$, where $\sigma^2$ is the variance of $\epsilon_t$, so that $V(u_t) = \sigma^2/(1 - \rho^2)$.

◎ For maximization, the first derivative of $L(\rho, \sigma^2, \beta; y_n, y_{n-1}, \cdots, y_1)$ with respect to $\sigma^2$ should be zero.

$$\tilde{\sigma}^2 = \frac{1}{n} \sum_{t=1}^n (y_t^* - x_t^*\beta)^2 = \frac{1}{n} (y^* - X^*\beta)'(y^* - X^*\beta),$$

where

$$y^* = \begin{pmatrix} y_1^* \\ y_2^* \\ \vdots \\ y_n^* \end{pmatrix} = \begin{pmatrix} \sqrt{1 - \rho^2}\, y_1 \\ y_2 - \rho y_1 \\ \vdots \\ y_n - \rho y_{n-1} \end{pmatrix}, \qquad X^* = \begin{pmatrix} x_1^* \\ x_2^* \\ \vdots \\ x_n^* \end{pmatrix} = \begin{pmatrix} \sqrt{1 - \rho^2}\, x_1 \\ x_2 - \rho x_1 \\ \vdots \\ x_n - \rho x_{n-1} \end{pmatrix}.$$

◎ For maximization, the first derivative of $L(\rho, \sigma^2, \beta; y_n, y_{n-1}, \cdots, y_1)$ with respect to $\rho$ should be zero.

$\max_{\beta, \sigma^2, \rho} L(\rho, \sigma^2, \beta; y)$ is equivalent to $\max_{\rho} L(\rho, \tilde{\sigma}^2, \tilde{\beta}; y)$.

$\log L(\rho, \tilde{\sigma}^2, \tilde{\beta}; y)$ is called the concentrated log-likelihood function, which is a function of $\rho$ only, because both $\tilde{\sigma}^2$ and $\tilde{\beta}$ depend only on $\rho$.

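The log-likelihood function is written as:

$$\begin{aligned}
\log L(\rho, \tilde{\sigma}^2, \tilde{\beta}; y) &= -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\tilde{\sigma}^2) + \frac{1}{2}\log(1 - \rho^2) - \frac{n}{2} \\
&= -\frac{n}{2}\log(2\pi) - \frac{n}{2} - \frac{n}{2}\log\left(\tilde{\sigma}^2(\rho)\right) + \frac{1}{2}\log(1 - \rho^2).
\end{aligned}$$

For maximization of $\log L$, use the Newton-Raphson method, the method of scoring, or a simple grid search.

Note that $\tilde{\sigma}^2 = \tilde{\sigma}^2(\rho) = \frac{1}{n} (y^* - X^*\tilde{\beta})'(y^* - X^*\tilde{\beta})$ for $\tilde{\beta} = (X^{*\prime} X^*)^{-1} X^{*\prime} y^*$.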
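A grid search over $\rho$ on the concentrated log-likelihood can be sketched as below. To keep the example free of matrix algebra it assumes a single regressor without an intercept, so that $\tilde{\beta} = \sum x_t^* y_t^* / \sum x_t^{*2}$; this restriction, the simulated data, and all function names are assumptions of the sketch, not part of the notes.

```python
import math
import random


def star_transform(series, rho):
    """sqrt(1 - rho^2) * z_1 for t = 1, and z_t - rho * z_{t-1} for t = 2, ..., n."""
    first = math.sqrt(1.0 - rho ** 2) * series[0]
    rest = [series[t] - rho * series[t - 1] for t in range(1, len(series))]
    return [first] + rest


def concentrated_loglik(rho, y, x):
    """Return (log L, beta, sigma2) with beta and sigma2 at their optimum given rho."""
    n = len(y)
    ys, xs = star_transform(y, rho), star_transform(x, rho)
    beta = sum(a * b for a, b in zip(xs, ys)) / sum(a * a for a in xs)   # OLS on starred data
    sigma2 = sum((b - a * beta) ** 2 for a, b in zip(xs, ys)) / n
    ll = (-0.5 * n * math.log(2 * math.pi) - 0.5 * n
          - 0.5 * n * math.log(sigma2) + 0.5 * math.log(1.0 - rho ** 2))
    return ll, beta, sigma2


def grid_search_rho(y, x, step=0.01):
    """Maximize the concentrated log-likelihood over -1 < rho < 1 on a grid."""
    m = round(1 / step)
    best = None
    for k in range(-m + 1, m):
        rho = k * step
        ll, beta, sigma2 = concentrated_loglik(rho, y, x)
        if best is None or ll > best[0]:
            best = (ll, rho, beta, sigma2)
    return best


# Simulated data: beta = 1.5, rho = 0.5, innovation variance 1 (illustrative values)
rng = random.Random(2)
n, beta_true, rho_true = 400, 1.5, 0.5
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
u = [rng.gauss(0.0, 1.0 / math.sqrt(1.0 - rho_true ** 2))]
for _ in range(n - 1):
    u.append(rho_true * u[-1] + rng.gauss(0.0, 1.0))
y = [beta_true * xt + ut for xt, ut in zip(x, u)]

loglik, rho_hat, beta_hat, sigma2_hat = grid_search_rho(y, x)
print(rho_hat, beta_hat, sigma2_hat, loglik)
```

Concentrating out $\beta$ and $\sigma^2$ reduces the problem to a one-dimensional search, which is why a simple grid is feasible here.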
