The Algebra of Least Squares

Theorem 3.2 Important Matrix Expressions

$$\widehat{\beta} = (X'X)^{-1}(X'Y), \qquad \widehat{e} = Y - X\widehat{\beta}, \qquad X'\widehat{e} = 0.$$
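The three expressions in Theorem 3.2 can be checked numerically. The following is a minimal sketch in Python with NumPy on simulated data; the sample size, coefficient values, and variable names are assumptions made for illustration and do not come from the text.

```python
import numpy as np

# Simulated data (an assumption for illustration, not data from the text).
rng = np.random.default_rng(0)
n, k = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
Y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(size=n)

# beta_hat = (X'X)^{-1} (X'Y), obtained by solving the normal equations.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Residual vector e_hat = Y - X beta_hat.
e_hat = Y - X @ beta_hat

# X' e_hat should equal zero up to floating-point rounding.
print(np.max(np.abs(X.T @ e_hat)))
```

Solving the normal equations with `np.linalg.solve` avoids forming the inverse of $X'X$ explicitly, which is usually the more stable choice.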

Early Use of Matrices

The earliest known treatment of the use of matrix methods to solve simultaneous systems is found in Chapter 8 of the Chinese text The Nine Chapters on the Mathematical Art, written by several generations of scholars from the 10th to 2nd century BCE.

3.11 Projection Matrix

Define the matrix

$$P = X(X'X)^{-1}X'.$$

Observe that

$$PX = X(X'X)^{-1}X'X = X.$$

This is a property of a projection matrix. More generally, for any matrix $Z$ which can be written as $Z = X\Gamma$ for some matrix $\Gamma$ (we say that $Z$ lies in the range space of $X$), then

$$PZ = PX\Gamma = X(X'X)^{-1}X'X\Gamma = X\Gamma = Z.$$

As an important example, if we partition the matrix $X$ into two matrices $X_1$ and $X_2$ so that $X = [X_1 \; X_2]$ then $PX_1 = X_1$. (See Exercise 3.7.)

The projection matrix $P$ has the algebraic property that it is idempotent: $PP = P$. See Theorem 3.3.2 below. For the general properties of projection matrices see Section A.11.

The matrix $P$ creates the fitted values in a least squares regression:

$$PY = X(X'X)^{-1}X'Y = X\widehat{\beta} = \widehat{Y}.$$

Because of this property $P$ is also known as the hat matrix.

A special example of a projection matrix occurs when $X = 1_n$ is an $n$-vector of ones. Then

$$P = 1_n(1_n'1_n)^{-1}1_n' = \frac{1}{n}1_n1_n'.$$

Note that in this case

$$PY = 1_n(1_n'1_n)^{-1}1_n'Y = 1_n\overline{Y}$$

creates an $n$-vector whose elements are the sample mean $\overline{Y}$.
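The properties described so far ($PX = X$, $PY = X\widehat{\beta}$, and the vector-of-ones case) can be illustrated with a short NumPy sketch. The helper name `projection` and the simulated data are assumptions introduced here for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 20, 3
X = rng.normal(size=(n, k))   # simulated regressors (illustrative assumption)
Y = rng.normal(size=n)        # simulated outcome (illustrative assumption)

def projection(A):
    """Return A (A'A)^{-1} A' for a matrix A with full column rank."""
    return A @ np.linalg.solve(A.T @ A, A.T)

P = projection(X)
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

px_equals_x = np.allclose(P @ X, X)               # P X = X
hat_property = np.allclose(P @ Y, X @ beta_hat)   # P Y = X beta_hat = Y_hat

# Special case X = 1_n: P Y repeats the sample mean n times.
ones = np.ones((n, 1))
P_ones = projection(ones)
mean_property = np.allclose(P_ones @ Y, np.full(n, Y.mean()))

print(px_equals_x, hat_property, mean_property)
```

Forming the $n \times n$ matrix $P$ explicitly is convenient for checking the algebra, but for large $n$ one would normally compute $X\widehat{\beta}$ directly instead.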

The projection matrix $P$ appears frequently in algebraic manipulations in least squares regression.

The matrix has the following important properties.

Theorem 3.3 The projection matrix $P = X(X'X)^{-1}X'$ for any $n \times k$ matrix $X$ with $n \ge k$ has the following algebraic properties.

1. $P$ is symmetric ($P' = P$).

2. $P$ is idempotent ($PP = P$).

3. $\operatorname{tr} P = k$.

4. The eigenvalues of $P$ are 1 and 0. There are $k$ eigenvalues equalling 1 and $n - k$ equalling 0.

5. $\operatorname{rank}(P) = k$.

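We close this section by proving the claims in Theorem 3.3. Part 1 holds since

$$P' = \left(X(X'X)^{-1}X'\right)' = (X')'\left((X'X)^{-1}\right)'(X)' = X\left((X'X)'\right)^{-1}X' = X\left((X)'(X')'\right)^{-1}X' = P.$$

To establish part 2, the fact that $PX = X$ implies that

$$PP = PX(X'X)^{-1}X' = X(X'X)^{-1}X' = P$$

as claimed. For part 3,

$$\operatorname{tr} P = \operatorname{tr}\left(X(X'X)^{-1}X'\right) = \operatorname{tr}\left((X'X)^{-1}X'X\right) = \operatorname{tr}(I_k) = k.$$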

See Appendix A.5 for definition and properties of the trace operator.

For part 4, it is shown in Appendix A.11 that the eigenvalues $\lambda_i$ of an idempotent matrix are all 1 and 0. Since $\operatorname{tr} P$ equals the sum of the $n$ eigenvalues and $\operatorname{tr} P = k$ by part 3, it follows that there are $k$ eigenvalues equalling 1 and the remainder $n - k$ equalling 0.

For part 5, observe that $P$ is positive semi-definite since its eigenvalues are all non-negative. By Theorem A.4.5 its rank equals the number of positive eigenvalues, which is $k$ as claimed.
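The five claims of Theorem 3.3 can also be verified numerically for a particular $X$. The sketch below uses a randomly generated $12 \times 4$ matrix; the dimensions, seed, and tolerances are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 12, 4
X = rng.normal(size=(n, k))               # simulated X with full column rank
P = X @ np.linalg.solve(X.T @ X, X.T)     # P = X (X'X)^{-1} X'

symmetric = np.allclose(P, P.T)           # part 1
idempotent = np.allclose(P @ P, P)        # part 2
trace = np.trace(P)                       # part 3: should equal k

# Part 4: eigvalsh expects a symmetric matrix, so symmetrize away rounding noise.
eigenvalues = np.linalg.eigvalsh((P + P.T) / 2)
n_ones = int(np.sum(np.isclose(eigenvalues, 1.0)))
n_zeros = int(np.sum(np.isclose(eigenvalues, 0.0, atol=1e-8)))

# Part 5: numerical rank, with an explicit tolerance for the zero singular values.
rank = np.linalg.matrix_rank(P, tol=1e-8)

print(symmetric, idempotent, trace, n_ones, n_zeros, rank)
```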

3.12 Annihilator Matrix

Define

$$M = I_n - P = I_n - X(X'X)^{-1}X'$$

where $I_n$ is the $n \times n$ identity matrix. Note that

$$MX = (I_n - P)X = X - PX = X - X = 0. \tag{3.21}$$

Thus $M$ and $X$ are orthogonal. We call $M$ the annihilator matrix due to the property that for any matrix $Z$ in the range space of $X$ then

$$MZ = Z - PZ = 0.$$

For example, $MX_1 = 0$ for any subcomponent $X_1$ of $X$, and $MP = 0$ (see Exercise 3.7).

The annihilator matrix $M$ has similar properties with $P$, including that $M$ is symmetric ($M' = M$) and idempotent ($MM = M$). It is thus a projection matrix. Similarly to Theorem 3.3.3 we can calculate

$$\operatorname{tr} M = n - k. \tag{3.22}$$

(See Exercise 3.9.) One implication is that the rank of $M$ is $n - k$.

While $P$ creates fitted values, $M$ creates least squares residuals:

$$MY = Y - PY = Y - X\widehat{\beta} = \widehat{e}. \tag{3.23}$$

As discussed in the previous section, a special example of a projection matrix occurs when $X = 1_n$ is an $n$-vector of ones, so that $P = 1_n(1_n'1_n)^{-1}1_n'$. The associated annihilator matrix is

$$M = I_n - P = I_n - 1_n(1_n'1_n)^{-1}1_n'.$$

While $P$ creates a vector of sample means, $M$ creates demeaned values:

$$MY = Y - 1_n\overline{Y}.$$

For simplicity we will often write the right-hand side as $Y - \overline{Y}$. The $i$th element is $Y_i - \overline{Y}$, the demeaned value of $Y_i$.

We can also use (3.23) to write an alternative expression for the residual vector. Substituting $Y = X\beta + e$ into $\widehat{e} = MY$ and using $MX = 0$ we find

$$\widehat{e} = MY = M(X\beta + e) = Me \tag{3.24}$$

which is free of dependence on the regression coefficient $\beta$.
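Equations (3.21) through (3.24) can be illustrated by simulation, where the error $e$ is known because we generate it ourselves. In the sketch below the coefficient vector `beta`, the sample size, and the dictionary `checks` are assumptions introduced for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 30, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
beta = np.array([0.5, 1.0, -2.0])   # "true" coefficient, assumed for the simulation
e = rng.normal(size=n)              # error, observable here only because it is simulated
Y = X @ beta + e

P = X @ np.linalg.solve(X.T @ X, X.T)
M = np.eye(n) - P                   # annihilator matrix M = I_n - P
e_hat = Y - X @ np.linalg.solve(X.T @ X, X.T @ Y)

checks = {
    "M X = 0 (3.21)": np.allclose(M @ X, 0),
    "M P = 0": np.allclose(M @ P, 0),
    "tr M = n - k (3.22)": np.isclose(np.trace(M), n - k),
    "M Y = e_hat (3.23)": np.allclose(M @ Y, e_hat),
    "e_hat = M e (3.24)": np.allclose(e_hat, M @ e),
}

# Special case X = 1_n: M Y is the demeaned vector Y - mean(Y).
ones = np.ones((n, 1))
M_ones = np.eye(n) - ones @ ones.T / n
checks["demeaning"] = np.allclose(M_ones @ Y, Y - Y.mean())

for name, ok in checks.items():
    print(name, ok)
```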

3.13 Estimation of Error Variance

The error variance $\sigma^2 = \mathbb{E}[e^2]$ is a moment, so a natural estimator is a moment estimator. If $e_i$ were observed we would estimate $\sigma^2$ by

$$\widetilde{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} e_i^2. \tag{3.25}$$

However, this is infeasible as $e_i$ is not observed. In this case it is common to take a two-step approach to estimation. The residuals $\widehat{e}_i$ are calculated in the first step, and then we substitute $\widehat{e}_i$ for $e_i$ in expression (3.25) to obtain the feasible estimator

$$\widehat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} \widehat{e}_i^2. \tag{3.26}$$

In matrix notation, we can write (3.25) and (3.26) as $\widetilde{\sigma}^2 = n^{-1}e'e$ and

$$\widehat{\sigma}^2 = n^{-1}\widehat{e}'\widehat{e}. \tag{3.27}$$

Recall the expressions $\widehat{e} = MY = Me$ from (3.23) and (3.24). Applied to (3.27) we find

$$\widehat{\sigma}^2 = n^{-1}\widehat{e}'\widehat{e} = n^{-1}e'MMe = n^{-1}e'Me \tag{3.28}$$

the third equality since $MM = M$.

An interesting implication is that

$$\widetilde{\sigma}^2 - \widehat{\sigma}^2 = n^{-1}e'e - n^{-1}e'Me = n^{-1}e'Pe \ge 0.$$

The final inequality holds because $P$ is positive semi-definite and $e'Pe$ is a quadratic form. This shows that the feasible estimator $\widehat{\sigma}^2$ is numerically no larger than the idealized estimator (3.25).
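The relationship between the idealized and feasible estimators can be seen in a simulation, where the errors are available. The following sketch assumes simulated data with unit coefficients; none of the numerical choices come from the text.

```python
import numpy as np

rng = np.random.default_rng(4)
n, k = 40, 5
X = rng.normal(size=(n, k))
e = rng.normal(size=n)           # errors, known only because the data are simulated
Y = X @ np.ones(k) + e           # coefficient vector of ones is an assumption

P = X @ np.linalg.solve(X.T @ X, X.T)
M = np.eye(n) - P
e_hat = M @ Y                    # residuals, by (3.23)

sigma2_tilde = e @ e / n         # idealized estimator (3.25), infeasible in practice
sigma2_hat = e_hat @ e_hat / n   # feasible estimator (3.26)

gap = sigma2_tilde - sigma2_hat
quad_form = e @ P @ e / n        # n^{-1} e'Pe

print(sigma2_tilde, sigma2_hat, gap, quad_form)
```

The gap equals the quadratic form $n^{-1}e'Pe$, so it grows with the part of the error vector that happens to lie in the range space of $X$.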

3.14 Analysis of Variance

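Another way of writing (3.23) is

$$Y = PY + MY = \widehat{Y} + \widehat{e}. \tag{3.29}$$

This decomposition is orthogonal, that is

$$\widehat{Y}'\widehat{e} = (PY)'(MY) = Y'PMY = 0. \tag{3.30}$$

It follows that

$$Y'Y = \widehat{Y}'\widehat{Y} + 2\widehat{Y}'\widehat{e} + \widehat{e}'\widehat{e} = \widehat{Y}'\widehat{Y} + \widehat{e}'\widehat{e}$$

or

$$\sum_{i=1}^{n} Y_i^2 = \sum_{i=1}^{n} \widehat{Y}_i^2 + \sum_{i=1}^{n} \widehat{e}_i^2.$$

Subtracting $\overline{Y}$ from both sides of (3.29) we obtain

$$Y - 1_n\overline{Y} = \widehat{Y} - 1_n\overline{Y} + \widehat{e}.$$

This decomposition is also orthogonal when $X$ contains a constant, as

$$\left(\widehat{Y} - 1_n\overline{Y}\right)'\widehat{e} = \widehat{Y}'\widehat{e} - \overline{Y}1_n'\widehat{e} = 0$$

under (3.17). It follows that

$$\left(Y - 1_n\overline{Y}\right)'\left(Y - 1_n\overline{Y}\right) = \left(\widehat{Y} - 1_n\overline{Y}\right)'\left(\widehat{Y} - 1_n\overline{Y}\right) + \widehat{e}'\widehat{e}$$

or

$$\sum_{i=1}^{n}\left(Y_i - \overline{Y}\right)^2 = \sum_{i=1}^{n}\left(\widehat{Y}_i - \overline{Y}\right)^2 + \sum_{i=1}^{n}\widehat{e}_i^2.$$

This is commonly called the analysis-of-variance formula for least squares regression.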

A commonly reported statistic is the coefficient of determination or R-squared:

$$R^2 = \frac{\sum_{i=1}^{n}\left(\widehat{Y}_i - \overline{Y}\right)^2}{\sum_{i=1}^{n}\left(Y_i - \overline{Y}\right)^2} = 1 - \frac{\sum_{i=1}^{n}\widehat{e}_i^2}{\sum_{i=1}^{n}\left(Y_i - \overline{Y}\right)^2}.$$

It is often described as “the fraction of the sample variance of $Y$ which is explained by the least squares fit”. $R^2$ is a crude measure of regression fit. We have better measures of fit, but these require a statistical (not just algebraic) analysis and we will return to these issues later. One deficiency with $R^2$ is that it increases when regressors are added to a regression (see Exercise 3.16) so the “fit” can always be increased by increasing the number of regressors.

The coefficient of determination was introduced by Wright (1921).
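The analysis-of-variance formula and the two expressions for $R^2$ can be checked on simulated data. The sketch below is illustrative only: the helper functions `fit` and `r_squared`, the coefficient values, and the extra irrelevant regressor are assumptions introduced here. The regressor matrix includes a constant, as the centered decomposition requires.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 60
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # includes a constant
Y = X @ np.array([1.0, 0.8, -0.3]) + rng.normal(size=n)

def fit(X, Y):
    """Return least squares fitted values and residuals."""
    beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
    Y_hat = X @ beta_hat
    return Y_hat, Y - Y_hat

def r_squared(X, Y):
    """R^2 = 1 - sum(e_hat^2) / sum((Y - mean(Y))^2)."""
    _, e_hat = fit(X, Y)
    return 1.0 - (e_hat @ e_hat) / np.sum((Y - Y.mean()) ** 2)

Y_hat, e_hat = fit(X, Y)
total_ss = np.sum((Y - Y.mean()) ** 2)
explained_ss = np.sum((Y_hat - Y.mean()) ** 2)
residual_ss = e_hat @ e_hat

r2_explained = explained_ss / total_ss
r2_residual = 1.0 - residual_ss / total_ss

# Adding a regressor unrelated to Y cannot lower R^2.
X_bigger = np.column_stack([X, rng.normal(size=n)])
r2_bigger = r_squared(X_bigger, Y)

print(r2_explained, r2_residual, r2_bigger)
```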

3.15 Projections

One way to visualize least squares fitting is as a projection operation.

Write the regressor matrix as $X = [X_1 \; X_2 \; \ldots \; X_k]$ where $X_j$ is the $j$th column of $X$. The range space $\mathcal{R}(X)$ of $X$ is the space consisting of all linear combinations of the columns $X_1, X_2, \ldots, X_k$. $\mathcal{R}(X)$ is a $k$-dimensional surface contained in $\mathbb{R}^n$. If $k = 2$ then $\mathcal{R}(X)$ is a plane. The operator $P = X(X'X)^{-1}X'$ projects vectors onto $\mathcal{R}(X)$. The fitted values $\widehat{Y} = PY$ are the projection of $Y$ onto $\mathcal{R}(X)$.

To visualize examine Figure 3.3. This displays the case $n = 3$ and $k = 2$. Displayed are three vectors $Y$, $X_1$, and $X_2$, which are each elements of $\mathbb{R}^3$. The plane created by $X_1$ and $X_2$ is the range space $\mathcal{R}(X)$.

Regression fitted values are linear combinations of $X_1$ and $X_2$ and so lie on this plane. The fitted value $\widehat{Y}$ is the vector on this plane closest to $Y$. The residual $\widehat{e} = Y - \widehat{Y}$ is the difference between the two. The angle between the vectors $\widehat{Y}$ and $\widehat{e}$ is $90^\circ$, and therefore they are orthogonal as shown.

Figure 3.3: Projection of $Y$ onto $X_1$ and $X_2$
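The geometry of the case $n = 3$ and $k = 2$ can be reproduced with three concrete vectors. The values below are arbitrary choices made for illustration (they are not the vectors drawn in Figure 3.3), and `b_other` is a hypothetical alternative coefficient used only to show that other points in the plane are farther from $Y$.

```python
import numpy as np

# Three vectors in R^3, chosen arbitrarily for illustration.
X1 = np.array([1.0, 0.0, 0.0])
X2 = np.array([1.0, 2.0, 0.0])
Y = np.array([2.0, 1.0, 3.0])
X = np.column_stack([X1, X2])

P = X @ np.linalg.solve(X.T @ X, X.T)
Y_hat = P @ Y          # projection of Y onto the plane spanned by X1 and X2
e_hat = Y - Y_hat      # residual, perpendicular to that plane

# Angle between Y_hat and e_hat, in degrees.
cos_angle = (Y_hat @ e_hat) / (np.linalg.norm(Y_hat) * np.linalg.norm(e_hat))
angle_degrees = np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))

# Any other point X b in the plane is farther from Y than Y_hat is.
b_other = np.array([0.3, 0.9])   # hypothetical alternative coefficient
dist_hat = np.linalg.norm(Y - Y_hat)
dist_other = np.linalg.norm(Y - X @ b_other)

print(Y_hat, e_hat, angle_degrees, dist_hat, dist_other)
```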

3.16 Regression Components

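Partition $X = [X_1 \; X_2]$ and $\beta = (\beta_1, \beta_2)$. The regression model can be written as

$$Y = X_1\beta_1 + X_2\beta_2 + e. \tag{3.31}$$

The OLS estimator of $\beta = (\beta_1', \beta_2')'$ is obtained by regression of $Y$ on $X = [X_1 \; X_2]$ and can be written as

$$Y = X\widehat{\beta} + \widehat{e} = X_1\widehat{\beta}_1 + X_2\widehat{\beta}_2 + \widehat{e}. \tag{3.32}$$

We are interested in algebraic expressions for $\widehat{\beta}_1$ and $\widehat{\beta}_2$.

Let’s first focus on $\widehat{\beta}_1$. The least squares estimator by definition is found by the joint minimization

$$\left(\widehat{\beta}_1, \widehat{\beta}_2\right) = \underset{\beta_1, \beta_2}{\operatorname{argmin}} \; SSE(\beta_1, \beta_2) \tag{3.33}$$

where

$$SSE(\beta_1, \beta_2) = (Y - X_1\beta_1 - X_2\beta_2)'(Y - X_1\beta_1 - X_2\beta_2).$$

An equivalent expression for $\widehat{\beta}_1$ can be obtained by concentration (nested minimization). The solution (3.33) can be written as

$$\widehat{\beta}_1 = \underset{\beta_1}{\operatorname{argmin}} \left( \min_{\beta_2} SSE(\beta_1, \beta_2) \right). \tag{3.34}$$

The inner expression $\min_{\beta_2} SSE(\beta_1, \beta_2)$ minimizes over $\beta_2$ while holding $\beta_1$ fixed. It is the lowest possible sum of squared errors given $\beta_1$. The outer minimization $\operatorname{argmin}_{\beta_1}$ finds the coefficient $\beta_1$ which minimizes the “lowest possible sum of squared errors given $\beta_1$”. This means that $\widehat{\beta}_1$ as defined in (3.33) and (3.34) are algebraically identical.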

Examine the inner minimization problem in (3.34). This is simply the least squares regression of $Y - X_1\beta_1$ on $X_2$. This has solution

$$\underset{\beta_2}{\operatorname{argmin}} \; SSE(\beta_1, \beta_2) = (X_2'X_2)^{-1}\left(X_2'(Y - X_1\beta_1)\right)$$

with residuals

$$Y - X_1\beta_1 - X_2(X_2'X_2)^{-1}\left(X_2'(Y - X_1\beta_1)\right) = (M_2Y - M_2X_1\beta_1) = M_2(Y - X_1\beta_1)$$

where

$$M_2 = I_n - X_2(X_2'X_2)^{-1}X_2' \tag{3.35}$$

is the annihilator matrix for $X_2$. This means that the inner minimization problem (3.34) has minimized value

$$\min_{\beta_2} SSE(\beta_1, \beta_2) = (Y - X_1\beta_1)'M_2M_2(Y - X_1\beta_1) = (Y - X_1\beta_1)'M_2(Y - X_1\beta_1)$$

where the second equality holds since $M_2$ is idempotent. Substituting this into (3.34) we find

$$\widehat{\beta}_1 = \underset{\beta_1}{\operatorname{argmin}} \; (Y - X_1\beta_1)'M_2(Y - X_1\beta_1) = (X_1'M_2X_1)^{-1}(X_1'M_2Y).$$

By a similar argument we find

$$\widehat{\beta}_2 = (X_2'M_1X_2)^{-1}(X_2'M_1Y)$$

where

$$M_1 = I_n - X_1(X_1'X_1)^{-1}X_1' \tag{3.36}$$

is the annihilator matrix for $X_1$.

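Theorem 3.4 The least squares estimator $\left(\widehat{\beta}_1, \widehat{\beta}_2\right)$ for (3.32) has the algebraic solution

$$\widehat{\beta}_1 = (X_1'M_2X_1)^{-1}(X_1'M_2Y) \tag{3.37}$$

$$\widehat{\beta}_2 = (X_2'M_1X_2)^{-1}(X_2'M_1Y) \tag{3.38}$$

where $M_1$ and $M_2$ are defined in (3.36) and (3.35), respectively.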
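Theorem 3.4 can be checked by comparing the component formulas (3.37) and (3.38) with the corresponding blocks of the full least squares estimator. The following is a minimal sketch on simulated data; the helper `annihilator`, the dimensions, and the coefficient values are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
n, k1, k2 = 80, 2, 3
X1 = np.column_stack([np.ones(n), rng.normal(size=(n, k1 - 1))])
X2 = rng.normal(size=(n, k2)) + 0.5 * X1[:, [1]]   # made correlated with X1 on purpose
Y = X1 @ np.array([1.0, 2.0]) + X2 @ np.array([0.5, -1.0, 0.25]) + rng.normal(size=n)

def annihilator(A):
    """Return I_n - A (A'A)^{-1} A'."""
    return np.eye(A.shape[0]) - A @ np.linalg.solve(A.T @ A, A.T)

# Full regression of Y on X = [X1 X2].
X = np.column_stack([X1, X2])
beta_full = np.linalg.solve(X.T @ X, X.T @ Y)

# Component formulas (3.37) and (3.38).
M1, M2 = annihilator(X1), annihilator(X2)
beta1 = np.linalg.solve(X1.T @ M2 @ X1, X1.T @ M2 @ Y)
beta2 = np.linalg.solve(X2.T @ M1 @ X2, X2.T @ M1 @ Y)

print(np.allclose(beta1, beta_full[:k1]), np.allclose(beta2, beta_full[k1:]))
```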

3.17 Regression Components (Alternative Derivation)*

An alternative proof of Theorem 3.4 uses an algebraic argument based on the population calculations from Section 2.22. Since this is a classic derivation we present it here for completeness.

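Partition $\widehat{Q}_{XX}$ as

$$\widehat{Q}_{XX} = \begin{bmatrix} \widehat{Q}_{11} & \widehat{Q}_{12} \\ \widehat{Q}_{21} & \widehat{Q}_{22} \end{bmatrix} = \begin{bmatrix} \frac{1}{n}X_1'X_1 & \frac{1}{n}X_1'X_2 \\ \frac{1}{n}X_2'X_1 & \frac{1}{n}X_2'X_2 \end{bmatrix}$$

and similarly $\widehat{Q}_{XY}$ as

$$\widehat{Q}_{XY} = \begin{bmatrix} \widehat{Q}_{1Y} \\ \widehat{Q}_{2Y} \end{bmatrix} = \begin{bmatrix} \frac{1}{n}X_1'Y \\ \frac{1}{n}X_2'Y \end{bmatrix}.$$

By the partitioned matrix inversion formula (A.3)

$$\widehat{Q}_{XX}^{-1} = \begin{bmatrix} \widehat{Q}_{11} & \widehat{Q}_{12} \\ \widehat{Q}_{21} & \widehat{Q}_{22} \end{bmatrix}^{-1} \overset{\text{def}}{=} \begin{bmatrix} \widehat{Q}^{11} & \widehat{Q}^{12} \\ \widehat{Q}^{21} & \widehat{Q}^{22} \end{bmatrix} = \begin{bmatrix} \widehat{Q}_{11\cdot 2}^{-1} & -\widehat{Q}_{11\cdot 2}^{-1}\widehat{Q}_{12}\widehat{Q}_{22}^{-1} \\ -\widehat{Q}_{22\cdot 1}^{-1}\widehat{Q}_{21}\widehat{Q}_{11}^{-1} & \widehat{Q}_{22\cdot 1}^{-1} \end{bmatrix} \tag{3.39}$$

where $\widehat{Q}_{11\cdot 2} = \widehat{Q}_{11} - \widehat{Q}_{12}\widehat{Q}_{22}^{-1}\widehat{Q}_{21}$ and $\widehat{Q}_{22\cdot 1} = \widehat{Q}_{22} - \widehat{Q}_{21}\widehat{Q}_{11}^{-1}\widehat{Q}_{12}$. Thus

$$\widehat{\beta} = \begin{pmatrix} \widehat{\beta}_1 \\ \widehat{\beta}_2 \end{pmatrix} = \begin{bmatrix} \widehat{Q}_{11\cdot 2}^{-1} & -\widehat{Q}_{11\cdot 2}^{-1}\widehat{Q}_{12}\widehat{Q}_{22}^{-1} \\ -\widehat{Q}_{22\cdot 1}^{-1}\widehat{Q}_{21}\widehat{Q}_{11}^{-1} & \widehat{Q}_{22\cdot 1}^{-1} \end{bmatrix} \begin{bmatrix} \widehat{Q}_{1Y} \\ \widehat{Q}_{2Y} \end{bmatrix} = \begin{pmatrix} \widehat{Q}_{11\cdot 2}^{-1}\widehat{Q}_{1Y\cdot 2} \\ \widehat{Q}_{22\cdot 1}^{-1}\widehat{Q}_{2Y\cdot 1} \end{pmatrix}.$$

Now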

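$$\begin{aligned} \widehat{Q}_{11\cdot 2} &= \widehat{Q}_{11} - \widehat{Q}_{12}\widehat{Q}_{22}^{-1}\widehat{Q}_{21} \\ &= \frac{1}{n}X_1'X_1 - \frac{1}{n}X_1'X_2\left(\frac{1}{n}X_2'X_2\right)^{-1}\frac{1}{n}X_2'X_1 \\ &= \frac{1}{n}X_1'M_2X_1 \end{aligned}$$

and

$$\begin{aligned} \widehat{Q}_{1Y\cdot 2} &= \widehat{Q}_{1Y} - \widehat{Q}_{12}\widehat{Q}_{22}^{-1}\widehat{Q}_{2Y} \\ &= \frac{1}{n}X_1'Y - \frac{1}{n}X_1'X_2\left(\frac{1}{n}X_2'X_2\right)^{-1}\frac{1}{n}X_2'Y \\ &= \frac{1}{n}X_1'M_2Y. \end{aligned}$$

Equation (3.37) follows.

Similarly to the calculation for $\widehat{Q}_{11\cdot 2}$ and $\widehat{Q}_{1Y\cdot 2}$ you can show that $\widehat{Q}_{2Y\cdot 1} = \frac{1}{n}X_2'M_1Y$ and $\widehat{Q}_{22\cdot 1} = \frac{1}{n}X_2'M_1X_2$. This establishes (3.38). Together, this is Theorem 3.4.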
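The partitioned inversion formula and the identities for $\widehat{Q}_{11\cdot 2}$ and $\widehat{Q}_{1Y\cdot 2}$ can be verified numerically. The sketch below builds the blocks from simulated data; all variable names and numerical values are assumptions introduced for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
n, k1, k2 = 100, 2, 2
X1 = rng.normal(size=(n, k1))
X2 = rng.normal(size=(n, k2)) + 0.4 * X1     # correlated blocks, by construction
Y = X1 @ np.array([1.0, -1.0]) + X2 @ np.array([0.5, 2.0]) + rng.normal(size=n)
inv = np.linalg.inv

# Sample moment blocks.
Q11, Q12 = X1.T @ X1 / n, X1.T @ X2 / n
Q21, Q22 = X2.T @ X1 / n, X2.T @ X2 / n
Q1Y, Q2Y = X1.T @ Y / n, X2.T @ Y / n

Q11_2 = Q11 - Q12 @ inv(Q22) @ Q21           # Q_{11.2}
Q22_1 = Q22 - Q21 @ inv(Q11) @ Q12           # Q_{22.1}
Q1Y_2 = Q1Y - Q12 @ inv(Q22) @ Q2Y           # Q_{1Y.2}

# Right-hand side of the partitioned inverse versus direct inversion.
block_inverse = np.block([
    [inv(Q11_2), -inv(Q11_2) @ Q12 @ inv(Q22)],
    [-inv(Q22_1) @ Q21 @ inv(Q11), inv(Q22_1)],
])
direct_inverse = inv(np.block([[Q11, Q12], [Q21, Q22]]))

# Compare with the annihilator-matrix expressions and the full OLS estimate.
M2 = np.eye(n) - X2 @ np.linalg.solve(X2.T @ X2, X2.T)
X = np.column_stack([X1, X2])
beta_full = np.linalg.solve(X.T @ X, X.T @ Y)
beta1 = np.linalg.solve(Q11_2, Q1Y_2)

print(np.allclose(block_inverse, direct_inverse), np.allclose(beta1, beta_full[:k1]))
```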

3.18 Residual Regression

As first recognized by Frisch and Waugh (1933) and extended by Lovell (1963), expressions (3.37) and (3.38) can be used to show that the least squares estimators $\widehat{\beta}_1$ and $\widehat{\beta}_2$ can be found by a two-step regression procedure.

Take (3.38). Since $M_1$ is idempotent, $M_1 = M_1M_1$ and thus

$$\widehat{\beta}_2 = (X_2'M_1X_2)^{-1}(X_2'M_1Y) = (X_2'M_1M_1X_2)^{-1}(X_2'M_1M_1Y) = \left(\widetilde{X}_2'\widetilde{X}_2\right)^{-1}\left(\widetilde{X}_2'\widetilde{e}_1\right)$$

where $\widetilde{X}_2 = M_1X_2$ and $\widetilde{e}_1 = M_1Y$.

Thus the coefficient estimator $\widehat{\beta}_2$ is algebraically equal to the least squares regression of $\widetilde{e}_1$ on $\widetilde{X}_2$. Notice that these two are $Y$ and $X_2$, respectively, premultiplied by $M_1$. But we know that premultiplication by $M_1$ creates least squares residuals. Therefore $\widetilde{e}_1$ is simply the least squares residual from a regression of $Y$ on $X_1$, and the columns of $\widetilde{X}_2$ are the least squares residuals from the regressions of the columns of $X_2$ on $X_1$.
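The two-step procedure can be sketched in a few lines. The example below uses simulated data and a small helper `ols` built on `np.linalg.lstsq`; the helper name, dimensions, and coefficient values are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(8)
n, k1, k2 = 120, 2, 2
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])
X2 = rng.normal(size=(n, k2)) + 0.6 * X1[:, [1]]   # correlated with X1 by construction
Y = X1 @ np.array([1.0, 0.5]) + X2 @ np.array([-1.0, 2.0]) + rng.normal(size=n)

def ols(X, Y):
    """Least squares coefficients from regressing Y (vector or matrix) on X."""
    return np.linalg.lstsq(X, Y, rcond=None)[0]

# Step 1: residuals from regressing Y, and each column of X2, on X1.
e1_tilde = Y - X1 @ ols(X1, Y)       # M1 Y
X2_tilde = X2 - X1 @ ols(X1, X2)     # M1 X2, column by column

# Step 2: regress the residualized Y on the residualized X2.
beta2_two_step = ols(X2_tilde, e1_tilde)

# Benchmark: the X2 block of the full regression of Y on [X1 X2].
beta2_full = ols(np.column_stack([X1, X2]), Y)[k1:]

print(beta2_two_step, beta2_full)
```

The two-step route never forms the $n \times n$ matrix $M_1$; it only uses residuals from auxiliary regressions, which is how the procedure is typically carried out in practice.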

We have proven the following theorem.
