
Information criteria for redundancy models in canonical correlation analysis

Shingo Kanda
Department of Mathematics

1 Introduction

In discriminant analysis, Rao's [8] additional information hypothesis is known as a hypothesis concerning the relevance of a specified subset of variables. Siotani, Hayakawa and Fujikoshi [9] formulated the corresponding hypothesis in canonical correlation analysis. This paper deals with the problem of selecting the best subset of variables under that hypothesis, and we call the model specified by an additional information hypothesis a redundancy model.

Because a redundancy model is defined by a particular covariance matrix structure, we treat the problem as one of estimating the covariance matrix structure.

2 One set of redundancy models

Let $x_u = (x_{u1}, \ldots, x_{up})'$ and $x_v = (x_{v1}, \ldots, x_{vq})'$ be two random vectors jointly distributed as a $(p+q)$-variate normal distribution with the following mean vector and covariance matrix:

$$\mu = \begin{pmatrix} \mu_u \\ \mu_v \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \Sigma_{uu} & \Sigma_{uv} \\ \Sigma_{vu} & \Sigma_{vv} \end{pmatrix}.$$

Let $\bar{x}_u$, $\bar{x}_v$ and $S$ be the sample mean vectors and the sample covariance matrix computed from $N = n + 1$ observations $X_u = [x_{u1}, \ldots, x_{uN}]$ and $X_v = [x_{v1}, \ldots, x_{vN}]$ on $x_u$ and $x_v$. Our problem is to find the best subset of $\{x_{u1}, \ldots, x_{up}\}$ when we want to predict $x_v$ from $x_u$, and we consider the case $x_1' = \{x_{u1}, \ldots, x_{uk}\} \subset \{x_{u1}, \ldots, x_{up}\}$. Corresponding to the partition $x_u = (x_1', x_2')'$, we write

$$\mu_u = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \qquad \Sigma_{uu} = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}, \qquad \Sigma_{uv} = \begin{pmatrix} \Sigma_{1v} \\ \Sigma_{2v} \end{pmatrix}.$$

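Similar notation is used for the submatrices of $S$. We consider a selection method based on AIC and introduce a redundancy model $M_k$ through the hypothesis that $x_2$ carries no additional information about $x_v$ in the presence of $x_1$, i.e.,

$$M_k : \Sigma_{2v\cdot 1} = 0,$$

where

$$\begin{pmatrix} \Sigma_{vv\cdot 1} & \Sigma_{v2\cdot 1} \\ \Sigma_{2v\cdot 1} & \Sigma_{22\cdot 1} \end{pmatrix} = \begin{pmatrix} \Sigma_{vv} & \Sigma_{v2} \\ \Sigma_{2v} & \Sigma_{22} \end{pmatrix} - \begin{pmatrix} \Sigma_{v1} \\ \Sigma_{21} \end{pmatrix} \Sigma_{11}^{-1} \begin{pmatrix} \Sigma_{1v} & \Sigma_{12} \end{pmatrix}.$$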
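As a concrete illustration of the hypothesis $M_k$, the following sketch computes $\Sigma_{2v\cdot 1}$ for a small covariance matrix built so that $x_2$ and $x_v$ are related only through $x_1$. The helper name `partial_cov`, the dimensions and the loading matrices `A` and `C` are illustrative assumptions and do not come from the original text.

```python
import numpy as np

def partial_cov(sigma, a, b, c):
    """Return Sigma_{ab.c} = Sigma_ab - Sigma_ac Sigma_cc^{-1} Sigma_cb."""
    s_ab = sigma[np.ix_(a, b)]
    s_ac = sigma[np.ix_(a, c)]
    s_cc = sigma[np.ix_(c, c)]
    s_cb = sigma[np.ix_(c, b)]
    return s_ab - s_ac @ np.linalg.solve(s_cc, s_cb)

# Hypothetical sizes: k = 2 retained variables, p = 3, q = 2.
k, p, q = 2, 3, 2
rng = np.random.default_rng(0)
A = rng.normal(size=(p - k, k))  # assumed model: x2 = A x1 + e2
C = rng.normal(size=(q, k))      # assumed model: xv = C x1 + ev

# Covariance of (x1, x2, xv) when x1 ~ N(0, I) and e2, ev are
# independent unit-variance noise terms.
T = np.vstack([np.eye(k), A, C])
sigma = T @ T.T + np.diag([0.0] * k + [1.0] * (p - k + q))

i1 = np.arange(k)
i2 = np.arange(k, p)
iv = np.arange(p, p + q)
sigma_2v_1 = partial_cov(sigma, i2, iv, i1)
print(np.round(sigma_2v_1, 12))
```

In this construction `sigma_2v_1` vanishes up to rounding, so the matrix `sigma` is one example of a covariance structure that satisfies $M_k$.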

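Let $\hat{\mu}$ and $\hat{\Sigma}_k$ be the unbiased estimators of $\mu$ and $\Sigma$ under $M_k$, respectively. Using these estimators, it is known that

$$R_k = -n \log\left( \frac{\begin{vmatrix} S_{vv\cdot 1} & S_{v2\cdot 1} \\ S_{2v\cdot 1} & S_{22\cdot 1} \end{vmatrix}}{|S_{vv\cdot 1}| \cdot |S_{22\cdot 1}|} \right) + b_k.$$

Let $E^{(0)}$ denote expectation under the true model $M_0$. Then the bias term is

$$b_k = E^{(0)}\left[ n\,\mathrm{tr}\,\hat{\Sigma}_k^{-1}\Sigma \right] - n(p+q), \qquad (1)$$

where we note that

$$\mathrm{tr}\,\hat{\Sigma}_k^{-1}\Sigma = \mathrm{tr}\begin{pmatrix} S_{11} & S_{12} \\ S_{21} & S_{22} \end{pmatrix}^{-1}\begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} + \mathrm{tr}\begin{pmatrix} S_{11} & S_{1v} \\ S_{v1} & S_{vv} \end{pmatrix}^{-1}\begin{pmatrix} \Sigma_{11} & \Sigma_{1v} \\ \Sigma_{v1} & \Sigma_{vv} \end{pmatrix} - \mathrm{tr}\,S_{11}^{-1}\Sigma_{11}.$$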
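The trace decomposition can be checked numerically. The sketch below assumes that $\hat{\Sigma}_k$ agrees with $S$ except in the $(x_2, x_v)$ block, which is set to $S_{21}S_{11}^{-1}S_{1v}$ so that the fitted $\Sigma_{2v\cdot 1}$ vanishes. This construction, the function names and the random test matrices are assumptions made for illustration, with variables ordered as $(x_1, x_2, x_v)$.

```python
import numpy as np

def sigma_hat_k(S, k, p):
    """Copy of S with the (x2, xv) block replaced by S21 S11^{-1} S1v."""
    m = S.shape[0]
    i1, i2, iv = np.arange(k), np.arange(k, p), np.arange(p, m)
    block = S[np.ix_(i2, i1)] @ np.linalg.solve(S[np.ix_(i1, i1)],
                                                S[np.ix_(i1, iv)])
    fit = S.copy()
    fit[np.ix_(i2, iv)] = block
    fit[np.ix_(iv, i2)] = block.T
    return fit

def trace_decomposition(S, Sigma, k, p):
    """Right-hand side: tr over (1,2) plus tr over (1,v) minus tr over (1)."""
    m = S.shape[0]
    i1 = np.arange(k)
    i12 = np.arange(p)
    i1v = np.r_[i1, np.arange(p, m)]

    def tr(idx):
        sub_s = S[np.ix_(idx, idx)]
        sub_sigma = Sigma[np.ix_(idx, idx)]
        return np.trace(np.linalg.solve(sub_s, sub_sigma))

    return tr(i12) + tr(i1v) - tr(i1)

# Hypothetical sizes and arbitrary positive definite test matrices.
k, p, q = 2, 4, 3
m = p + q
rng = np.random.default_rng(0)
G = rng.normal(size=(m, 4 * m))
H = rng.normal(size=(m, 4 * m))
S = G @ G.T / (4 * m)
Sigma = H @ H.T / (4 * m)

lhs = np.trace(np.linalg.solve(sigma_hat_k(S, k, p), Sigma))
rhs = trace_decomposition(S, Sigma, k, p)
print(lhs, rhs)
```

Under the stated assumption the two printed values agree up to rounding for any positive definite `S` and `Sigma`, which is why the bias term only requires inverses of three sub-blocks of $S$.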

Therefore, AIC and MAIC are given as follows:

$$\mathrm{AIC}_k = L(S, \hat{\Sigma}_k) + (p+q)(p+q+1) - 2q(p-k).$$

If the candidate model includes the true model, then

$$\mathrm{MAIC}_k = L(S, \hat{\Sigma}_k) - n(p+q) + n^2\left( \frac{k+q}{n-k-q-1} + \frac{p}{n-p-1} - \frac{k}{n-k-1} \right),$$

as shown by Fujikoshi [4].
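The correction term in $\mathrm{MAIC}_k$ can be recovered from (1) and the trace decomposition, as the following sketch indicates. It assumes that the true model is normal, so that $nS$ follows the Wishart distribution $W_{p+q}(n, \Sigma)$, and it uses the standard mean of an inverse Wishart matrix. The index set $A$ and its size $m$ are symbols introduced only for this sketch.

```latex
% Assumption: nS ~ W_{p+q}(n, \Sigma). For an index set A of size m < n - 1,
% the sub-block satisfies nS_{AA} ~ W_m(n, \Sigma_{AA}).
\begin{align*}
E^{(0)}\bigl[S_{AA}^{-1}\bigr] &= \frac{n}{n-m-1}\,\Sigma_{AA}^{-1}
\quad\Longrightarrow\quad
E^{(0)}\bigl[n\,\mathrm{tr}\,S_{AA}^{-1}\Sigma_{AA}\bigr] = \frac{n^{2}m}{n-m-1}, \\
% Apply this to the blocks (x_1, x_2), (x_1, x_v) and x_1,
% whose sizes are p, k + q and k.
b_k &= n^{2}\left(\frac{p}{n-p-1} + \frac{k+q}{n-k-q-1} - \frac{k}{n-k-1}\right) - n(p+q).
\end{align*}
```

Adding this $b_k$ to $L(S, \hat{\Sigma}_k)$ gives the expression for $\mathrm{MAIC}_k$ stated above.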

However, these criteria appear to be biased as estimators of the risk when the true distribution is nonnormal, so we now consider EIC, which was proposed by Ishiguro, Sakamoto and Kitagawa [6].

Let $X_u^*$, $X_v^*$ be the bootstrap sample generated according to the empirical distribution $G$ of $X_u$, $X_v$,

$$X_u^* = \begin{pmatrix} X_1^* \\ X_2^* \end{pmatrix} = \begin{pmatrix} x_{11}^*, \ldots, x_{1N}^* \\ x_{21}^*, \ldots, x_{2N}^* \end{pmatrix}, \qquad X_v^* = [x_{v1}^*, \ldots, x_{vN}^*].$$

Then the candidate model is

$$M^* : \left( x_{1j}^{*\prime}, x_{2j}^{*\prime}, x_{vj}^{*\prime} \right)' \sim \text{i.i.d. } G, \qquad j = 1, \ldots, N.$$

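Let $\hat{\mu}^*$ and $\hat{\Sigma}_k^*$ denote the bootstrap estimators of $\mu$ and $\Sigma$. Replacing $\Sigma$ in (1) with the maximum likelihood estimator $(n/N)S$ and $\hat{\Sigma}_k$ with the bootstrap estimator based on $S^*$, we derive the bias term for EIC,

$$\begin{aligned}
\tilde{b}_k(i) &= E^{(0)}[D^*(i)] \\
&= E^{(0)}\left[ L\left(\tfrac{n}{N}S, \hat{\Sigma}_k^*\right) - L\left(S^*, \hat{\Sigma}_k^*\right) \right] \\
&= \frac{n^2}{N} E^{(0)}\left[ \mathrm{tr}\begin{pmatrix} S_{11}^* & S_{12}^* \\ S_{21}^* & S_{22}^* \end{pmatrix}^{-1}\begin{pmatrix} S_{11} & S_{12} \\ S_{21} & S_{22} \end{pmatrix} + \mathrm{tr}\begin{pmatrix} S_{11}^* & S_{1v}^* \\ S_{v1}^* & S_{vv}^* \end{pmatrix}^{-1}\begin{pmatrix} S_{11} & S_{1v} \\ S_{v1} & S_{vv} \end{pmatrix} - \mathrm{tr}\,S_{11}^{*-1}S_{11} \right] - n(p+q),
\end{aligned}$$

where $i$ indexes the $i$th of the $B$ bootstrap samples. Then

$$b_k^* \approx \frac{1}{B}\sum_{i=1}^{B} D^*(i),$$

where $B$ is the number of bootstrap replications. Therefore,

$$\mathrm{EIC}_k = L(S, \hat{\Sigma}_k) + b_k^*.$$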
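The bootstrap approximation of the bias term can be sketched as follows. The code assumes a data matrix `X` of shape $N \times (p+q)$ with columns ordered as $(x_1, x_2, x_v)$, takes $D^*(i)$ to be the quantity inside the expectation in the trace form above, and uses the divisor $n = N - 1$ for both $S$ and $S^*$. The function names, the default $B$ and the simulated data are hypothetical choices, not the author's implementation.

```python
import numpy as np

def trace_terms(S_star, S, k, p):
    """tr over (1,2) plus tr over (1,v) minus tr over (1), inverting S_star blocks."""
    m = S.shape[0]
    i1 = np.arange(k)
    i12 = np.arange(p)
    i1v = np.r_[i1, np.arange(p, m)]

    def tr(idx):
        return np.trace(np.linalg.solve(S_star[np.ix_(idx, idx)],
                                        S[np.ix_(idx, idx)]))

    return tr(i12) + tr(i1v) - tr(i1)

def eic_bias(X, k, p, B=200, seed=0):
    """Average of D*(i) over B bootstrap resamples of the rows of X."""
    rng = np.random.default_rng(seed)
    N, m = X.shape            # m = p + q
    n = N - 1
    S = np.cov(X, rowvar=False)          # divisor n = N - 1
    d = np.empty(B)
    for i in range(B):
        rows = rng.integers(0, N, size=N)  # draw N rows with replacement
        S_star = np.cov(X[rows], rowvar=False)
        d[i] = (n**2 / N) * trace_terms(S_star, S, k, p) - n * m
    return d.mean()

# Hypothetical example: N = 200 observations, p = 3, q = 2, k = 2.
data_rng = np.random.default_rng(42)
X = data_rng.normal(size=(200, 5))
b_star = eic_bias(X, k=2, p=3, B=500, seed=0)
print(b_star)
```

Adding `b_star` to $L(S, \hat{\Sigma}_k)$ would give $\mathrm{EIC}_k$. The resampling draws whole rows, so each bootstrap observation keeps its $x_1$, $x_2$ and $x_v$ parts together, as in the candidate model $M^*$.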

3 Two sets of redundancy models

As in Section 2, let $x_u = (x_{u1}, \ldots, x_{up})'$ and $x_v = (x_{v1}, \ldots, x_{vq})'$ be two random vectors jointly distributed as a $(p+q)$-variate normal distribution with the following mean vector and covariance matrix:

$$\mu = \begin{pmatrix} \mu_u \\ \mu_v \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \Sigma_{uu} & \Sigma_{uv} \\ \Sigma_{vu} & \Sigma_{vv} \end{pmatrix}.$$

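In order to formulate two sets of redundancy models, we partition $x_u$ and $x_v$ as

$$x_u = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}, \qquad x_v = \begin{pmatrix} x_3 \\ x_4 \end{pmatrix},$$

where $x_1' = \{x_{u1}, \ldots, x_{ur_1}\} \subset \{x_{u1}, \ldots, x_{up}\}$ and $x_3' = \{x_{v1}, \ldots, x_{vr_2}\} \subset \{x_{v1}, \ldots, x_{vq}\}$. Conformably,

$$\begin{pmatrix} \mu_u \\ \mu_v \end{pmatrix} = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \\ \mu_4 \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} & \Sigma_{13} & \Sigma_{14} \\ \Sigma_{21} & \Sigma_{22} & \Sigma_{23} & \Sigma_{24} \\ \Sigma_{31} & \Sigma_{32} & \Sigma_{33} & \Sigma_{34} \\ \Sigma_{41} & \Sigma_{42} & \Sigma_{43} & \Sigma_{44} \end{pmatrix}.$$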

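Assuming that $x_2$ and $x_4$ are redundant for $x_v$ and $x_u$, respectively, the candidate models are

$$M_r : \Sigma_{2v\cdot 1} = 0, \quad \Sigma_{4u\cdot 3} = 0.$$

Then the conditional distribution of $(x_2', x_4')'$ given $(x_1', x_3')'$ is a $(p+q-r)$-variate normal distribution with mean vector

$$E\left[ \begin{pmatrix} x_2 \\ x_4 \end{pmatrix} \,\middle|\, \begin{pmatrix} x_1 \\ x_3 \end{pmatrix} \right] = \begin{pmatrix} \mu_2 \\ \mu_4 \end{pmatrix} + \begin{pmatrix} B_{21} & B_{23} \\ B_{41} & B_{43} \end{pmatrix} \begin{pmatrix} x_1 - \mu_1 \\ x_3 - \mu_3 \end{pmatrix}$$

and covariance matrix

$$V\left[ \begin{pmatrix} x_2 \\ x_4 \end{pmatrix} \,\middle|\, \begin{pmatrix} x_1 \\ x_3 \end{pmatrix} \right] = \begin{pmatrix} \Sigma_{22\cdot 13} & \Sigma_{24\cdot 13} \\ \Sigma_{42\cdot 13} & \Sigma_{44\cdot 13} \end{pmatrix},$$

where $r = r_1 + r_2$,

$$B = \begin{pmatrix} B_{21} & B_{23} \\ B_{41} & B_{43} \end{pmatrix} = \begin{pmatrix} \Sigma_{21} & \Sigma_{23} \\ \Sigma_{41} & \Sigma_{43} \end{pmatrix} \begin{pmatrix} \Sigma_{11} & \Sigma_{13} \\ \Sigma_{31} & \Sigma_{33} \end{pmatrix}^{-1},$$

$$\begin{pmatrix} \Sigma_{22\cdot 13} & \Sigma_{24\cdot 13} \\ \Sigma_{42\cdot 13} & \Sigma_{44\cdot 13} \end{pmatrix} = \begin{pmatrix} \Sigma_{22} & \Sigma_{24} \\ \Sigma_{42} & \Sigma_{44} \end{pmatrix} - \begin{pmatrix} \Sigma_{21} & \Sigma_{23} \\ \Sigma_{41} & \Sigma_{43} \end{pmatrix} \begin{pmatrix} \Sigma_{11} & \Sigma_{13} \\ \Sigma_{31} & \Sigma_{33} \end{pmatrix}^{-1} \begin{pmatrix} \Sigma_{12} & \Sigma_{14} \\ \Sigma_{32} & \Sigma_{34} \end{pmatrix}.$$

The redundancy model $M_r$ can be expressed in terms of the conditional set-up as

$$M_r : B_{23} = 0, \quad B_{41} = 0, \quad \Sigma_{24\cdot 13} = 0, \quad \Sigma_{42\cdot 13} = 0.$$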
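The relation between the two forms of $M_r$ can be illustrated numerically. The sketch below builds one covariance matrix that satisfies $\Sigma_{2v\cdot 1} = 0$ and $\Sigma_{4u\cdot 3} = 0$ and then evaluates the quantities of the conditional set-up. The construction through a precision matrix with zero blocks for the pairs $(x_2, x_3)$, $(x_2, x_4)$ and $(x_1, x_4)$ is an assumption of this example, as are the block sizes and the helper name `partial_cov`.

```python
import numpy as np

def partial_cov(sigma, a, b, c):
    """Sigma_{ab.c} = Sigma_ab - Sigma_ac Sigma_cc^{-1} Sigma_cb."""
    return sigma[np.ix_(a, b)] - sigma[np.ix_(a, c)] @ np.linalg.solve(
        sigma[np.ix_(c, c)], sigma[np.ix_(c, b)])

# Hypothetical block sizes for x1, x2, x3, x4 (r1 = 2, p = 3, r2 = 2, q = 3).
sizes = [2, 1, 2, 1]
bounds = np.cumsum([0] + sizes)
i1, i2, i3, i4 = (np.arange(bounds[j], bounds[j + 1]) for j in range(4))
m = int(bounds[-1])

# Symmetric matrix with the chosen zero blocks, shifted to be positive definite.
rng = np.random.default_rng(1)
A = rng.normal(size=(m, m))
K = (A + A.T) / 2
for a, b in [(i2, i3), (i2, i4), (i1, i4)]:
    K[np.ix_(a, b)] = 0.0
    K[np.ix_(b, a)] = 0.0
K += (abs(np.linalg.eigvalsh(K).min()) + 1.0) * np.eye(m)
Sigma = np.linalg.inv(K)  # covariance whose inverse has the zero blocks

# M_r in its original form.
sigma_2v_1 = partial_cov(Sigma, i2, np.r_[i3, i4], i1)
sigma_4u_3 = partial_cov(Sigma, i4, np.r_[i1, i2], i3)

# Conditional set-up: regression of (x2, x4) on (x1, x3).
given, target = np.r_[i1, i3], np.r_[i2, i4]
B = np.linalg.solve(Sigma[np.ix_(given, given)],
                    Sigma[np.ix_(given, target)]).T
cond_cov = partial_cov(Sigma, target, target, given)
n2, r1 = len(i2), len(i1)
B23, B41 = B[:n2, r1:], B[n2:, :r1]
sigma_24_13 = cond_cov[:n2, n2:]
print(np.abs(sigma_2v_1).max(), np.abs(sigma_4u_3).max())
print(np.abs(B23).max(), np.abs(B41).max(), np.abs(sigma_24_13).max())
```

For this example all printed values are zero up to rounding, which is consistent with the equivalence stated above. It checks a single matrix and is not a proof.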

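Let $\hat{\mu}$ and $\hat{\Sigma}_r$ be the unbiased estimators of $\mu$ and $\Sigma$ under $M_r$, respectively. Then we have

$$R_r = -n \log\left( \frac{\begin{vmatrix} S_{22\cdot 13} & S_{24\cdot 13} \\ S_{42\cdot 13} & S_{44\cdot 13} \end{vmatrix}}{|S_{22\cdot 1}| \cdot |S_{44\cdot 3}|} \right) + b_r.$$

Moreover, the bias term is

$$b_r = E^{(0)}\left[ n\,\mathrm{tr}\,\hat{\Sigma}_r^{-1}\Sigma \right] - n(p+q), \qquad (2)$$

where the trace can be decomposed as

$$\begin{aligned}
\mathrm{tr}\,\hat{\Sigma}_r^{-1}\Sigma ={}& \mathrm{tr}\begin{pmatrix} S_{11} & S_{13} \\ S_{31} & S_{33} \end{pmatrix}^{-1}\begin{pmatrix} \Sigma_{11} & \Sigma_{13} \\ \Sigma_{31} & \Sigma_{33} \end{pmatrix} + \mathrm{tr}\begin{pmatrix} S_{11} & S_{12} \\ S_{21} & S_{22} \end{pmatrix}^{-1}\begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} \\
&+ \mathrm{tr}\begin{pmatrix} S_{33} & S_{34} \\ S_{43} & S_{44} \end{pmatrix}^{-1}\begin{pmatrix} \Sigma_{33} & \Sigma_{34} \\ \Sigma_{43} & \Sigma_{44} \end{pmatrix} - \mathrm{tr}\,S_{11}^{-1}\Sigma_{11} - \mathrm{tr}\,S_{33}^{-1}\Sigma_{33}.
\end{aligned}$$

Therefore, AIC and MAIC are given as follows: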

$$\mathrm{AIC}_r = L(S, \hat{\Sigma}_r) + (p+q)(p+q+1) - 2pq + 2r_1 r_2.$$

If the candidate model includes the true model, then, from Fujikoshi [5],

$$\mathrm{MAIC}_r = L(S, \hat{\Sigma}_r) - n(p+q) + n^2\left( \frac{r_1+r_2}{n-r_1-r_2-1} + \frac{p}{n-p-1} + \frac{q}{n-q-1} - \frac{r_1}{n-r_1-1} - \frac{r_2}{n-r_2-1} \right).$$
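The correction term of $\mathrm{MAIC}_r$ can be compared with the one-set term of Section 2 in a few lines. The sketch below evaluates both and checks that the two-set term reduces to the one-set term when $r_2 = q$ (no variable of $x_v$ is treated as redundant) and to the full-model value $n^2(p+q)/(n-p-q-1) - n(p+q)$ when $r_1 = p$ and $r_2 = q$. The function names and the numerical values of $n$, $p$, $q$ are illustrative assumptions.

```python
import math

def maic_correction_one_set(n, p, q, k):
    """MAIC_k minus L(S, Sigma_hat_k) for the one-set model of Section 2."""
    return n**2 * ((k + q) / (n - k - q - 1)
                   + p / (n - p - 1)
                   - k / (n - k - 1)) - n * (p + q)

def maic_correction_two_sets(n, p, q, r1, r2):
    """MAIC_r minus L(S, Sigma_hat_r) for the two-set model."""
    r = r1 + r2
    return n**2 * (r / (n - r - 1)
                   + p / (n - p - 1)
                   + q / (n - q - 1)
                   - r1 / (n - r1 - 1)
                   - r2 / (n - r2 - 1)) - n * (p + q)

# Hypothetical setting.
n, p, q = 50, 4, 3

# With r2 = q the two-set term should coincide with the one-set term for k = r1.
c_two = maic_correction_two_sets(n, p, q, r1=2, r2=q)
c_one = maic_correction_one_set(n, p, q, k=2)

# With r1 = p and r2 = q only the full (p + q)-dimensional block remains.
c_full = maic_correction_two_sets(n, p, q, r1=p, r2=q)
c_full_direct = n**2 * (p + q) / (n - p - q - 1) - n * (p + q)

print(c_two, c_one)
print(c_full, c_full_direct)
```

Each pair of printed values agrees up to floating-point rounding, which is a quick consistency check between the formulas of Sections 2 and 3.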

Let $X_u^*$, $X_v^*$ be the bootstrap sample generated according to the empirical distribution $G$ of $X_u$, $X_v$,

$$X_u^* = \begin{pmatrix} X_1^* \\ X_2^* \end{pmatrix} = \begin{pmatrix} x_{11}^*, \ldots, x_{1N}^* \\ x_{21}^*, \ldots, x_{2N}^* \end{pmatrix}, \qquad X_v^* = \begin{pmatrix} X_3^* \\ X_4^* \end{pmatrix} = \begin{pmatrix} x_{31}^*, \ldots, x_{3N}^* \\ x_{41}^*, \ldots, x_{4N}^* \end{pmatrix}.$$

Then the candidate model is

$$M^* : \left( x_{1j}^{*\prime}, x_{2j}^{*\prime}, x_{3j}^{*\prime}, x_{4j}^{*\prime} \right)' \sim \text{i.i.d. } G, \qquad j = 1, \ldots, N.$$

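Let $\hat{\mu}^*$ and $\hat{\Sigma}_r^*$ denote the bootstrap estimators of $\mu$ and $\Sigma$. Replacing $\Sigma$ in (2) with the maximum likelihood estimator $(n/N)S$ and $\hat{\Sigma}_r$ with the bootstrap estimator based on $S^*$, we derive the bias term for EIC,

$$\begin{aligned}
\tilde{b}_r(i) &= E^{(0)}[D^*(i)] \\
&= E^{(0)}\left[ L\left(\tfrac{n}{N}S, \hat{\Sigma}_r^*\right) - L\left(S^*, \hat{\Sigma}_r^*\right) \right] \\
&= \frac{n^2}{N} E^{(0)}\Biggl[ \mathrm{tr}\begin{pmatrix} S_{11}^* & S_{13}^* \\ S_{31}^* & S_{33}^* \end{pmatrix}^{-1}\begin{pmatrix} S_{11} & S_{13} \\ S_{31} & S_{33} \end{pmatrix} + \mathrm{tr}\begin{pmatrix} S_{11}^* & S_{12}^* \\ S_{21}^* & S_{22}^* \end{pmatrix}^{-1}\begin{pmatrix} S_{11} & S_{12} \\ S_{21} & S_{22} \end{pmatrix} \\
&\qquad\qquad + \mathrm{tr}\begin{pmatrix} S_{33}^* & S_{34}^* \\ S_{43}^* & S_{44}^* \end{pmatrix}^{-1}\begin{pmatrix} S_{33} & S_{34} \\ S_{43} & S_{44} \end{pmatrix} - \mathrm{tr}\,S_{11}^{*-1}S_{11} - \mathrm{tr}\,S_{33}^{*-1}S_{33} \Biggr] - n(p+q),
\end{aligned}$$

where $i$ indexes the $i$th of the $B$ bootstrap samples. Then

$$b_r^* \approx \frac{1}{B}\sum_{i=1}^{B} D^*(i),$$

where $B$ is the number of bootstrap replications. Therefore,

$$\mathrm{EIC}_r = L(S, \hat{\Sigma}_r) + b_r^*.$$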

4 Simulation

We aim to give an impression of the relative performance of AIC, MAIC and EIC. To this end, we simulate these information criteria under several settings and examine how closely each one approximates the true risk and how often each model is selected.

References

[1] Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. 2nd International Symposium on Information Theory, Eds. B. N. Petrov and F. Csáki, pp. 267-281, Budapest: Akadémiai Kiadó.

[2] Anderson, T. W. (2003). An Introduction to Multivariate Statistical Analysis, 3rd Edition, Wiley Interscience.

[3] Fujikoshi, Y. (1982). A test for additional information in canonical correlation analysis. Ann. Inst. Statist. Math., Part A, 34, 523-530.

[4] Fujikoshi, Y. (1985). Selection of variables in discriminant analysis and canonical correlation analysis. Multivariate Analysis VI, Ed. P. R. Krishnaiah, 219-236, Elsevier Science Publishers B.V.

[5] Fujikoshi, Y. (2007). Corrected AIC for selecting of variables in canonical correlation analysis and some conditional independence structures. Submitted for publication.

[6] Ishiguro, M., Sakamoto, Y. and Kitagawa, G. (1996). Bootstrapping log-likelihood and EIC, an extension of AIC. Inst. Statist. Math.

[7] Konishi, S. and Kitagawa, G. (1996). Generalised information criteria in model selection. Biometrika, 83, 4, pp. 875-890.

[8] Rao, C. R. (1973). Linear Statistical Inference and Its Applications, John Wiley, New York.

[9] Siotani, M., Hayakawa, T. and Fujikoshi, Y. (1985). Modern Multivariate Statistical Analysis: A Graduate Course and Handbook, American Sciences Press, Ohio.
