
Isolet data


The last example deals with the Isolet data studied in Weinberger et al. (2006), which are available at http://archive.ics.uci.edu/ml/datasets/ISOLET. The data have 26 classes corresponding to the letters of the alphabet. There are 617 features and a total of 7797 samples: 6238 training samples (238 samples from class 6, and 240 samples from each of the other classes), and 1559 test samples (59 samples from class 13, and 60 samples from each of the other classes).

Table 7.2: Results for the SRBCT data. Numbers in bold show common selected genes in each approach.

Selected genes
m-fair:  1955 2050 1954 1194 1158 174 1003 1389 246 107 1645 951 1980
m-nacc:  1955 481 1158 1954 1194 1888 951 879 1003 174 1389 246
mrmr:    1194 246 742 1003 1389 819 851 338 368 1706 1319 2 545 509 107 867 879 1708 1955 2050

                        m-fair   m-nacc   mrmr    m-dlda
No. of selected genes   13       12       20      2308
Training error          0/63     0/63     0/63    1/63
Test error              2/20     3/20     2/20    5/20

As our setting is that of hdlss, we randomly picked 20 samples from each class of the training data. We applied each of the four methods considered in Subsection 7.2 to the 520 samples of the new training data, and evaluated their performances on the full test data consisting of 1559 samples. We repeated the above procedure 100 times. Boxplots of the test errors and the number of selected features are shown in Figure 7.1. The left panel of Figure 7.1 shows the number of selected features. Figure 7.1 shows that m-fair selects fewer features than the other approaches. However, the misclassification rate of m-fair appears unstable, as can be observed in the right panel. This could be a consequence of the small number of selected features. On the other hand, m-nacc and mrmr select about the same number of features, but the test error of m-nacc is smaller than that of mrmr and m-fair. The detailed values of the boxplots are summarized in Table 7.3. From these values we can see that the average test error of m-nacc is almost equal to that of m-dlda, but m-nacc obtains the same accuracy with only one-third of the number of features.
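The resampling protocol just described is easy to reproduce. The following is a minimal sketch only, assuming hypothetical placeholder routines `fit_and_select` and `predict` for a feature-selection-plus-classification method (such as m-nacc, m-fair or mrmr); it is not the implementation used in this thesis.

```python
import numpy as np

def evaluate_method(fit_and_select, predict, X_train, y_train, X_test, y_test,
                    n_per_class=20, n_repeats=100, seed=0):
    """Repeat: draw n_per_class training samples per class, fit, and test.

    fit_and_select and predict are hypothetical placeholders for a
    feature-selection + classification method. Returns arrays of test
    errors and of the numbers of selected features over the repetitions.
    """
    rng = np.random.default_rng(seed)
    classes = np.unique(y_train)
    test_errors, n_features = [], []
    for _ in range(n_repeats):
        # stratified subsample: n_per_class samples from each class
        idx = np.concatenate([
            rng.choice(np.where(y_train == c)[0], size=n_per_class, replace=False)
            for c in classes
        ])
        model, selected = fit_and_select(X_train[idx], y_train[idx])
        y_hat = predict(model, X_test[:, selected])
        test_errors.append(np.mean(y_hat != y_test))
        n_features.append(len(selected))
    return np.array(test_errors), np.array(n_features)
```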

[Figure 7.1 appears here. Left panel: boxplots of the number of selected features for M-FAIR, M-NACC and MRMR. Right panel: boxplots of the test error for M-FAIR, M-NACC, MRMR and M-DLDA.]

Figure 7.1: Isolet data. The left panel shows boxplots of the number of selected features for m-nacc, m-fair and mrmr over 100 iterations. m-fair selects the smallest number, and it has no outliers (open circles); m-nacc has two outliers and mrmr has six. The right panel shows boxplots of the test errors of m-nacc, m-fair, mrmr and m-dlda for the same simulated data as in the left panel. m-fair is not as stable as the other approaches, with many outliers and large errors. m-nacc and m-dlda are superior to the other approaches.

Table 7.3: Results for the Isolet data. Descriptive statistics of the number of selected features and test error for each method over 100 iterations.

No. of selected features

               m-fair    m-nacc    mrmr      m-dlda
SD             43.69     33.38     52.31     0.00
min            25.00     139.00    25.00     617.00
1st quartile   74.00     166.80    171.80    617.00
median         102.00    181.50    189.50    617.00
average        99.33     185.60    205.10    617.00
3rd quartile   130.50    198.00    227.20    617.00
max            197.00    420.00    368.00    617.00

Test error

               m-fair     m-nacc     mrmr       m-dlda
SD             0.13361    0.01483    0.04152    0.01242
min            0.12190    0.11030    0.13410    0.10780
1st quartile   0.15190    0.13710    0.16340    0.14110
median         0.16900    0.14820    0.17670    0.14820
average        0.22500    0.14770    0.18200    0.14850
3rd quartile   0.20510    0.15590    0.19500    0.15520
max            0.61390    0.18410    0.52020    0.18220
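The descriptive statistics reported in Table 7.3 can be computed directly from the per-iteration outputs of such a simulation. A small sketch, again only illustrative and reusing the hypothetical outputs of the snippet above:

```python
import numpy as np

def describe(values):
    """Descriptive statistics matching the rows of Table 7.3."""
    return {
        "SD": np.std(values, ddof=1),
        "min": np.min(values),
        "1st quartile": np.percentile(values, 25),
        "median": np.median(values),
        "average": np.mean(values),
        "3rd quartile": np.percentile(values, 75),
        "max": np.max(values),
    }

# Example with hypothetical test errors from 100 iterations.
test_errors = np.random.default_rng(1).uniform(0.11, 0.18, size=100)
print(describe(test_errors))
```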

8 Proofs

Proof of Lemma 4.1

By Condition A, for fixed $\ell\le K$ we have
$$\hat\mu_\ell - \hat\mu = (\mu_\ell-\mu) + (\bar\varepsilon_\ell-\bar\varepsilon) - \sum_{k=1}^{K}\Big(\frac{n_k}{n}-\pi_k\Big)\mu_k,$$
where $\bar\varepsilon_\ell = (1/n_\ell)\sum_{i=1}^{n_\ell}\varepsilon_{\ell i}$ and $\bar\varepsilon = (1/n)\sum_{\ell=1}^{K}\sum_{i=1}^{n_\ell}\varepsilon_{\ell i}$. Thus,
$$\hat M_0N^{1/2} - M_0\Pi^{1/2} = M_0\big(N^{1/2}-\Pi^{1/2}\big) - \tilde M(N-\Pi)\mathbf 1_K\mathbf 1_K^TN^{1/2} + E_0N^{1/2},$$
where $\tilde M = [\,\mu_1\ \cdots\ \mu_K\,]$ and $E_0 = [\,\bar\varepsilon_1-\bar\varepsilon\ \cdots\ \bar\varepsilon_K-\bar\varepsilon\,]$.

Therefore, $\hat C^T\hat C$ can be written as
$$
\begin{aligned}
\hat C^T\hat C ={}& \big(N^{1/2}-\Pi^{1/2}\big)M_0^T\hat D^{-1}M_0\big(N^{1/2}-\Pi^{1/2}\big) - \big(N^{1/2}-\Pi^{1/2}\big)M_0^T\hat D^{-1}\tilde M(N-\Pi)\mathbf 1_K\mathbf 1_K^TN^{1/2} \\
&+ \big(N^{1/2}-\Pi^{1/2}\big)M_0^T\hat D^{-1}E_0N^{1/2} + \big(N^{1/2}-\Pi^{1/2}\big)M_0^T\hat D^{-1}M_0\Pi^{1/2} \\
&- N^{1/2}\mathbf 1_K\mathbf 1_K^T(N-\Pi)\tilde M^T\hat D^{-1}M_0\big(N^{1/2}-\Pi^{1/2}\big) + N^{1/2}\mathbf 1_K\mathbf 1_K^T(N-\Pi)\tilde M^T\hat D^{-1}\tilde M(N-\Pi)\mathbf 1_K\mathbf 1_K^TN^{1/2} \\
&- N^{1/2}\mathbf 1_K\mathbf 1_K^T(N-\Pi)\tilde M^T\hat D^{-1}E_0N^{1/2} - N^{1/2}\mathbf 1_K\mathbf 1_K^T(N-\Pi)\tilde M^T\hat D^{-1}M_0\Pi^{1/2} \\
&+ N^{1/2}E_0^T\hat D^{-1}M_0\big(N^{1/2}-\Pi^{1/2}\big) - N^{1/2}E_0^T\hat D^{-1}\tilde M(N-\Pi)\mathbf 1_K\mathbf 1_K^TN^{1/2} \\
&+ N^{1/2}E_0^T\hat D^{-1}E_0N^{1/2} + N^{1/2}E_0^T\hat D^{-1}M_0\Pi^{1/2} \\
&+ \Pi^{1/2}M_0^T\hat D^{-1}M_0\big(N^{1/2}-\Pi^{1/2}\big) - \Pi^{1/2}M_0^T\hat D^{-1}\tilde M(N-\Pi)\mathbf 1_K\mathbf 1_K^TN^{1/2} \\
&+ \Pi^{1/2}M_0^T\hat D^{-1}E_0N^{1/2} + \Pi^{1/2}M_0^T\hat D^{-1}M_0\Pi^{1/2}. \qquad (8.1)
\end{aligned}
$$
From Condition B, it follows that $\hat D = D(1+o_P(1))$ (see Fan and Fan (2008)), and this leads to the following expressions:

$$M_0^T\hat D^{-1}M_0 = \Big((\mu_k-\mu)^TD^{-1}(\mu_\ell-\mu)\Big)_{1\le k,\ell\le K}(1+o_P(1)) = \mathbf 1_K\mathbf 1_K^T\,O(C_d),$$
$$M_0^T\hat D^{-1}\tilde M = \Big((\mu_k-\mu)^TD^{-1}\mu_\ell\Big)_{1\le k,\ell\le K}(1+o_P(1)) = \mathbf 1_K\mathbf 1_K^T\,O(C_d),$$
$$\tilde M^T\hat D^{-1}\tilde M = \Big(\mu_k^TD^{-1}\mu_\ell\Big)_{1\le k,\ell\le K}(1+o_P(1)) = \mathbf 1_K\mathbf 1_K^T\,O(C_d\delta) + I_K\,O(C_d)$$
by Condition E. From the evaluation of the term $I_3$ on p. 2626 of Fan and Fan (2008), we have
$$M_0^T\hat D^{-1}E_0 = \mathbf 1_K\mathbf 1_K^T\,o_P(C_d), \qquad \tilde M^T\hat D^{-1}E_0 = \mathbf 1_K\mathbf 1_K^T\,o_P(C_d).$$

Consider the matrix $E_0^T\hat D^{-1}E_0$ of $\hat C^T\hat C$. We have
$$E_0^T\hat D^{-1}E_0 = E_0^TD^{-1}E_0(1+o_P(1)) = \Big((\bar\varepsilon_k-\bar\varepsilon)^TD^{-1}(\bar\varepsilon_\ell-\bar\varepsilon)\Big)_{1\le k,\ell\le K}(1+o_P(1)).$$
In particular, we need to evaluate the variance term $V\big[(\bar\varepsilon_k-\bar\varepsilon)^TD^{-1}(\bar\varepsilon_\ell-\bar\varepsilon)\big]$. If $k=\ell$, this variance can be obtained as
$$V\big[(\bar\varepsilon_\ell-\bar\varepsilon)^TD^{-1}(\bar\varepsilon_\ell-\bar\varepsilon)\big] = \mathrm{tr}\Big\{(D^{-1}\otimes D^{-1})\,E\big[(\bar\varepsilon_\ell-\bar\varepsilon)(\bar\varepsilon_\ell-\bar\varepsilon)^T\otimes(\bar\varepsilon_\ell-\bar\varepsilon)(\bar\varepsilon_\ell-\bar\varepsilon)^T\big]\Big\} - \big\{\mathrm{tr}(D^{-1}\Sigma_{\bar\varepsilon})\big\}^2 \qquad (8.2)$$

by using Theorem 9.18 of Schott (1996), where $\otimes$ is the Kronecker product; that is, for $A\in\mathbb R^{m\times n}$ and $B\in\mathbb R^{p\times q}$,
$$A\otimes B = \begin{pmatrix} a_{11}B & a_{12}B & \cdots & a_{1n}B\\ a_{21}B & a_{22}B & \cdots & a_{2n}B\\ \vdots & \vdots & \ddots & \vdots\\ a_{m1}B & a_{m2}B & \cdots & a_{mn}B \end{pmatrix}\in\mathbb R^{mp\times nq},$$
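As a side note, not part of the proof, the block structure in this definition is exactly what `numpy.kron` computes, which can be used to check small examples numerically:

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])          # A is 2x2, so m = n = 2
B = np.array([[0, 5],
              [6, 7]])          # B is 2x2, so p = q = 2

K = np.kron(A, B)               # K is (mp) x (nq) = 4 x 4

# The (i, j) block of K equals a_ij * B, matching the definition above.
assert np.array_equal(K[:2, :2], 1 * B)
assert np.array_equal(K[:2, 2:], 2 * B)
assert np.array_equal(K[2:, :2], 3 * B)
assert np.array_equal(K[2:, 2:], 4 * B)
print(K)
```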

and $\Sigma_{\bar\varepsilon} = V[\bar\varepsilon_\ell-\bar\varepsilon] = (1/n_\ell-1/n)\Sigma$. Thus, we have $\{\mathrm{tr}(D^{-1}\Sigma_{\bar\varepsilon})\}^2 = d^2(1/n_\ell-1/n)^2$. Since $D^{-1}\otimes D^{-1}$ is a diagonal matrix, (8.2) can be written as
$$V\big[(\bar\varepsilon_\ell-\bar\varepsilon)^TD^{-1}(\bar\varepsilon_\ell-\bar\varepsilon)\big] = \mathrm{tr}\Big\{(D^{-1}\otimes D^{-1})\,E\Big[\mathrm{diag}\big\{(\bar\varepsilon_\ell-\bar\varepsilon)(\bar\varepsilon_\ell-\bar\varepsilon)^T\otimes(\bar\varepsilon_\ell-\bar\varepsilon)(\bar\varepsilon_\ell-\bar\varepsilon)^T\big\}\Big]\Big\} - d^2\Big(\frac1{n_\ell}-\frac1n\Big)^2,$$
by the property of the trace of the relevant matrix. The diagonal elements can be written as
$$(D^{-1}\otimes D^{-1})\,E\Big[\mathrm{diag}\big\{(\bar\varepsilon_\ell-\bar\varepsilon)(\bar\varepsilon_\ell-\bar\varepsilon)^T\otimes(\bar\varepsilon_\ell-\bar\varepsilon)(\bar\varepsilon_\ell-\bar\varepsilon)^T\big\}\Big] = \mathrm{diag}\,(v_1,\ldots,v_{d^2}), \qquad (8.3)$$
where
$$v_j = \begin{cases} \dfrac{E\big[(\bar\varepsilon_{\ell s}-\bar\varepsilon_s)^4\big]}{\sigma_{ss}^2}, & \text{for } j=(s-1)d+s,\ s\in\{1,\ldots,d\},\\[2ex] \dfrac{E\big[(\bar\varepsilon_{\ell s}-\bar\varepsilon_s)^2(\bar\varepsilon_{\ell t}-\bar\varepsilon_t)^2\big]}{\sigma_{ss}\sigma_{tt}}, & \text{for all other values of } j\le d^2,\ s\ne t,\end{cases}$$
and $\bar\varepsilon_{\ell s}$ and $\bar\varepsilon_s$ are the $s$th elements of $\bar\varepsilon_\ell$ and $\bar\varepsilon$, respectively.

Next, we expand $\bar\varepsilon_{\ell s}-\bar\varepsilon_s$. This difference can be written as
$$\bar\varepsilon_{\ell s}-\bar\varepsilon_s = \frac1{n_\ell}\sum_{i=1}^{n_\ell}\varepsilon_{\ell is} - \frac1n\sum_{k=1}^{K}\sum_{i=1}^{n_k}\varepsilon_{kis} = \Big(\frac1{n_\ell}-\frac1n\Big)\sum_{i=1}^{n_\ell}\varepsilon_{\ell is} - \frac1n\sum_{k\ne\ell}\sum_{i=1}^{n_k}\varepsilon_{kis}.$$
Using the properties $E[\varepsilon_{\ell s}\varepsilon_{ks}] = E[\varepsilon_{\ell s}]E[\varepsilon_{ks}]$ and $E[\varepsilon_{\ell s}] = 0$, we have
$$E\big[(\bar\varepsilon_{\ell s}-\bar\varepsilon_s)^4\big] = \Big(\frac1{n_\ell}-\frac1n\Big)^4 E\Bigg[\Big(\sum_{i=1}^{n_\ell}\varepsilon_{\ell is}\Big)^4\Bigg] + \frac1{n^4}E\Bigg[\Big(\sum_{k\ne\ell}\sum_{i=1}^{n_k}\varepsilon_{kis}\Big)^4\Bigg] + \frac6{n^2}\Big(\frac1{n_\ell}-\frac1n\Big)^2 E\Bigg[\Big(\sum_{i=1}^{n_\ell}\varepsilon_{\ell is}\Big)^2\Bigg]E\Bigg[\Big(\sum_{k\ne\ell}\sum_{i=1}^{n_k}\varepsilon_{kis}\Big)^2\Bigg].$$

In particular, we find that
$$E\Bigg[\Big(\sum_{i=1}^{n_\ell}\varepsilon_{\ell is}\Big)^4\Bigg] = n_\ell E\big[\varepsilon_{\ell is}^4\big] + 3n_\ell(n_\ell-1)\big\{E\big[\varepsilon_{\ell is}^2\big]\big\}^2 = n_\ell\xi_{ss} + 3n_\ell(n_\ell-1)\sigma_{ss}^2,$$
$$E\Bigg[\Big(\sum_{k\ne\ell}\sum_{i=1}^{n_k}\varepsilon_{kis}\Big)^4\Bigg] = \sum_{k\ne\ell}n_kE\big[\varepsilon_{kis}^4\big] + 3\sum_{k\ne\ell}n_k(n_k-1)\big\{E\big[\varepsilon_{kis}^2\big]\big\}^2 = (n-n_\ell)\xi_{ss} + 3\sum_{k\ne\ell}n_k(n_k-1)\sigma_{ss}^2,$$
$$E\Bigg[\Big(\sum_{i=1}^{n_\ell}\varepsilon_{\ell is}\Big)^2\Bigg] = V\Bigg[\sum_{i=1}^{n_\ell}\varepsilon_{\ell is}\Bigg] = n_\ell\sigma_{ss}, \qquad E\Bigg[\Big(\sum_{k\ne\ell}\sum_{i=1}^{n_k}\varepsilon_{kis}\Big)^2\Bigg] = V\Bigg[\sum_{k\ne\ell}\sum_{i=1}^{n_k}\varepsilon_{kis}\Bigg] = (n-n_\ell)\sigma_{ss},$$
where $\xi_{st} = E\big[\varepsilon_{11s}^2\varepsilon_{11t}^2\big]$.
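The first identity above can be seen by expanding the fourth power of a sum of i.i.d. mean-zero variables and noting that only terms in which every index is paired have nonzero expectation; this step is implicit in the original argument:
$$E\Bigg[\Big(\sum_{i=1}^{m}X_i\Big)^4\Bigg] = \sum_{i=1}^m E[X_i^4] + 3\sum_{i\ne j}E[X_i^2]E[X_j^2] = m\,E[X_1^4] + 3m(m-1)\big(E[X_1^2]\big)^2,$$
applied with $m=n_\ell$, $X_i=\varepsilon_{\ell is}$, $E[X_1^4]=\xi_{ss}$ and $E[X_1^2]=\sigma_{ss}$; the remaining identities follow in the same way.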

Therefore, the $((s-1)d+s)$th diagonal element of (8.3) becomes
$$E\big[(\bar\varepsilon_{\ell s}-\bar\varepsilon_s)^4\big] = \frac{(n-n_\ell)(3n_\ell^2-3nn_\ell+n^2)}{n^3n_\ell^3}\,\xi_{ss} + \frac3{n^4}\sum_{k\ne\ell}n_k(n_k-1)\,\sigma_{ss}^2 + 3\Big(\frac1{n_\ell}-\frac1n\Big)^2\frac{(n-n_\ell)\big(n_\ell(n_\ell+1)+n(n_\ell-1)\big)}{n^2n_\ell}\,\sigma_{ss}^2.$$

From tedious but direct calculations we have
$$E\Bigg[\Big(\sum_{i=1}^{n_\ell}\varepsilon_{\ell is}\Big)\Big(\sum_{i=1}^{n_\ell}\varepsilon_{\ell it}\Big)\Bigg] = n_\ell\sigma_{st},$$
$$E\Bigg[\Big(\sum_{i=1}^{n_\ell}\varepsilon_{\ell is}\Big)^2\Big(\sum_{k\ne\ell}\sum_{i=1}^{n_k}\varepsilon_{kit}\Big)^2\Bigg] = n_\ell(n-n_\ell)\sigma_{ss}\sigma_{tt},$$
$$E\Bigg[\Big(\sum_{k\ne\ell}\sum_{i=1}^{n_k}\varepsilon_{kis}\Big)\Big(\sum_{i=1}^{n_\ell}\varepsilon_{\ell is}\Big)\Big(\sum_{k\ne\ell}\sum_{i=1}^{n_k}\varepsilon_{kit}\Big)\Big(\sum_{i=1}^{n_\ell}\varepsilon_{\ell it}\Big)\Bigg] = n_\ell(n-n_\ell)\sigma_{st}^2,$$
$$E\Bigg[\Big(\sum_{i=1}^{n_\ell}\varepsilon_{\ell is}\Big)^2\Big(\sum_{i=1}^{n_\ell}\varepsilon_{\ell it}\Big)^2\Bigg] = n_\ell\xi_{st} + n_\ell(n_\ell-1)\sigma_{ss}\sigma_{tt} + \tau_\ell\,\sigma_{st}^2,$$
$$E\Bigg[\Big(\sum_{k\ne\ell}\sum_{i=1}^{n_k}\varepsilon_{kis}\Big)^2\Big(\sum_{k\ne\ell}\sum_{i=1}^{n_k}\varepsilon_{kit}\Big)^2\Bigg] = (n-n_\ell)\xi_{st} + (n-n_\ell)(n-n_\ell-1)\sigma_{ss}\sigma_{tt} + \tau_{-\ell}\,\sigma_{st}^2,$$
where $\tau_\ell$ and $\tau_{-\ell}$ are the numbers of combinations that arose throughout the calculations, and whose orders are $O(n^2)$. The different expressions above lead to

$$
\begin{aligned}
E\big[(\bar\varepsilon_{\ell s}-\bar\varepsilon_s)^2(\bar\varepsilon_{\ell t}-\bar\varepsilon_t)^2\big]
={}& \Big(\frac1{n_\ell}-\frac1n\Big)^4 E\Bigg[\Big(\sum_{i=1}^{n_\ell}\varepsilon_{\ell is}\Big)^2\Big(\sum_{i=1}^{n_\ell}\varepsilon_{\ell it}\Big)^2\Bigg]
+ \Big(\frac1{n_\ell}-\frac1n\Big)^2\frac1{n^2}E\Bigg[\Big(\sum_{i=1}^{n_\ell}\varepsilon_{\ell is}\Big)^2\Big(\sum_{k\ne\ell}\sum_{i=1}^{n_k}\varepsilon_{kit}\Big)^2\Bigg]\\
&+ \frac4{n^2}\Big(\frac1{n_\ell}-\frac1n\Big)^2 E\Bigg[\Big(\sum_{k\ne\ell}\sum_{i=1}^{n_k}\varepsilon_{kis}\Big)\Big(\sum_{i=1}^{n_\ell}\varepsilon_{\ell is}\Big)\Big(\sum_{k\ne\ell}\sum_{i=1}^{n_k}\varepsilon_{kit}\Big)\Big(\sum_{i=1}^{n_\ell}\varepsilon_{\ell it}\Big)\Bigg]\\
&+ \Big(\frac1{n_\ell}-\frac1n\Big)^2\frac1{n^2}E\Bigg[\Big(\sum_{k\ne\ell}\sum_{i=1}^{n_k}\varepsilon_{kis}\Big)^2\Big(\sum_{i=1}^{n_\ell}\varepsilon_{\ell it}\Big)^2\Bigg]
+ \frac1{n^4}E\Bigg[\Big(\sum_{k\ne\ell}\sum_{i=1}^{n_k}\varepsilon_{kis}\Big)^2\Big(\sum_{k\ne\ell}\sum_{i=1}^{n_k}\varepsilon_{kit}\Big)^2\Bigg]\\
={}& \frac{(n-n_\ell)(3n_\ell^2-3nn_\ell+n^2)}{n^3n_\ell^3}\,\xi_{st} + \frac{(n-n_\ell)(n^2n_\ell+3n_\ell^2-nn_\ell^2-3nn_\ell+n^2)}{n^3n_\ell}\,\sigma_{ss}\sigma_{tt} + \tau\,\sigma_{st}^2,
\end{aligned}
$$
where $\tau = (1/n_\ell-1/n)^4\tau_\ell + (1/n)^4\tau_{-\ell} + 4(1/n_\ell-1/n)^2\,n_\ell(n-n_\ell)/n^2$. Combining the above calculations results in

$$V\big[(\bar\varepsilon_\ell-\bar\varepsilon)^TD^{-1}(\bar\varepsilon_\ell-\bar\varepsilon)\big] = O\Big(\frac{d^2}{n^3}\Big) + O\Big(\frac1{n^2}\Big)\sum_{s,t}\rho_{st}^2,$$
where $\rho_{st}$ is the $(s,t)$ component of the correlation matrix $R$. The sum is evaluated as
$$\sum_{s,t}\rho_{st}^2 = \mathbf 1_d^T(R\odot R)\mathbf 1_d \le \lambda_{\max}(R)\Big\{\max_{1\le s\le d}\rho_{ss}\Big\}\mathbf 1_d^T\mathbf 1_d \le b_0d$$

by the definition of the parameter space $\Theta$ in (4.3), where $\odot$ is the Hadamard product; that is, if $A$ and $B$ are $m\times n$ matrices, then
$$A\odot B = \begin{pmatrix} a_{11}b_{11} & \cdots & a_{1n}b_{1n}\\ \vdots & \ddots & \vdots\\ a_{m1}b_{m1} & \cdots & a_{mn}b_{mn} \end{pmatrix}.$$
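The identity $\sum_{s,t}\rho_{st}^2 = \mathbf 1_d^T(R\odot R)\mathbf 1_d$ and the bound used above are easy to check numerically; the following is a quick illustration only, with a small made-up correlation matrix:

```python
import numpy as np

# A small correlation matrix R (for illustration only).
R = np.array([[1.0, 0.3, 0.1],
              [0.3, 1.0, 0.2],
              [0.1, 0.2, 1.0]])

d = R.shape[0]
ones = np.ones(d)

# Hadamard (element-wise) product R ⊙ R.
had = R * R

# The identity sum_{s,t} rho_st^2 = 1^T (R ⊙ R) 1 used in the proof.
lhs = np.sum(R ** 2)
rhs = ones @ had @ ones
assert np.isclose(lhs, rhs)

# The bound 1^T (R ⊙ R) 1 <= lambda_max(R) * max_s rho_ss * d.
bound = np.max(np.linalg.eigvalsh(R)) * np.max(np.diag(R)) * d
print(lhs, bound)   # lhs should not exceed bound
```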

Therefore, (8.2) can be evaluated as
$$V\big[(\bar\varepsilon_\ell-\bar\varepsilon)^TD^{-1}(\bar\varepsilon_\ell-\bar\varepsilon)\big] = O\Big(\frac{d^2}{n^3}\Big).$$
Using Chebyshev's inequality, for any $\varepsilon>0$, we have
$$P\Bigg(\bigg|\frac{(\bar\varepsilon_\ell-\bar\varepsilon)^TD^{-1}(\bar\varepsilon_\ell-\bar\varepsilon) - E\big[(\bar\varepsilon_\ell-\bar\varepsilon)^TD^{-1}(\bar\varepsilon_\ell-\bar\varepsilon)\big]}{C_d}\bigg| > \varepsilon\Bigg) \le O\Big(\frac{d^2}{n^3C_d^2}\Big) = o(1).$$
Hence, $(\bar\varepsilon_\ell-\bar\varepsilon)^TD^{-1}(\bar\varepsilon_\ell-\bar\varepsilon)$ can be evaluated as
$$(\bar\varepsilon_\ell-\bar\varepsilon)^TD^{-1}(\bar\varepsilon_\ell-\bar\varepsilon) = \Big(\frac1{n_\ell}-\frac1n\Big)d + o_P(C_d).$$

Next, we evaluate $V\big[(\bar\varepsilon_\ell-\bar\varepsilon)^TD^{-1}(\bar\varepsilon_k-\bar\varepsilon)\big]$ for $\ell\ne k$. Using Theorems 7.7 and 7.14 to 7.16 of Schott (1996), we get
$$V\big[(\bar\varepsilon_\ell-\bar\varepsilon)^TD^{-1}(\bar\varepsilon_k-\bar\varepsilon)\big] = \mathrm{tr}\Big\{(D^{-1}\otimes D^{-1})\,E\Big[\mathrm{diag}\big\{(\bar\varepsilon_\ell-\bar\varepsilon)(\bar\varepsilon_k-\bar\varepsilon)^T\otimes(\bar\varepsilon_\ell-\bar\varepsilon)(\bar\varepsilon_k-\bar\varepsilon)^T\big\}\Big]\Big\} - \Big\{\mathrm{tr}\big(D^{-1}E\big[\mathrm{diag}\,(\bar\varepsilon_\ell-\bar\varepsilon)(\bar\varepsilon_k-\bar\varepsilon)^T\big]\big)\Big\}^2.$$

We first calculate the $j$th diagonal element of $(\bar\varepsilon_\ell-\bar\varepsilon)(\bar\varepsilon_k-\bar\varepsilon)^T$. By noting that
$$\bar\varepsilon_{\ell j}-\bar\varepsilon_j = \frac1{n_\ell}\sum_{i=1}^{n_\ell}\varepsilon_{\ell ij} - \frac1n\sum_{k=1}^{K}\sum_{i=1}^{n_k}\varepsilon_{kij} = \Big(\frac1{n_\ell}-\frac1n\Big)\sum_{i=1}^{n_\ell}\varepsilon_{\ell ij} - \frac1n\sum_{i=1}^{n_k}\varepsilon_{kij} - \frac1n\sum_{h\ne\ell,k}\sum_{i=1}^{n_h}\varepsilon_{hij},$$
we have $E\big[(\bar\varepsilon_{\ell j}-\bar\varepsilon_j)(\bar\varepsilon_{kj}-\bar\varepsilon_j)\big] = -\sigma_{jj}/n$. Consequently, we obtain
$$\Big\{\mathrm{tr}\big(D^{-1}E\big[\mathrm{diag}\,(\bar\varepsilon_\ell-\bar\varepsilon)(\bar\varepsilon_k-\bar\varepsilon)^T\big]\big)\Big\}^2 = \frac{d^2}{n^2}.$$
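As a short check, not spelled out in the original, this expectation follows directly from the covariances of the class means: for $\ell\ne k$,
$$E\big[(\bar\varepsilon_{\ell j}-\bar\varepsilon_j)(\bar\varepsilon_{kj}-\bar\varepsilon_j)\big] = \mathrm{Cov}(\bar\varepsilon_{\ell j},\bar\varepsilon_{kj}) - \mathrm{Cov}(\bar\varepsilon_{\ell j},\bar\varepsilon_j) - \mathrm{Cov}(\bar\varepsilon_{kj},\bar\varepsilon_j) + V(\bar\varepsilon_j) = 0 - \frac{\sigma_{jj}}n - \frac{\sigma_{jj}}n + \frac{\sigma_{jj}}n = -\frac{\sigma_{jj}}n,$$
since $\mathrm{Cov}(\bar\varepsilon_{hj},\bar\varepsilon_j) = (n_h/n)V(\bar\varepsilon_{hj}) = \sigma_{jj}/n$ for each $h$ and $V(\bar\varepsilon_j)=\sigma_{jj}/n$.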

Next, we consider the diagonal matrix
$$(D^{-1}\otimes D^{-1})\,E\Big[\mathrm{diag}\big\{(\bar\varepsilon_\ell-\bar\varepsilon)(\bar\varepsilon_k-\bar\varepsilon)^T\otimes(\bar\varepsilon_\ell-\bar\varepsilon)(\bar\varepsilon_k-\bar\varepsilon)^T\big\}\Big] = \mathrm{diag}\,(u_1,\ldots,u_{d^2}),$$
where
$$u_j = \begin{cases} \dfrac{E\big[(\bar\varepsilon_{\ell s}-\bar\varepsilon_s)^2(\bar\varepsilon_{ks}-\bar\varepsilon_s)^2\big]}{\sigma_{ss}^2}, & j=(s-1)d+s \text{ for } s\in\{1,\ldots,d\},\\[2ex] \dfrac{E\big[(\bar\varepsilon_{\ell s}-\bar\varepsilon_s)(\bar\varepsilon_{ks}-\bar\varepsilon_s)(\bar\varepsilon_{\ell t}-\bar\varepsilon_t)(\bar\varepsilon_{kt}-\bar\varepsilon_t)\big]}{\sigma_{ss}\sigma_{tt}}, & \text{for all other values of } j\le d^2,\ s\ne t.\end{cases}$$

If $j=(s-1)d+s$, then we have
$$E\big[(\bar\varepsilon_{\ell s}-\bar\varepsilon_s)^2(\bar\varepsilon_{ks}-\bar\varepsilon_s)^2\big] = \frac{n(n_\ell+n_k)-3n_\ell n_k}{n_\ell n_kn^3}\,\xi_{ss} + \kappa(\ell,k)\,\sigma_{ss}^2,$$
where $\kappa(\ell,k)$ is the coefficient of $\sigma_{ss}^2$. Note that the order of $\kappa(\ell,k)$ is $O(1/n^2)$, which we state here without giving a detailed proof. On the other hand, we have
$$E\big[(\bar\varepsilon_{\ell s}-\bar\varepsilon_s)(\bar\varepsilon_{ks}-\bar\varepsilon_s)(\bar\varepsilon_{\ell t}-\bar\varepsilon_t)(\bar\varepsilon_{kt}-\bar\varepsilon_t)\big] = \frac{n(n_\ell+n_k)-3n_\ell n_k}{n_\ell n_kn^3}\,\xi_{st} + \Big\{\frac3{n^3} - \frac1{n^2}\Big(\frac1{n_\ell}+\frac1{n_k}\Big) + \frac1{n^2}\Big\}\sigma_{ss}\sigma_{tt} + \tau\,\sigma_{st}^2$$
when $\ell\ne k$, where $\tau = O(1/n^2)$. From the above calculations, we have

$$V\big[(\bar\varepsilon_\ell-\bar\varepsilon)^TD^{-1}(\bar\varepsilon_k-\bar\varepsilon)\big] = O\Big(\frac{d^2}{n^3}\Big) + O\Big(\frac1{n^2}\Big)\sum_{s,t}\rho_{st}^2 = O\Big(\frac{d^2}{n^3}\Big).$$
Chebyshev's inequality now implies that $(\bar\varepsilon_\ell-\bar\varepsilon)^TD^{-1}(\bar\varepsilon_k-\bar\varepsilon) = -d/n + o_P(C_d)$, and consequently,
$$
\begin{aligned}
N^{1/2}E_0^T\hat D^{-1}E_0N^{1/2} &= N^{1/2}\Big((\bar\varepsilon_\ell-\bar\varepsilon)^TD^{-1}(\bar\varepsilon_k-\bar\varepsilon)\Big)_{1\le\ell,k\le K}N^{1/2}(1+o_P(1))\\
&= N^{1/2}\Big(d\Big(\frac1{n_\ell}-\frac1n\Big)\delta_{\ell,k} - \frac dn(1-\delta_{\ell,k}) + o_P(C_d)\Big)_{1\le\ell,k\le K}N^{1/2}(1+o_P(1))\\
&= \frac dn\big(I_K - N^{1/2}\mathbf 1_K\mathbf 1_K^TN^{1/2}\big) + \mathbf 1_K\mathbf 1_K^T\,o_P(C_d).
\end{aligned}
$$

The previous calculations can now be summarized and lead to the desired expansion of $\hat C^T\hat C/C_d$, namely
$$\frac{\hat C^T\hat C}{C_d} = \frac{C^TC}{C_d} + \frac d{nC_d}\big(I_K - N^{1/2}\mathbf 1_K\mathbf 1_K^TN^{1/2}\big) + \mathbf 1_K\mathbf 1_K^T\,o_P(1) = \frac{C^TC}{C_d} + \xi\big(I_K - \Pi^{1/2}\mathbf 1_K\mathbf 1_K^T\Pi^{1/2}\big) + \mathbf 1_K\mathbf 1_K^T\,o_P(1).$$

Proof of Lemma 4.2

From Weyl's inequality (see, e.g., Bhatia (1997)), $\lambda_\alpha$ can be evaluated as
$$\max\Big\{\frac{\lambda_{\alpha+1}(C^TC)}{C_d}+\xi,\ \frac{\lambda_\alpha(C^TC)}{C_d}\Big\} \le \frac{\lambda_\alpha}{C_d} \le \frac{\lambda_\alpha(C^TC)}{C_d}+\xi \qquad (8.4)$$
for $\alpha=1,\ldots,K-1$, and $0\le\lambda_K/C_d\le\lambda_K(C^TC)/C_d+\xi=\xi$, where $\lambda_\alpha(C^TC)$ denotes the $\alpha$th largest eigenvalue of $C^TC$. In particular, it follows from (8.4) that
$$\frac{\lambda_{\alpha+1}}{C_d} \le \frac{\lambda_{\alpha+1}(C^TC)}{C_d}+\xi < \frac{\lambda_\alpha(C^TC)}{C_d} \le \frac{\lambda_\alpha}{C_d}$$
by Condition D. Therefore, $\lambda_\alpha/C_d$ is simple.

Proof of Theorem 4.1

Put $\Gamma_K = [\gamma_1,\ldots,\gamma_K]$, where $\gamma_\ell$ is the eigenvector of $C^TC/C_d + \xi(I_K-\Pi^{1/2}\mathbf 1_K\mathbf 1_K^T\Pi^{1/2})$ belonging to the $\ell$th largest eigenvalue. By Lemma 4.1, we obtain
$$\Gamma_K^T\frac{\hat C^T\hat C}{C_d}\Gamma_K = \mathrm{diag}\Big(\frac{\lambda_1}{C_d},\ldots,\frac{\lambda_K}{C_d}\Big)(1+o_P(1)).$$
Let $\hat H = [\hat h_1\ \cdots\ \hat h_K]$, where $\hat h_\ell$ is the eigenvector of $\Gamma_K^T\big(\hat C^T\hat C/C_d\big)\Gamma_K$ belonging to the $\ell$th largest eigenvalue. Since all eigenvalues $\lambda_\alpha$ (for $\alpha=1,\ldots,K-1$) are simple by Lemma 4.2, it follows that $\hat H\xrightarrow{P}I_K$. From the equation $\Gamma_K^T\big(\hat C^T\hat C/C_d\big)\Gamma_K\hat h_\ell = (\hat\lambda_\ell/C_d)\hat h_\ell$ we can see that
$$\Gamma_K^T\frac{\hat C^T\hat C}{C_d}\Gamma_K\hat h_\ell = \frac{\hat\lambda_\ell}{C_d}\hat h_\ell \;\Longrightarrow\; \frac{\hat C^T\hat C}{C_d}\big(\Gamma_K\hat h_\ell\big) = \frac{\hat\lambda_\ell}{C_d}\big(\Gamma_K\hat h_\ell\big) \;\Longrightarrow\; \frac{\hat C\hat C^T}{C_d}\Big\{\frac{\hat C\gamma_\ell}{\|\hat C\gamma_\ell\|}(1+o_P(1))\Big\} = \frac{\hat\lambda_\ell}{C_d}\Big\{\frac{\hat C\gamma_\ell}{\|\hat C\gamma_\ell\|}(1+o_P(1))\Big\}. \qquad (8.5)$$

On the other hand,
$$\frac{\hat C\hat C^T}{C_d}\hat p_\ell = \frac{\hat\lambda_\ell}{C_d}\hat p_\ell \qquad (8.6)$$
follows from the definition in Subsection 4.1. Now, from (8.5), (8.6) and Lemma 4.2, we conclude that the linear span of the $\hat p_\alpha$ is asymptotically equal to that of the $\hat C\gamma_\alpha/\|\hat C\gamma_\alpha\|$. Since eigenvectors have unit length, $\|\hat p_\alpha\|=1$ and $\mathrm{sgn}(\hat p_{\alpha1}) = \mathrm{sgn}\big((\hat C\gamma_\alpha/\|\hat C\gamma_\alpha\|)_1\big)$, where $(\cdot)_1$ denotes the first component of the vector. Therefore, we have
$$\hat p_\alpha = \frac{\hat C\gamma_\alpha}{\|\hat C\gamma_\alpha\|}(1+o_P(1)), \qquad \text{that is,}\qquad \hat p_\alpha^T\frac{\hat C\gamma_\alpha}{\|\hat C\gamma_\alpha\|} = 1+o_P(1).$$

Proof of Theorem 4.2

From Theorem 4.1 and (4.7), the inner product of $\hat p_\alpha$ and $p_\beta$ is given by
$$\hat p_\alpha^Tp_\beta = \frac{\gamma_\alpha^T\hat C^TC\gamma_\beta\,(1+o_P(1))}{\sqrt{\gamma_\alpha^T\hat C^T\hat C\gamma_\alpha}\sqrt{\gamma_\beta^TC^TC\gamma_\beta}} = \frac{\gamma_\alpha^T\Pi^{1/2}\hat M_0^TD^{-1}M_0\Pi^{1/2}\gamma_\beta\,(1+o_P(1))}{\sqrt{\gamma_\alpha^T\hat C^T\hat C\gamma_\alpha}\sqrt{\gamma_\beta^TC^TC\gamma_\beta}}. \qquad (8.7)$$
The numerator of (8.7) can be evaluated as
$$\gamma_\alpha^T\Pi^{1/2}\hat M_0^TD^{-1}M_0\Pi^{1/2}\gamma_\beta = \gamma_\alpha^T\Pi^{1/2}M_0^TD^{-1}M_0\Pi^{1/2}\gamma_\beta(1+o_P(1)) = \gamma_\alpha^TC^TC\gamma_\beta(1+o_P(1))$$
by Chebyshev's inequality. By Lemma 4.1, $\gamma_\alpha^T\hat C^T\hat C\gamma_\alpha$ of (8.7) becomes $\gamma_\alpha^T\hat C^T\hat C\gamma_\alpha = \lambda_\alpha(1+o_P(1))$. Notice that $\gamma_\beta^TC^TC\gamma_\beta$ of the denominator of (8.7) can be written as
$$\gamma_\beta^TC^TC\gamma_\beta = \gamma_\beta^T\Big\{\big(C^TC+C_d\xi(I_K-\Pi^{1/2}\mathbf 1_K\mathbf 1_K^T\Pi^{1/2})\big) - C_d\xi\big(I_K-\Pi^{1/2}\mathbf 1_K\mathbf 1_K^T\Pi^{1/2}\big)\Big\}\gamma_\beta = \lambda_\beta - C_d\xi\big(1-\gamma_\beta^T\Pi^{1/2}\mathbf 1_K\mathbf 1_K^T\Pi^{1/2}\gamma_\beta\big).$$
Therefore, we obtain
$$\hat p_\alpha^Tp_\beta = \frac{\kappa_\beta\delta_{\alpha\beta} - \xi(\delta_{\alpha\beta}-\eta_\alpha\eta_\beta)}{\sqrt{\kappa_\alpha}\sqrt{\kappa_\beta-\xi(1-\eta_\beta^2)}}(1+o_P(1)). \qquad (8.8)$$

Proof of Corollary 4.1

From (8.8) in the proof of Theorem 4.2, and the assumptions of Corollary 4.1, it follows that
$$\hat p_\alpha^Tp_\beta = \frac{\kappa_\beta\delta_{\alpha\beta}}{\sqrt{\kappa_\alpha}\sqrt{\kappa_\beta}}(1+o_P(1)) = \begin{cases}1+o_P(1) & \text{if }\alpha=\beta,\\ o_P(1) & \text{if }\alpha\ne\beta,\end{cases}$$
since $\xi\to0$.

Proof of Theorem 4.3

The inner product of $\hat b_\alpha$ and $b_\beta$ becomes
$$\hat b_\alpha^Tb_\beta = \frac{\hat p_\alpha^T\hat D^{-1/2}D^{-1/2}p_\beta}{\sqrt{\hat p_\alpha^T\hat D^{-1}\hat p_\alpha}\sqrt{p_\beta^TD^{-1}p_\beta}} = \frac{\gamma_\alpha^T\hat C^T\hat D^{-1/2}D^{-1/2}C\gamma_\beta}{\sqrt{\gamma_\alpha^T\hat C^T\hat D^{-1}\hat C\gamma_\alpha}\sqrt{\gamma_\beta^TC^TD^{-1}C\gamma_\beta}}\,(1+o_P(1)), \qquad (8.9)$$
using Theorem 4.1, (4.7) and (4.11). The numerator of (8.9) can be evaluated as
$$\gamma_\alpha^T\hat C^T\hat D^{-1/2}D^{-1/2}C\gamma_\beta = \gamma_\alpha^TC^TD^{-1}C\gamma_\beta(1+o_P(1)).$$
Using (8.1), $\hat C^T\hat D^{-1}\hat C$ of (8.9) is given by
$$\hat C^T\hat D^{-1}\hat C = C^TD^{-1}C + N^{1/2}E_0^T\hat D^{-2}E_0N^{1/2} + \mathbf 1_K\mathbf 1_K^T\,o(C_d).$$
Therefore, we have
$$
\begin{aligned}
\gamma_\alpha^T\hat C^T\hat D^{-1}\hat C\gamma_\alpha
&\le \gamma_\alpha^TC^TD^{-1}C\gamma_\alpha + \frac1{\sigma_{\min}}\,\gamma_\alpha^TN^{1/2}E_0^T\hat D^{-1}E_0N^{1/2}\gamma_\alpha(1+o_P(1)) + o(C_d)\\
&= \gamma_\alpha^TC^TD^{-1}C\gamma_\alpha\Bigg(1 + C_d\xi\,\frac1{\sigma_{\min}}\,\frac{1-\gamma_\alpha^T\Pi^{1/2}\mathbf 1_K\mathbf 1_K^T\Pi^{1/2}\gamma_\alpha}{\gamma_\alpha^TC^TD^{-1}C\gamma_\alpha}\,(1+o_P(1)) + o\Big(\frac{C_d}{\gamma_\alpha^TC^TD^{-1}C\gamma_\alpha}\Big)\Bigg)\\
&\le \gamma_\alpha^TC^TD^{-1}C\gamma_\alpha\Bigg(1 + C_d\xi\,\frac{\sigma_{\max}}{\sigma_{\min}}\,\frac{1-\gamma_\alpha^T\Pi^{1/2}\mathbf 1_K\mathbf 1_K^T\Pi^{1/2}\gamma_\alpha}{\gamma_\alpha^TC^TC\gamma_\alpha}\,(1+o_P(1)) + o\Big(\frac{C_d}{\gamma_\alpha^TC^TC\gamma_\alpha}\Big)\Bigg)\\
&= \gamma_\alpha^TC^TD^{-1}C\gamma_\alpha\Bigg(1 + C_d\xi\,\frac{\sigma_{\max}}{\sigma_{\min}}\,\frac{1-\eta_\alpha^2}{\lambda_\alpha-C_d\xi(1-\eta_\alpha^2)}\,(1+o_P(1)) + o\Big(\frac{C_d}{\lambda_\alpha-C_d\xi(1-\eta_\alpha^2)}\Big)\Bigg)\\
&= \gamma_\alpha^TC^TD^{-1}C\gamma_\alpha\,\frac{\kappa_\alpha-\xi(1-\eta_\alpha^2)(1-\sigma_{\max}/\sigma_{\min})}{\kappa_\alpha-\xi(1-\eta_\alpha^2)}\,(1+o_P(1)),
\end{aligned}
$$
where $\sigma_{\max}=\max_{1\le j\le d}\sigma_{jj}$ and $\sigma_{\min}=\min_{1\le j\le d}\sigma_{jj}$. Hence it follows that
$$\hat b_\alpha^Tb_\beta \ge b_\alpha^Tb_\beta\,\frac{\kappa_\alpha-\xi(1-\eta_\alpha^2)}{\kappa_\alpha-\xi(1-\eta_\alpha^2)(1-\sigma_{\max}/\sigma_{\min})}\,(1+o_P(1)). \qquad (8.10)$$
Similarly, we obtain
$$\hat b_\alpha^Tb_\beta \le b_\alpha^Tb_\beta\,\frac{\kappa_\alpha-\xi(1-\eta_\alpha^2)}{\kappa_\alpha-\xi(1-\eta_\alpha^2)(1-\sigma_{\min}/\sigma_{\max})}\,(1+o_P(1)). \qquad (8.11)$$

Proof of Corollary 4.2

From the assumption $d=o(nC_d)$, (8.10) and (8.11) can be evaluated as
$$\hat b_\alpha^Tb_\beta \ge b_\alpha^Tb_\beta\,\frac{\kappa_\alpha-\xi(1-\eta_\alpha^2)}{\kappa_\alpha-\xi(1-\eta_\alpha^2)(1-\sigma_{\max}/\sigma_{\min})}\,(1+o_P(1)) = b_\alpha^Tb_\beta(1+o_P(1))$$
and
$$\hat b_\alpha^Tb_\beta \le b_\alpha^Tb_\beta\,\frac{\kappa_\alpha-\xi(1-\eta_\alpha^2)}{\kappa_\alpha-\xi(1-\eta_\alpha^2)(1-\sigma_{\min}/\sigma_{\max})}\,(1+o_P(1)) = b_\alpha^Tb_\beta(1+o_P(1)),$$
respectively. Therefore, we have $\hat b_\alpha^Tb_\beta = b_\alpha^Tb_\beta(1+o_P(1))$.

Evaluation of the Misclassification Rate $W(\hat g,\theta)$

Suppose that the random vector $X$ belongs to $C_k$. The correct classification rate of $\hat g$ for class $C_k$ is defined as
$$\overline W_k(\hat g,\theta) = P\big(\hat g(X)=k \mid X_{\ell i},\ \ell=1,\ldots,K;\ i=1,\ldots,n_\ell\big) = P\big(\hat g(X)=k \mid \mathcal X\big),$$
where $\mathcal X$ denotes the set of training samples $\{X_{\ell i}\}$. We have
$$\overline W_k(\hat g,\theta) = P\Bigg(\bigcap_{\alpha\ne k}\Big\{\omega\in\Omega\ \Big|\ \Big(X(\omega)-\frac12(\hat\mu_k+\hat\mu_\alpha)\Big)^T\hat w_{k\alpha}>0\Big\}\ \Bigg|\ \mathcal X\Bigg) = P\Bigg(\bigcap_{\alpha\ne k}\Big\{\omega\in\Omega\ \Big|\ \hat\delta_{k\alpha}(X(\omega))>0\Big\}\ \Bigg|\ \mathcal X\Bigg),$$
where $\hat w_{k\alpha} = \hat B(\hat B^T\hat D\hat B)^{-1}\hat B^T(\hat\mu_k-\hat\mu_\alpha)$. We can easily see that
$$\hat\delta_{k\alpha}(X) \sim N\Bigg(\Big(\mu_k-\frac12(\hat\mu_k+\hat\mu_\alpha)\Big)^T\hat w_{k\alpha},\ \hat w_{k\alpha}^T\Sigma\hat w_{k\alpha}\Bigg), \qquad \alpha\ne k.$$

Therefore, $\overline W_k(\hat g,\theta)$ can be written as
$$\overline W_k(\hat g,\theta) = P\Bigg(\bigcap_{\alpha\ne k}\Big\{\omega\in\Omega\ \Big|\ \hat Z_{k\alpha}(\omega) > -\hat d_{k\alpha}\Big\}\ \Bigg|\ \mathcal X\Bigg),$$
where $\hat Z_{k\alpha} = \big(\hat\delta_{k\alpha}(X)-E[\hat\delta_{k\alpha}(X)]\big)\big/\sqrt{V[\hat\delta_{k\alpha}(X)]}\sim N(0,1)$ and
$$\hat d_{k\alpha} = \frac{E\big[\hat\delta_{k\alpha}(X)\big]}{\sqrt{V\big[\hat\delta_{k\alpha}(X)\big]}} = \frac{\big(\mu_k-(\hat\mu_k+\hat\mu_\alpha)/2\big)^T\hat B(\hat B^T\hat D\hat B)^{-1}\hat B^T(\hat\mu_k-\hat\mu_\alpha)}{\sqrt{(\hat\mu_k-\hat\mu_\alpha)^T\hat B(\hat B^T\hat D\hat B)^{-1}\hat B^T\Sigma\hat B(\hat B^T\hat D\hat B)^{-1}\hat B^T(\hat\mu_k-\hat\mu_\alpha)}}. \qquad (8.12)$$

Next, we evaluate the $(i,j)$th element of the covariance matrix of $(\hat Z_{k1},\ldots,\hat Z_{kK})^T$, where $i,j\in\{1,\ldots,K\}\setminus\{k\}$ and $i\ne j$. From $\hat\delta_{k\alpha}(X)-E[\hat\delta_{k\alpha}(X)] = (X-\mu_k)^T\hat w_{k\alpha}$, $\mathrm{Cov}(\hat Z_{ki},\hat Z_{kj})$ can be written as
$$\mathrm{Cov}(\hat Z_{ki},\hat Z_{kj}) = \frac{\hat w_{ki}^T\Sigma\hat w_{kj}}{\sqrt{\hat w_{ki}^T\Sigma\hat w_{ki}}\sqrt{\hat w_{kj}^T\Sigma\hat w_{kj}}}.$$
Therefore, the covariance matrix of $\hat Z_k = (\tilde Z_{k1},\ldots,\tilde Z_{k(K-1)})^T$ is
$$\hat\Sigma_k = \hat W_k^T\Sigma\hat W_k, \qquad (8.13)$$
where $\tilde Z_{k\alpha} = I(\alpha<k)\hat Z_{k\alpha} + I(\alpha\ge k)\hat Z_{k(\alpha+1)}$,
$$\hat W_k = \Bigg[\frac{\tilde w_{k1}}{\sqrt{\tilde w_{k1}^T\Sigma\tilde w_{k1}}}\ \cdots\ \frac{\tilde w_{k(K-1)}}{\sqrt{\tilde w_{k(K-1)}^T\Sigma\tilde w_{k(K-1)}}}\Bigg]$$
and $\tilde w_{k\alpha} = I(\alpha<k)\hat w_{k\alpha} + I(\alpha\ge k)\hat w_{k(\alpha+1)}$. Now consider the region
$$\hat D_k = \Big\{z\in\mathbb R^{K-1}\ \Big|\ z_\alpha < \tilde d_{k\alpha},\ \alpha\in\{1,\ldots,K-1\}\Big\},$$
where $\tilde d_{k\alpha} = I(\alpha<k)\hat d_{k\alpha} + I(\alpha\ge k)\hat d_{k(\alpha+1)}$. Since $-\hat Z_k$ is also distributed as $N_{K-1}(0,\hat\Sigma_k)$, the correct classification probability can be obtained as
$$\overline W_k(\hat g,\theta) = P\Bigg(\bigcap_{\alpha=1}^{K-1}\Big\{\omega\in\Omega\ \Big|\ -\tilde Z_{k\alpha}(\omega)<\tilde d_{k\alpha}\Big\}\ \Bigg|\ \mathcal X\Bigg) = \int_{\hat D_k}\frac1{\sqrt{|2\pi\hat\Sigma_k|}}\exp\Big(-\frac12z^T\hat\Sigma_k^{-1}z\Big)\,dz = \Phi_{K-1}\big(\hat D_k;0,\hat\Sigma_k\big).$$
Therefore, the misclassification rate of $\hat g$ for class $C_k$ becomes
$$W_k(\hat g,\theta) = 1-\overline W_k(\hat g,\theta) = 1-\Phi_{K-1}\big(\hat D_k;0,\hat\Sigma_k\big).$$

Proof of Theorem 4.4

By Theorem 4.1, $\hat B$ is given by
$$\hat B = \hat D^{-1/2}\hat P = \hat D^{-1}\hat M_0N^{1/2}\Gamma\hat L^{-1}(1+o_P(1)),$$
where $\hat L = \mathrm{diag}\big(\|\hat C\gamma_1\|,\ldots,\|\hat C\gamma_{K-1}\|\big)$. Using $\hat D = D(1+o_P(1))$, (8.12) can be evaluated as
$$
\begin{aligned}
\hat d_{k\alpha} &= \frac{\big(\mu_k-(\hat\mu_k+\hat\mu_\alpha)/2\big)^T\hat B(\hat B^T\hat D\hat B)^{-1}\hat B^T(\hat\mu_k-\hat\mu_\alpha)}{\sqrt{(\hat\mu_k-\hat\mu_\alpha)^T\hat B(\hat B^T\hat D\hat B)^{-1}\hat B^T\Sigma\hat B(\hat B^T\hat D\hat B)^{-1}\hat B^T(\hat\mu_k-\hat\mu_\alpha)}}\\
&\ge \frac1{\sqrt{\lambda_{\max}(R)}}\,\frac{I_1N^{1/2}\Gamma\big(\Gamma^TN^{1/2}I_2N^{1/2}\Gamma\big)^{-1}\Gamma^TN^{1/2}I_3^T}{\sqrt{I_3N^{1/2}\Gamma\big(\Gamma^TN^{1/2}I_2N^{1/2}\Gamma\big)^{-1}\Gamma^TN^{1/2}I_3^T}}\,(1+o_P(1)),
\end{aligned}
$$

where $I_1 = \big(\mu_k-(\hat\mu_k+\hat\mu_\alpha)/2\big)^TD^{-1}\hat M_0$, $I_2 = \hat M_0^TD^{-1}\hat M_0$ and $I_3 = (\hat\mu_k-\hat\mu_\alpha)^TD^{-1}\hat M_0$. We first calculate $I_3$. Note that $I_3$ can be decomposed as
$$I_3 = (\hat\mu_k-\hat\mu_\alpha)^TD^{-1}\hat M_0 = \Big[(\hat\mu_k-\hat\mu_\alpha)^TD^{-1}(\hat\mu_1-\hat\mu),\ \ldots,\ (\hat\mu_k-\hat\mu_\alpha)^TD^{-1}(\hat\mu_K-\hat\mu)\Big]. \qquad (8.14)$$
From Condition A, a typical component of (8.14) can be expressed as
$$(\hat\mu_k-\hat\mu_\alpha)^TD^{-1}(\hat\mu_\ell-\hat\mu) = \sum_{h=1}^K\frac{n_h}n\Big[(\mu_k-\mu_\alpha)^TD^{-1}(\mu_\ell-\mu_h) + (\bar\varepsilon_k-\bar\varepsilon_\alpha)^TD^{-1}(\mu_\ell-\mu_h) + (\mu_k-\mu_\alpha)^TD^{-1}(\bar\varepsilon_\ell-\bar\varepsilon_h) + (\bar\varepsilon_k-\bar\varepsilon_\alpha)^TD^{-1}(\bar\varepsilon_\ell-\bar\varepsilon_h)\Big].$$
Then we have
$$(\bar\varepsilon_k-\bar\varepsilon_\alpha)^TD^{-1}(\mu_\ell-\mu_h) = o_P\big((\mu_\ell-\mu_h)^TD^{-1}(\mu_\ell-\mu_h)\big), \qquad (\mu_k-\mu_\alpha)^TD^{-1}(\bar\varepsilon_\ell-\bar\varepsilon_h) = o_P\big((\mu_k-\mu_\alpha)^TD^{-1}(\mu_k-\mu_\alpha)\big)$$
by p. 2625 of Fan and Fan (2008). Next we examine $\sum_{h=1}^K(n_h/n)(\bar\varepsilon_k-\bar\varepsilon_\alpha)^TD^{-1}(\bar\varepsilon_\ell-\bar\varepsilon_h)$, which can be written as
$$\sum_{h=1}^K\frac{n_h}n(\bar\varepsilon_k-\bar\varepsilon_\alpha)^TD^{-1}(\bar\varepsilon_\ell-\bar\varepsilon_h) = \bar\varepsilon_k^TD^{-1}\bar\varepsilon_\ell - \bar\varepsilon_k^TD^{-1}\bar\varepsilon - \bar\varepsilon_\alpha^TD^{-1}\bar\varepsilon_\ell + \bar\varepsilon_\alpha^TD^{-1}\bar\varepsilon.$$
By an argument similar to that given on p. 2627 of Fan and Fan (2008), we obtain
$$\sum_{h=1}^K\frac{n_h}n(\bar\varepsilon_k-\bar\varepsilon_\alpha)^TD^{-1}(\bar\varepsilon_\ell-\bar\varepsilon_h) = \begin{cases}\dfrac d{n_k}+o_P\Big(\dfrac{\sqrt d}n\Big) & \text{if }\ell=k,\\[2ex] -\dfrac d{n_\alpha}+o_P\Big(\dfrac{\sqrt d}n\Big) & \text{if }\ell=\alpha,\\[2ex] o_P\Big(\dfrac{\sqrt d}n\Big) & \text{otherwise}.\end{cases}$$

We also need to evaluate the asymptotic order of $(\mu_k-\mu_\alpha)^TD^{-1}M_0$, which can be written as
$$(\mu_k-\mu_\alpha)^TD^{-1}M_0 = \Bigg[\sum_{\ell=1}^K\pi_\ell(\mu_k-\mu_\alpha)^TD^{-1}(\mu_1-\mu_\ell),\ \ldots,\ \sum_{\ell=1}^K\pi_\ell(\mu_k-\mu_\alpha)^TD^{-1}(\mu_K-\mu_\ell)\Bigg] = \mathbf 1_K^T\Pi F,$$
where $F = [f_1\ \cdots\ f_K] = \big((\mu_k-\mu_\alpha)^TD^{-1}(\mu_i-\mu_j)\big)_{1\le i,j\le K}$. Using Condition E and Condition F, the $\ell$th component of $(\mu_k-\mu_\alpha)^TD^{-1}M_0$ has the following form:
$$\mathbf 1_K^T\Pi f_\ell = \begin{cases} C_d\Bigg(\displaystyle\sum_{h\ne k}\sqrt{\pi_h}\,\frac{\mu_k^TD^{-1}\mu_k}{C_d} - \sqrt{\pi_\alpha}\,\frac{\mu_\alpha^TD^{-1}\mu_\alpha}{C_d} + \sum_{\beta\ne k}\frac{c_{\beta k}}{C_d}\Bigg) & \text{if }\ell=k,\\[3ex] C_d\Bigg(-\sqrt{\pi_k}\,\frac{\mu_k^TD^{-1}\mu_k}{C_d} + \displaystyle\sum_{h\ne\alpha}\sqrt{\pi_h}\,\frac{\mu_\alpha^TD^{-1}\mu_\alpha}{C_d} + \sum_{\beta\ne\alpha}\frac{c_{\beta\alpha}}{C_d}\Bigg) & \text{if }\ell=\alpha,\\[3ex] C_d\Bigg(\sqrt{\pi_k}\,\frac{\mu_k^TD^{-1}\mu_k}{C_d} - \sqrt{\pi_\alpha}\,\frac{\mu_\alpha^TD^{-1}\mu_\alpha}{C_d} + \displaystyle\sum_{\beta\ne\ell}\frac{c_{\beta\ell}}{C_d}\Bigg) & \text{otherwise}\end{cases}\quad = O(C_d),$$
where $c_{\beta\ell} = O(C_d^{\zeta_{\beta\ell}})$ and $\zeta_{\beta\ell}\in(0,1)$ for all $\beta,\ell$. Therefore, we have $(\mu_k-\mu_\alpha)^TD^{-1}M_0 = \mathbf 1_K^T\Pi F = O(C_d)\mathbf 1_K^T$. Using the above calculations, we have

$$
\begin{aligned}
(\hat\mu_k-\hat\mu_\alpha)^TD^{-1}(\hat\mu_\ell-\hat\mu)
={}& (\mu_k-\mu_\alpha)^TD^{-1}(\mu_\ell-\mu)(1+o_P(1)) + \sum_{h=1}^K\frac{n_h}n(\bar\varepsilon_k-\bar\varepsilon_\alpha)^TD^{-1}(\bar\varepsilon_\ell-\bar\varepsilon_h)\\
&+ o_P\Big(\max_{h\in\{1,\ldots,K\}}\big\{(\mu_\ell-\mu_h)^TD^{-1}(\mu_\ell-\mu_h),\,(\mu_k-\mu_\alpha)^TD^{-1}(\mu_k-\mu_\alpha)\big\}\Big)\\
={}& \begin{cases}\Big\{(\mu_k-\mu_\alpha)^TD^{-1}(\mu_\ell-\mu)+\dfrac d{n_k}\Big\}(1+o_P(1)) & \text{if }\ell=k,\\[2ex] \Big\{(\mu_k-\mu_\alpha)^TD^{-1}(\mu_\ell-\mu)-\dfrac d{n_\alpha}\Big\}(1+o_P(1)) & \text{if }\ell=\alpha,\\[2ex] (\mu_k-\mu_\alpha)^TD^{-1}(\mu_\ell-\mu)(1+o_P(1)) & \text{otherwise},\end{cases}
\end{aligned}
$$
by Condition D. Thus, it follows that
$$I_3 = \Big((\mu_k-\mu_\alpha)^TD^{-1}M_0 + \beta\Big)(1+o_P(1)),$$

where $\beta = (0,\ldots,0,\,d/n_k,\,0,\ldots,0,\,-d/n_\alpha,\,0,\ldots,0)$. Next, we consider $I_1$. We find that
$$I_1 = \Big(\mu_k-\frac12(\hat\mu_k+\hat\mu_\alpha)\Big)^TD^{-1}\hat M_0 = -\bar\varepsilon_k^TD^{-1}\hat M_0 + \frac12(\hat\mu_k-\hat\mu_\alpha)^TD^{-1}\hat M_0. \qquad (8.15)$$
Similarly, the $\ell$th component of (8.15) becomes
$$
\begin{aligned}
-\bar\varepsilon_k^TD^{-1}(\hat\mu_\ell-\hat\mu) + \frac12(\hat\mu_k-\hat\mu_\alpha)^TD^{-1}(\hat\mu_\ell-\hat\mu)
&= \begin{cases}\Big[\dfrac dn\Big(1-\dfrac n{n_k}\Big)+\dfrac12\Big\{(\mu_k-\mu_\alpha)^TD^{-1}(\mu_\ell-\mu)+\dfrac d{n_k}\Big\}\Big](1+o_P(1)) & \text{if }\ell=k,\\[2ex] \Big[\dfrac dn+\dfrac12\Big\{(\mu_k-\mu_\alpha)^TD^{-1}(\mu_\ell-\mu)-\dfrac d{n_\alpha}\Big\}\Big](1+o_P(1)) & \text{if }\ell=\alpha,\\[2ex] \Big[\dfrac dn+\dfrac12(\mu_k-\mu_\alpha)^TD^{-1}(\mu_\ell-\mu)\Big](1+o_P(1)) & \text{otherwise},\end{cases}\\
&= \begin{cases}\Big[\dfrac12(\mu_k-\mu_\alpha)^TD^{-1}(\mu_\ell-\mu)+\dfrac dn\Big(1-\dfrac n{2n_\ell}\Big)\Big](1+o_P(1)) & \text{if }\ell=k,\alpha,\\[2ex] \Big[\dfrac12(\mu_k-\mu_\alpha)^TD^{-1}(\mu_\ell-\mu)+\dfrac dn\Big](1+o_P(1)) & \text{otherwise}.\end{cases}
\end{aligned}
$$

Therefore, we have
$$I_1 = \Big[\frac12(\mu_k-\mu_\alpha)^TD^{-1}M_0 + \frac dn\,\boldsymbol\alpha\Big](1+o_P(1)),$$
where
$$\boldsymbol\alpha = \Big(1,\ldots,1,\,1-\frac n{2n_k},\,1,\ldots,1,\,1-\frac n{2n_\alpha},\,1,\ldots,1\Big).$$

Finally, we consider $I_2$. It can be written as
$$I_2 = \hat M_0^TD^{-1}\hat M_0 = \Big((\hat\mu_\alpha-\hat\mu)^TD^{-1}(\hat\mu_\beta-\hat\mu)\Big)_{1\le\alpha,\beta\le K}. \qquad (8.16)$$
Each component of (8.16) can be decomposed as
$$(\hat\mu_\alpha-\hat\mu)^TD^{-1}(\hat\mu_\beta-\hat\mu) = \Big\{(\mu_\alpha-\mu)^TD^{-1}(\mu_\beta-\mu) + J_1 + J_2\Big\}(1+o_P(1)) + J_3, \qquad (8.17)$$
where $J_1 = (\bar\varepsilon_\alpha-\bar\varepsilon)^TD^{-1}(\mu_\beta-\mu)$, $J_2 = (\mu_\alpha-\mu)^TD^{-1}(\bar\varepsilon_\beta-\bar\varepsilon)$ and $J_3 = (\bar\varepsilon_\alpha-\bar\varepsilon)^TD^{-1}(\bar\varepsilon_\beta-\bar\varepsilon)$. From calculations similar to those carried out in the derivation of $I_1$ and $I_3$, we get
$$J_1 = o_P\big((\mu_\beta-\mu)^TD^{-1}(\mu_\beta-\mu)\big), \qquad J_2 = o_P\big((\mu_\alpha-\mu)^TD^{-1}(\mu_\alpha-\mu)\big),$$
$$J_3 = (\bar\varepsilon_\alpha-\bar\varepsilon)^TD^{-1}(\bar\varepsilon_\beta-\bar\varepsilon) = \begin{cases}\dfrac dn\Big(\dfrac n{n_\alpha}-1\Big)+o_P\Big(\dfrac{\sqrt d}n\Big) & \text{if }\alpha=\beta,\\[2ex] -\dfrac dn+o_P\Big(\dfrac{\sqrt d}n\Big) & \text{if }\alpha\ne\beta.\end{cases}$$
Consequently, (8.17) results in
$$(\hat\mu_\alpha-\hat\mu)^TD^{-1}(\hat\mu_\beta-\hat\mu) = \begin{cases}\Big\{(\mu_\alpha-\mu)^TD^{-1}(\mu_\alpha-\mu)+\dfrac dn\Big(\dfrac n{n_\alpha}-1\Big)\Big\}(1+o_P(1)) & \text{if }\alpha=\beta,\\[2ex] \Big\{(\mu_\alpha-\mu)^TD^{-1}(\mu_\beta-\mu)-\dfrac dn\Big\}(1+o_P(1)) & \text{if }\alpha\ne\beta.\end{cases}$$
Therefore, we have
$$I_2 = \Big\{M_0^TD^{-1}M_0 + \frac dn\big(N^{-1}-\mathbf 1_K\mathbf 1_K^T\big)\Big\}(1+o_P(1)).$$

In summary, the components of $\hat d_{k\alpha}$ can be evaluated as
$$
\begin{aligned}
I_1N^{1/2} &= \Big[\frac12(\mu_k-\mu_\alpha)^TD^{-1}M_0 + \frac dn\,\boldsymbol\alpha\Big]N^{1/2}(1+o_P(1)) = \Big[\frac12M + \frac dn\,s\,\Pi^{1/2}\Big](1+o_P(1)) = S(1+o_P(1)),\\
N^{1/2}I_2N^{1/2} &= N^{1/2}\Big[M_0^TD^{-1}M_0 + \frac dn\big(N^{-1}-\mathbf 1_K\mathbf 1_K^T\big)\Big]N^{1/2}(1+o_P(1)) = \Big[C^TC + \frac dn\big(I_K-\Pi^{1/2}\mathbf 1_K\mathbf 1_K^T\Pi^{1/2}\big)\Big](1+o_P(1)),\\
I_3N^{1/2} &= \Big[(\mu_k-\mu_\alpha)^TD^{-1}M_0 + \beta\Big]N^{1/2}(1+o_P(1)) = \Big[M + \frac dn\,q\,\Pi^{1/2}\Big](1+o_P(1)) = Q(1+o_P(1)),
\end{aligned}
$$
since $N = \Pi(1+o_P(1))$. Therefore, we have
$$\hat d_{k\alpha} \ge \frac{S\Gamma\Big[\Gamma^T\big\{C^TC+(d/n)\big(I_K-\Pi^{1/2}\mathbf 1_K\mathbf 1_K^T\Pi^{1/2}\big)\big\}\Gamma\Big]^{-1}\Gamma^TQ^T\,(1+o_P(1))}{\sqrt{\lambda_{\max}(R)}\,\sqrt{Q\Gamma\Big[\Gamma^T\big\{C^TC+(d/n)\big(I_K-\Pi^{1/2}\mathbf 1_K\mathbf 1_K^T\Pi^{1/2}\big)\big\}\Gamma\Big]^{-1}\Gamma^TQ^T}}. \qquad (8.18)$$

This completes the proof of Theorem 4.4.

Proof of Corollary 4.3

Using $M = \mathbf 1_K^TO(C_d)$ and $C^TC = \mathbf 1_K\mathbf 1_K^TO(C_d)$, $S$ of (8.18) becomes
$$S = \frac M2 + C_d\Big(\frac d{nC_d}\Big)s\,\Pi^{1/2} = \frac M2 + \mathbf 1_K^T\,o(C_d) = \frac M2(1+o_P(1)).$$
Similarly, we can obtain $Q = M(1+o_P(1))$ and $U = C^TC(1+o_P(1))$.

Proof of Theorem 4.6

The right side of (4.17) can be written as
$$
\begin{aligned}
&1-\Phi\Bigg(\frac{\sqrt{n_1n_2/(dn)}\,\alpha^TD^{-1}\alpha\,(1+o_P(1)) + (n_1-n_2)\sqrt{d/(nn_1n_2)}}{2\sqrt{\lambda_{\max}(R)}\sqrt{1+n_1n_2\,\alpha^TD^{-1}\alpha\,(1+o_P(1))/(dn)}}\Bigg)\\
&= 1-\Phi\Bigg(\frac{\sqrt{n_1n_2/(dn)}\,\alpha^TD^{-1}\alpha\,\big[(1+o_P(1)) + \{d/(n\alpha^TD^{-1}\alpha)\}\{(n_1-n_2)/n_1\}\big]}{2\sqrt{\lambda_{\max}(R)}\sqrt{n_1n_2\,\alpha^TD^{-1}\alpha/(dn)}\,\sqrt{(n/n_1)\{d/(n_2\alpha^TD^{-1}\alpha)\}+(1+o_P(1))}}\Bigg)\\
&= 1-\Phi\Bigg(\frac{\sqrt{n_1n_2/(dn)}\,\alpha^TD^{-1}\alpha\,\big[(1+o_P(1)) + o(1)O(1)\big]}{2\sqrt{\lambda_{\max}(R)}\sqrt{n_1n_2/(dn)}\,\sqrt{\alpha^TD^{-1}\alpha}\,\sqrt{O(1)o(1)+(1+o_P(1))}}\Bigg)\\
&= 1-\Phi\Bigg(\frac{\sqrt{\alpha^TD^{-1}\alpha}}{2\sqrt{\lambda_{\max}(R)}}(1+o_P(1))\Bigg)
\end{aligned}
$$
by the assumption $d=o(nC_d)$. Therefore, we have
$$W_1(\hat g) = \max_{\theta\in\Theta}W_1(\hat g,\theta) = 1-\Phi\Bigg(\frac{\sqrt{C_d}}{2\sqrt{b_0}}(1+o_P(1))\Bigg).$$

Proof of Corollary 4.4

We derive (4.18) as follows:
$$
\begin{aligned}
&1-\Phi\Bigg(\frac{\sqrt{n_1n_2/(dn)}\,\alpha^TD^{-1}\alpha\,(1+o_P(1)) + (n_1-n_2)\sqrt{d/(nn_1n_2)}}{2\sqrt{\lambda_{\max}(R)}\sqrt{1+n_1n_2\,\alpha^TD^{-1}\alpha\,(1+o_P(1))/(dn)}}\Bigg)\\
&= 1-\Phi\Bigg(\frac{\sqrt{\alpha^TD^{-1}\alpha}\,\big[(1+o_P(1)) + \{d/(n_2C_d)\}(C_d/\alpha^TD^{-1}\alpha)(1-n_2/n_1)\big]}{2\sqrt{\lambda_{\max}(R)}\sqrt{(n/n_1)\{d/(n_2\alpha^TD^{-1}\alpha)\}+(1+o_P(1))}}\Bigg)\\
&= 1-\Phi\Bigg(\frac{\sqrt{\alpha^TD^{-1}\alpha}\,\big[(1+o_P(1)) + \{d/(n_2C_d)\}(C_d/\alpha^TD^{-1}\alpha)\{1-(c_0+o(1))\}\big]}{2\sqrt{\lambda_{\max}(R)}\sqrt{(n/n_1)\{d/(n_2\alpha^TD^{-1}\alpha)\}+(1+o_P(1))}}\Bigg)\\
&\ge 1-\Phi\Bigg(\frac{\sqrt{\alpha^TD^{-1}\alpha}\,\big[(1+o_P(1)) + \{d/(n_2C_d)\}(C_d/\alpha^TD^{-1}\alpha)\,o(1)\big]}{2\sqrt{\lambda_{\max}(R)}\sqrt{(n/n_1)\{d/(n_2\alpha^TD^{-1}\alpha)\}+(1+o_P(1))}}\Bigg)\\
&= 1-\Phi\Bigg(\frac{\sqrt{\alpha^TD^{-1}\alpha}\,\{(1+o_P(1)) + O(1)O(1)o(1)\}}{2\sqrt{\lambda_{\max}(R)}\sqrt{O(1)O(1)O(1)+(1+o_P(1))}}\Bigg)\\
&> 1-\Phi\Bigg(\frac{\sqrt{\alpha^TD^{-1}\alpha}}{2\sqrt{\lambda_{\max}(R)}}(1+o_P(1))\Bigg).
\end{aligned}
$$

9 Conclusion

In this paper, we discussed the asymptotic theory of the multi-class linear discriminant function in a hdlss context. In Section 3, we constructed the linear discriminant function based on naive canonical correlation in the multi-class setting. In Section 4, we derived the asymptotic behavior of the eigenvectors of the naive canonical correlation matrix corresponding to positive eigenvalues. In this asymptotic theory, both the dimension d and the sample size n grow, and provided d does not grow too fast, we showed that all eigenvectors and discriminant directions are hdlss consistent. Under suitable conditions, we were able to derive an upper bound for the worst-case misclassification rate in the multi-class setting. In Section 5, we proposed a feature selection method for hdlss data, called the nacc approach, which uses a discriminant direction. Further, for the general multi-class setting, we proposed and discussed two methods for feature selection, called m-nacc and m-fair, which extend their respective two-class analogues. If the variance is large relative to the difference between the means, nacc and m-nacc performed better than fair and m-fair, respectively, as illustrated in Subsection 6.4. Applications to real data sets demonstrate that nacc and m-nacc perform well.

In recent years, other discriminant methods for hdlss data have been studied by many authors: Marron et al. (2007) proposed Distance-Weighted Discrimination (DWD), which improves the performance of the Support Vector Machine (SVM) in the hdlss setting; Fan et al. (2012) proposed the Regularized Optimal Affine Discriminant (ROAD), which uses an L1-constraint in Fisher's criterion (2.3); and Ahn et al. (2012) proposed a hierarchical clustering algorithm based on the MDP distance, following Ahn and Marron (2010).

Other possible research directions include extensions of our theoretical results to the "kernel method" in linear discrimination described in Mika et al. (1999).

Our approach exploits the naive Bayes rule and replaces $\hat\Sigma^{-1}$ by the diagonal matrix $\hat D^{-1}$. On the other hand, replacing $\hat D^{-1}$ by a certain type of band matrix could also yield efficient linear discriminant functions in a hdlss setting. Such discriminant functions are of interest in practice, especially when relevant correlation information between the observations is lost in the replacement of $\hat\Sigma^{-1}$ by the diagonal matrix $\hat D^{-1}$. Theoretical research on $k_0$-banded matrices has been carried out by Bickel and Levina (2008), and their results are expected to apply to linear discriminant functions in hdlss settings. Furthermore, we can explore issues of discriminant functions based on an invertible $k_0$-banded matrix: the asymptotic behavior of the misclassification rate, selection criteria for $k_0$, and algorithms for preprocessing the correlation structure of hdlss data before discrimination.
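The banding operation referred to above is simple to state in code. The following is a minimal sketch of $k_0$-banding of a sample covariance matrix in the spirit of Bickel and Levina (2008); the choice of $k_0$ and any classifier built on top of the banded estimate are outside the scope of this sketch, and the data below are synthetic placeholders.

```python
import numpy as np

def band_covariance(S, k0):
    """Return the k0-banded version of a sample covariance matrix S:
    entries with |i - j| > k0 are set to zero (Bickel-Levina-type banding).
    """
    d = S.shape[0]
    i, j = np.indices((d, d))
    return np.where(np.abs(i - j) <= k0, S, 0.0)

# Hypothetical example: band a sample covariance estimated from n << d data.
rng = np.random.default_rng(0)
X = rng.standard_normal((30, 100))      # n = 30 observations, d = 100 variables
S = np.cov(X, rowvar=False)             # d x d sample covariance
S_banded = band_covariance(S, k0=5)     # keep only covariances within 5 lags
print(np.count_nonzero(S_banded), "nonzero entries out of", S.size)
```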

Acknowledgements

First and foremost, I would like to thank my supervisor Prof. Kanta Naito for all of his help and encouragement during my bachelor's, master's and doctoral courses. I would also like to thank Prof. Inge Koch for her many valuable comments and suggestions. Prof. Toshihiro Nakanishi and Prof. Daishi Kuroiwa gave me helpful comments on the draft of this thesis.

This work was supported by a Grant-in-Aid for Japan Society for the Promotion of Science (JSPS) Fellows. I am grateful to the JSPS for making this study possible through its financial support.

Finally, I would like to thank my father and mother for their understanding, support and encouragement during this study.

References

Ahn, J., Lee, M. H. and Yoon, Y. J. (2012). Clustering high dimension, low sample size data using the maximal data piling distance, Statistica Sinica, Vol. 22, p. 443.

Ahn, J. and Marron, J. S. (2010). The maximal data piling direction for discrimination, Biometrika, Vol. 97, pp. 254–259.

Ahn, J., Marron, J. S., Muller, K. M. and Chi, Y.-Y. (2007). The high-dimension, low-sample-size geometric representation holds under mild conditions, Biometrika, Vol. 94, pp. 760–766.

Aoshima, M. and Yata, K. (2011). Two-stage procedures for high-dimensional data, Sequential Analysis, Vol. 30, pp. 356–399.

Bhatia, R. (1997). Matrix Analysis: Springer, New York.

Bickel, P. J. and Levina, E. (2004). Some theory for Fisher's linear discriminant function, "naive Bayes" and some alternatives when there are many more variables than observations, Bernoulli, Vol. 10, pp. 989–1010.

Bickel, P. J. and Levina, E. (2008). Regularized estimation of large covariance matrices, The Annals of Statistics, Vol. 36, pp. 199–227.

Bishop, C. M. and Nasrabadi, N. M. (2006). Pattern Recognition and Machine Learning, Vol. 1: Springer, New York.

Devroye, L., Györfi, L. and Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition: Springer.

Ding, C. and Peng, H. (2005). Minimum redundancy feature selection from microarray gene expression data, Journal of Bioinformatics and Computational Biology, Vol. 3, pp. 185–205.

Duda, R. O., Hart, P. E. and Stork, D. G. (2012). Pattern Classification: John Wiley & Sons.

Dudoit, S., Fridlyand, J. and Speed, T. P. (2002). Comparison of discrimination methods for the classification of tumors using gene expression data, Journal of the American Statistical Association, Vol. 97, pp. 77–87.

Fan, J. and Fan, Y. (2008). High-dimensional classification using features annealed independence rules, The Annals of Statistics, Vol. 36, pp. 2605–2637.

Fan, J., Feng, Y. and Tong, X. (2012). A road to classification in high dimensional space: the regularized optimal affine discriminant, Journal of the Royal Statistical Society: Series B (Statistical Methodology), Vol. 74, pp. 745–771.

Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems, Annals of Eugenics, Vol. 7, pp. 179–188.

Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., Bloomfield, C. D. and Lander, E. S. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring, Science, Vol. 286, pp. 531–537.

Gordon, G. J., Jensen, R. V., Hsiao, L., Gullans, S. R., Blumenstock, J. E., Ramaswamy, S., Richards, W. G., Sugarbaker, D. J. and Bueno, R. (2002). Translation of microarray data into clinically relevant cancer diagnostic tests using gene expression ratios in lung cancer and mesothelioma, Cancer Research, Vol. 62, pp. 4963–4967.

Hastie, T., Tibshirani, R. and Friedman, J. (2001). The Elements of Statistical Learning: Springer.

Johnstone, I. M. (2001). On the distribution of the largest principal component, The Annals of Statistics, Vol. 29, pp. 295–327.

Jung, S. and Marron, J. S. (2009). PCA consistency in high dimension, low sample size context, The Annals of Statistics, Vol. 37, pp. 4104–4130.

Khan, J., Wei, J., Ringnér, M., Saal, L. H., Ladanyi, M., Westermann, F., Berthold, F., Schwab, M., Antonescu, C. R., Peterson, C. and Meltzer, P. S. (2001). Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks, Nat. Med., Vol. 7, pp. 673–679.

Koch, I. and Naito, K. (2010). Prediction of multivariate responses with a selected number of principal components, Computational Statistics & Data Analysis, Vol. 54, pp. 1791–1807.

Mardia, K. V., Kent, J. and Bibby, J. (1979). Multivariate Analysis: Academic Press.

Marron, J. S., Todd, M. J. and Ahn, J. (2007). Distance-weighted discrimination, Journal of the American Statistical Association, Vol. 102, pp. 1267–1271.

Mika, S., Rätsch, G., Weston, J. and Schölkopf, B. (1999). Fisher discriminant analysis with kernels, IEEE Neural Networks for Signal Processing Workshop, pp. 41–48.

Peng, H., Long, F. and Ding, C. (2005). Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy, Pattern Analysis and Machine Intelligence, IEEE Transactions on, Vol. 27, pp. 1226–1238.

Rao, C. R. (1948). The utilization of multiple measurements in problems of biological classification, Journal of the Royal Statistical Society. Series B (Methodological), Vol. 10, pp. 159–203.

Saeys, Y., Inza, I. and Larranaga, P. (2007). A review of feature selection techniques in bioinformatics, Bioinformatics, Vol. 23, pp. 2507–2517.

Schott, J. (1996). Matrix Analysis for Statistics: New York: Wiley.

Singh, D., Febbo, P. G., Ross, K., Jackson, D. G., Manola, J., Ladd, C., Tamayo, P., Renshaw, A. A., D'Amico, A. V., Richie, J. P., Lander, E. S., Loda, M., Kantoff, P. W.,
