The last example deals with the Isolet data studied in Weinberger et al. (2006), which are available at http://archive.ics.uci.edu/ml/datasets/ISOLET. The data have 26 classes corresponding to the letters of the alphabet. There are 617 features and a total
Table 7.2: Results for the SRBCT data. Numbers in bold show common selected genes in each approach.

selected genes

m-fair: 1955 2050 1954 1194 1158 174 1003 1389 246 107 1645 951 1980
m-nacc: 1955 481 1158 1954 1194 1888 951 879 1003 174 1389 246
mrmr:   509 107 867 879 1708 1955 2050 1194 246 742 1003 1389 819 851 338 368 1706 1319 2 545

                        m-fair   m-nacc   mrmr   m-dlda
No. of selected genes   13       12       20     2308
Training error          0/63     0/63     0/63   1/63
Test error              2/20     3/20     2/20   5/20
of 7797 samples: 6238 training samples (238 samples from class 6, and 240 samples from each of the other classes), and 1559 test samples (59 samples from class 13, and 60 samples from each of the other classes).
As our setting is that of hdlss, we randomly picked 20 samples from each class of the training data. We applied each of the four methods considered in Subsection 7.2 to the 520 samples of the new training data, and evaluated their performance on the full test data consisting of 1559 samples. We repeated this procedure 100 times. Boxplots of the test errors and of the number of selected features are shown in Figure 7.1. The left panel of Figure 7.1 shows the number of selected features; m-fair selects fewer features than the other approaches. However, the right panel shows an unstable trend in the misclassification rate of m-fair, which could be a consequence of its small number of selected features. On the other hand, m-nacc and mrmr select about the same number of features, but the test error of m-nacc is smaller than that of mrmr and m-fair. The detailed values of the boxplots are summarized in Table 7.3. From these values we can see that the average test error of m-nacc is almost equal to that of m-dlda, but m-nacc obtains the same accuracy with only one-third of the number of features.
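As a concrete illustration of this evaluation protocol, the sketch below repeats a subsample-train-evaluate loop with a diagonal LDA classifier in the spirit of the m-dlda baseline. The data generator, class counts, and all constants are illustrative assumptions, not the actual Isolet setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def dlda_fit(X, y):
    """Diagonal LDA: per-class means plus pooled per-feature variances."""
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    resid = np.concatenate([X[y == c] - means[i] for i, c in enumerate(classes)])
    var = resid.var(axis=0) + 1e-12  # guard against zero variance
    return classes, means, var

def dlda_predict(model, X):
    classes, means, var = model
    # squared distance in the metric of the diagonal covariance estimate
    d2 = ((X[:, None, :] - means[None, :, :]) ** 2 / var).sum(axis=2)
    return classes[np.argmin(d2, axis=1)]

# Synthetic stand-in for the protocol: K classes, d features, small subsample.
K, d, n_train, n_test = 4, 100, 20, 50
mu = rng.normal(0.0, 1.0, size=(K, d))  # hypothetical class means

def draw(n_per_class):
    X = np.vstack([mu[c] + rng.normal(0.0, 0.5, size=(n_per_class, d))
                   for c in range(K)])
    y = np.repeat(np.arange(K), n_per_class)
    return X, y

errs = []
for rep in range(10):  # the thesis repeats its subsampling 100 times
    Xtr, ytr = draw(n_train)
    Xte, yte = draw(n_test)
    errs.append(np.mean(dlda_predict(dlda_fit(Xtr, ytr), Xte) != yte))
print(round(float(np.mean(errs)), 3))
```

On well-separated synthetic classes such as these, the averaged test error is close to zero; the point of the sketch is the loop structure (resample, fit, score, aggregate), not the error level itself.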
Figure 7.1: Isolet data. The left panel shows boxplots of the number of selected features for m-nacc, m-fair and mrmr over 100 iterations. m-fair selects the fewest features, and the number of its outliers (open circles) is zero; the number of outliers for m-nacc is two, and that of mrmr is six. The right panel shows boxplots of the test errors of m-nacc, m-fair, mrmr and m-dlda using the simulated data of the left panel. m-fair is not as stable as the other approaches, as it exhibits many outliers and large errors. m-nacc and m-dlda are superior to the other approaches.
Table 7.3: Results for the Isolet data. Descriptive statistics of the number of selected features and test error for each method over 100 iterations.
No. of selected features

               m-fair   m-nacc   mrmr     m-dlda
SD             43.69    33.38    52.31    0.00
min            25.00    139.00   25.00    617.00
1st quartile   74.00    166.80   171.80   617.00
median         102.00   181.50   189.50   617.00
average        99.33    185.60   205.10   617.00
3rd quartile   130.50   198.00   227.20   617.00
max            197.00   420.00   368.00   617.00
Test error

               m-fair    m-nacc    mrmr      m-dlda
SD             0.13361   0.01483   0.04152   0.01242
min            0.12190   0.11030   0.13410   0.10780
1st quartile   0.15190   0.13710   0.16340   0.14110
median         0.16900   0.14820   0.17670   0.14820
average        0.22500   0.14770   0.18200   0.14850
3rd quartile   0.20510   0.15590   0.19500   0.15520
max            0.61390   0.18410   0.52020   0.18220
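The descriptive statistics reported in Table 7.3 (SD, min, quartiles, median, average and max over the 100 iterations) can be computed as follows; the error values below are synthetic stand-ins, not the actual experimental results.

```python
import numpy as np

rng = np.random.default_rng(1)
# synthetic stand-in for 100 test errors of one method (not the real results)
errors = rng.normal(loc=0.148, scale=0.015, size=100)

stats = {
    "SD": errors.std(ddof=1),
    "min": errors.min(),
    "1st quartile": np.quantile(errors, 0.25),
    "median": np.median(errors),
    "average": errors.mean(),
    "3rd quartile": np.quantile(errors, 0.75),
    "max": errors.max(),
}
for name, value in stats.items():
    print(f"{name:<14s}{value:.5f}")
```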
8 Proofs
Proof of Lemma 4.1
By Condition A, for fixed $\ell\le K$ we have
$$\hat\mu_\ell - \hat\mu = (\mu_\ell-\mu) + (\bar\varepsilon_\ell-\bar\varepsilon) + \sum_{k=1}^K\Big(\frac{n_k}{n}-\pi_k\Big)\mu_k,$$
where $\bar\varepsilon_\ell = (1/n_\ell)\sum_{i=1}^{n_\ell}\varepsilon_{\ell i}$ and $\bar\varepsilon = (1/n)\sum_{\ell=1}^K\sum_{i=1}^{n_\ell}\varepsilon_{\ell i}$. Thus,
$$\widehat M_0N^{1/2} - M_0\Pi^{1/2} = M_0\big(N^{1/2}-\Pi^{1/2}\big) + \widetilde M(N-\Pi)\mathbf 1_K\mathbf 1_K^TN^{1/2} + E_0N^{1/2},$$
where $\widetilde M = [\,\mu_1\cdots\mu_K\,]$ and $E_0 = [\,\bar\varepsilon_1-\bar\varepsilon\ \cdots\ \bar\varepsilon_K-\bar\varepsilon\,]$.
Therefore, $\widehat C^T\widehat C$ can be written as
$$\begin{aligned}
\widehat C^T\widehat C
&= \big(N^{1/2}-\Pi^{1/2}\big)M_0^T\widehat D^{-1}M_0\big(N^{1/2}-\Pi^{1/2}\big)
 + \big(N^{1/2}-\Pi^{1/2}\big)M_0^T\widehat D^{-1}\widetilde M(N-\Pi)\mathbf 1_K\mathbf 1_K^TN^{1/2}\\
&\quad + \big(N^{1/2}-\Pi^{1/2}\big)M_0^T\widehat D^{-1}E_0N^{1/2}
 + \big(N^{1/2}-\Pi^{1/2}\big)M_0^T\widehat D^{-1}M_0\Pi^{1/2}\\
&\quad + N^{1/2}\mathbf 1_K\mathbf 1_K^T(N-\Pi)\widetilde M^T\widehat D^{-1}M_0\big(N^{1/2}-\Pi^{1/2}\big)
 + N^{1/2}\mathbf 1_K\mathbf 1_K^T(N-\Pi)\widetilde M^T\widehat D^{-1}\widetilde M(N-\Pi)\mathbf 1_K\mathbf 1_K^TN^{1/2}\\
&\quad + N^{1/2}\mathbf 1_K\mathbf 1_K^T(N-\Pi)\widetilde M^T\widehat D^{-1}E_0N^{1/2}
 + N^{1/2}\mathbf 1_K\mathbf 1_K^T(N-\Pi)\widetilde M^T\widehat D^{-1}M_0\Pi^{1/2}\\
&\quad + N^{1/2}E_0^T\widehat D^{-1}M_0\big(N^{1/2}-\Pi^{1/2}\big)
 + N^{1/2}E_0^T\widehat D^{-1}\widetilde M(N-\Pi)\mathbf 1_K\mathbf 1_K^TN^{1/2}\\
&\quad + N^{1/2}E_0^T\widehat D^{-1}E_0N^{1/2}
 + N^{1/2}E_0^T\widehat D^{-1}M_0\Pi^{1/2}\\
&\quad + \Pi^{1/2}M_0^T\widehat D^{-1}M_0\big(N^{1/2}-\Pi^{1/2}\big)
 + \Pi^{1/2}M_0^T\widehat D^{-1}\widetilde M(N-\Pi)\mathbf 1_K\mathbf 1_K^TN^{1/2}\\
&\quad + \Pi^{1/2}M_0^T\widehat D^{-1}E_0N^{1/2}
 + \Pi^{1/2}M_0^T\widehat D^{-1}M_0\Pi^{1/2}. \qquad(8.1)
\end{aligned}$$
From Condition B, it follows that $\widehat D = D(1+o_P(1))$ (see Fan and Fan (2008)), and this leads to the following expressions:
$$M_0^T\widehat D^{-1}M_0 = \big((\mu_k-\mu)^TD^{-1}(\mu_\ell-\mu)\big)_{1\le k,\ell\le K}(1+o_P(1)) = \mathbf 1_K\mathbf 1_K^TO(C_d),$$
$$M_0^T\widehat D^{-1}\widetilde M = \big((\mu_k-\mu)^TD^{-1}\mu_\ell\big)_{1\le k,\ell\le K}(1+o_P(1)) = \mathbf 1_K\mathbf 1_K^TO(C_d),$$
$$\widetilde M^T\widehat D^{-1}\widetilde M = \big(\mu_k^TD^{-1}\mu_\ell\big)_{1\le k,\ell\le K}(1+o_P(1)) = \mathbf 1_K\mathbf 1_K^TO(C_d\delta) + I_KO(C_d)$$
by Condition E. From the evaluation of the term $I_3$ on p. 2626 of Fan and Fan (2008), we have
$$M_0^T\widehat D^{-1}E_0 = \mathbf 1_K\mathbf 1_K^To_P(C_d),\qquad \widetilde M^T\widehat D^{-1}E_0 = \mathbf 1_K\mathbf 1_K^To_P(C_d).$$
Consider the matrix $E_0^T\widehat D^{-1}E_0$ of $\widehat C^T\widehat C$. We have
$$E_0^T\widehat D^{-1}E_0 = E_0^TD^{-1}E_0(1+o_P(1)) = \big((\bar\varepsilon_k-\bar\varepsilon)^TD^{-1}(\bar\varepsilon_\ell-\bar\varepsilon)\big)_{1\le k,\ell\le K}(1+o_P(1)).$$
In particular, we need to evaluate the variance term $V\big[(\bar\varepsilon_k-\bar\varepsilon)^TD^{-1}(\bar\varepsilon_\ell-\bar\varepsilon)\big]$. If $k=\ell$, this variance can be obtained as
$$V\big[(\bar\varepsilon_\ell-\bar\varepsilon)^TD^{-1}(\bar\varepsilon_\ell-\bar\varepsilon)\big] = \operatorname{tr}\Big\{(D^{-1}\otimes D^{-1})E\big[(\bar\varepsilon_\ell-\bar\varepsilon)(\bar\varepsilon_\ell-\bar\varepsilon)^T\otimes(\bar\varepsilon_\ell-\bar\varepsilon)(\bar\varepsilon_\ell-\bar\varepsilon)^T\big]\Big\} - \big\{\operatorname{tr}(D^{-1}\Sigma^*)\big\}^2 \qquad(8.2)$$
by Theorem 9.18 of Schott (1996), where $\otimes$ is the Kronecker product; that is, for $A\in\mathbb R^{m\times n}$ and $B\in\mathbb R^{p\times q}$,
$$A\otimes B = \begin{pmatrix} a_{11}B & a_{12}B & \cdots & a_{1n}B\\ a_{21}B & a_{22}B & \cdots & a_{2n}B\\ \vdots & \vdots & \ddots & \vdots\\ a_{m1}B & a_{m2}B & \cdots & a_{mn}B \end{pmatrix}\in\mathbb R^{mp\times nq},$$
and $\Sigma^* = V[\bar\varepsilon_\ell-\bar\varepsilon] = (1/n_\ell-1/n)\Sigma$. Thus, we have $\big\{\operatorname{tr}(D^{-1}\Sigma^*)\big\}^2 = d^2(1/n_\ell-1/n)^2$. Since $D^{-1}\otimes D^{-1}$ is a diagonal matrix, (8.2) can be written as
$$V\big[(\bar\varepsilon_\ell-\bar\varepsilon)^TD^{-1}(\bar\varepsilon_\ell-\bar\varepsilon)\big] = \operatorname{tr}\Big\{(D^{-1}\otimes D^{-1})E\big[\operatorname{diag}\big\{(\bar\varepsilon_\ell-\bar\varepsilon)(\bar\varepsilon_\ell-\bar\varepsilon)^T\otimes(\bar\varepsilon_\ell-\bar\varepsilon)(\bar\varepsilon_\ell-\bar\varepsilon)^T\big\}\big]\Big\} - d^2\Big(\frac1{n_\ell}-\frac1n\Big)^2$$
by the property of the trace of the relevant matrix. The diagonal elements can be written as
$$(D^{-1}\otimes D^{-1})E\big[\operatorname{diag}\big\{(\bar\varepsilon_\ell-\bar\varepsilon)(\bar\varepsilon_\ell-\bar\varepsilon)^T\otimes(\bar\varepsilon_\ell-\bar\varepsilon)(\bar\varepsilon_\ell-\bar\varepsilon)^T\big\}\big] = \operatorname{diag}(v_1,\ldots,v_{d^2}), \qquad(8.3)$$
where
$$v_j = \begin{cases} \dfrac{E\big[(\bar\varepsilon_{\ell s}-\bar\varepsilon_s)^4\big]}{\sigma_{ss}^2} & \text{for } j=(s-1)d+s,\ s\in\{1,\ldots,d\},\\[2ex] \dfrac{E\big[(\bar\varepsilon_{\ell s}-\bar\varepsilon_s)^2(\bar\varepsilon_{\ell t}-\bar\varepsilon_t)^2\big]}{\sigma_{ss}\sigma_{tt}} & \text{for all other values of } j<d^2,\ \text{and } s\ne t, \end{cases}$$
and $\bar\varepsilon_{\ell s}$ and $\bar\varepsilon_s$ are the $s$th elements of $\bar\varepsilon_\ell$ and $\bar\varepsilon$, respectively.
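The Kronecker-product identity behind (8.2), $\operatorname{tr}\{(A\otimes A)(X\otimes X)\} = \{\operatorname{tr}(AX)\}^2$, can be checked numerically; the matrices below are arbitrary examples.

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[0.0, 5.0], [6.0, 7.0]])

K = np.kron(A, B)  # block matrix with (i, j) block equal to a_ij * B
assert K.shape == (4, 4)
assert np.allclose(K[:2, :2], A[0, 0] * B)

# trace identity behind (8.2): tr((A (x) A)(X (x) X)) = {tr(AX)}^2,
# a consequence of the mixed-product property (A (x) A)(X (x) X) = (AX) (x) (AX)
X = np.array([[2.0, 1.0], [1.0, 3.0]])
lhs = np.trace(np.kron(A, A) @ np.kron(X, X))
rhs = np.trace(A @ X) ** 2
assert np.isclose(lhs, rhs)
print(lhs, rhs)  # both equal 361.0
```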
Next, we expand $\bar\varepsilon_{\ell s}-\bar\varepsilon_s$. This difference can be written as
$$\bar\varepsilon_{\ell s}-\bar\varepsilon_s = \frac1{n_\ell}\sum_{i=1}^{n_\ell}\varepsilon_{\ell is} - \frac1n\sum_{k=1}^K\sum_{i=1}^{n_k}\varepsilon_{kis} = \Big(\frac1{n_\ell}-\frac1n\Big)\sum_{i=1}^{n_\ell}\varepsilon_{\ell is} - \frac1n\sum_{k\ne\ell}\sum_{i=1}^{n_k}\varepsilon_{kis}.$$
Using the properties $E[\varepsilon_{\ell s}\varepsilon_{ks}] = E[\varepsilon_{\ell s}]E[\varepsilon_{ks}]$ and $E[\varepsilon_{\ell s}] = 0$, we have
$$E\big[(\bar\varepsilon_{\ell s}-\bar\varepsilon_s)^4\big] = \Big(\frac1{n_\ell}-\frac1n\Big)^4E\Bigg[\Big(\sum_{i=1}^{n_\ell}\varepsilon_{\ell is}\Big)^4\Bigg] + \frac1{n^4}E\Bigg[\Big(\sum_{k\ne\ell}\sum_{i=1}^{n_k}\varepsilon_{kis}\Big)^4\Bigg] + \frac6{n^2}\Big(\frac1{n_\ell}-\frac1n\Big)^2E\Bigg[\Big(\sum_{i=1}^{n_\ell}\varepsilon_{\ell is}\Big)^2\Bigg]E\Bigg[\Big(\sum_{k\ne\ell}\sum_{i=1}^{n_k}\varepsilon_{kis}\Big)^2\Bigg].$$
In particular, we find that
$$E\Bigg[\Big(\sum_{i=1}^{n_\ell}\varepsilon_{\ell is}\Big)^4\Bigg] = n_\ell E\big[\varepsilon_{\ell is}^4\big] + 3n_\ell(n_\ell-1)\big\{E\big[\varepsilon_{\ell is}^2\big]\big\}^2 = n_\ell\xi_{ss} + 3n_\ell(n_\ell-1)\sigma_{ss}^2,$$
$$E\Bigg[\Big(\sum_{k\ne\ell}\sum_{i=1}^{n_k}\varepsilon_{kis}\Big)^4\Bigg] = \sum_{k\ne\ell}n_kE\big[\varepsilon_{kis}^4\big] + 3\sum_{k\ne\ell}n_k(n_k-1)\big\{E\big[\varepsilon_{kis}^2\big]\big\}^2 = (n-n_\ell)\xi_{ss} + 3\sum_{k\ne\ell}n_k(n_k-1)\sigma_{ss}^2,$$
$$E\Bigg[\Big(\sum_{i=1}^{n_\ell}\varepsilon_{\ell is}\Big)^2\Bigg] = V\Bigg[\sum_{i=1}^{n_\ell}\varepsilon_{\ell is}\Bigg] = n_\ell\sigma_{ss},\qquad E\Bigg[\Big(\sum_{k\ne\ell}\sum_{i=1}^{n_k}\varepsilon_{kis}\Big)^2\Bigg] = V\Bigg[\sum_{k\ne\ell}\sum_{i=1}^{n_k}\varepsilon_{kis}\Bigg] = (n-n_\ell)\sigma_{ss},$$
where $\xi_{st} = E\big[\varepsilon_{11s}^2\varepsilon_{11t}^2\big]$.
Therefore, the $((s-1)d+s)$th diagonal element of (8.3) becomes
$$E\big[(\bar\varepsilon_{\ell s}-\bar\varepsilon_s)^4\big] = \frac{(n-n_\ell)(3n_\ell^2-3nn_\ell+n^2)}{n^3n_\ell^3}\xi_{ss} + \frac3{n^4}\sum_{k\ne\ell}n_k(n_k-1)\sigma_{ss}^2 + 3\Big(\frac1{n_\ell}-\frac1n\Big)^2\frac{(n-n_\ell)\big(n_\ell(n_\ell+1)+n(n_\ell-1)\big)}{n_\ell n^2}\sigma_{ss}^2.$$
From tedious but direct calculations we have
$$E\Bigg[\Big(\sum_{i=1}^{n_\ell}\varepsilon_{\ell is}\Big)\Big(\sum_{i=1}^{n_\ell}\varepsilon_{\ell it}\Big)\Bigg] = n_\ell\sigma_{st},$$
$$E\Bigg[\Big(\sum_{i=1}^{n_\ell}\varepsilon_{\ell is}\Big)^2\Big(\sum_{k\ne\ell}\sum_{i=1}^{n_k}\varepsilon_{kit}\Big)^2\Bigg] = n_\ell(n-n_\ell)\sigma_{ss}\sigma_{tt},$$
$$E\Bigg[\Big(\sum_{k\ne\ell}\sum_{i=1}^{n_k}\varepsilon_{kis}\Big)\Big(\sum_{i=1}^{n_\ell}\varepsilon_{\ell is}\Big)\Big(\sum_{k\ne\ell}\sum_{i=1}^{n_k}\varepsilon_{kit}\Big)\Big(\sum_{i=1}^{n_\ell}\varepsilon_{\ell it}\Big)\Bigg] = n_\ell(n-n_\ell)\sigma_{st}^2,$$
$$E\Bigg[\Big(\sum_{i=1}^{n_\ell}\varepsilon_{\ell is}\Big)^2\Big(\sum_{i=1}^{n_\ell}\varepsilon_{\ell it}\Big)^2\Bigg] = n_\ell\xi_{st} + n_\ell(n_\ell-1)\sigma_{ss}\sigma_{tt} + \tau_\ell\sigma_{st}^2,$$
$$E\Bigg[\Big(\sum_{k\ne\ell}\sum_{i=1}^{n_k}\varepsilon_{kis}\Big)^2\Big(\sum_{k\ne\ell}\sum_{i=1}^{n_k}\varepsilon_{kit}\Big)^2\Bigg] = (n-n_\ell)\xi_{st} + (n-n_\ell)(n-n_\ell-1)\sigma_{ss}\sigma_{tt} + \tau_{-\ell}\sigma_{st}^2,$$
where $\tau_\ell$ and $\tau_{-\ell}$ are the numbers of combinations that arose throughout the calculations, and whose orders are $O(n^2)$. The different expressions above lead to
$$E\big[(\bar\varepsilon_{\ell s}-\bar\varepsilon_s)^2(\bar\varepsilon_{\ell t}-\bar\varepsilon_t)^2\big] = \Big(\frac1{n_\ell}-\frac1n\Big)^4E\Bigg[\Big(\sum_{i=1}^{n_\ell}\varepsilon_{\ell is}\Big)^2\Big(\sum_{i=1}^{n_\ell}\varepsilon_{\ell it}\Big)^2\Bigg] + \Big(\frac1{n_\ell}-\frac1n\Big)^2\frac1{n^2}E\Bigg[\Big(\sum_{i=1}^{n_\ell}\varepsilon_{\ell is}\Big)^2\Big(\sum_{k\ne\ell}\sum_{i=1}^{n_k}\varepsilon_{kit}\Big)^2\Bigg]$$
$$\quad + \frac4{n^2}\Big(\frac1{n_\ell}-\frac1n\Big)^2E\Bigg[\Big(\sum_{k\ne\ell}\sum_{i=1}^{n_k}\varepsilon_{kis}\Big)\Big(\sum_{i=1}^{n_\ell}\varepsilon_{\ell is}\Big)\Big(\sum_{k\ne\ell}\sum_{i=1}^{n_k}\varepsilon_{kit}\Big)\Big(\sum_{i=1}^{n_\ell}\varepsilon_{\ell it}\Big)\Bigg]$$
$$\quad + \Big(\frac1{n_\ell}-\frac1n\Big)^2\frac1{n^2}E\Bigg[\Big(\sum_{k\ne\ell}\sum_{i=1}^{n_k}\varepsilon_{kis}\Big)^2\Big(\sum_{i=1}^{n_\ell}\varepsilon_{\ell it}\Big)^2\Bigg] + \frac1{n^4}E\Bigg[\Big(\sum_{k\ne\ell}\sum_{i=1}^{n_k}\varepsilon_{kis}\Big)^2\Big(\sum_{k\ne\ell}\sum_{i=1}^{n_k}\varepsilon_{kit}\Big)^2\Bigg]$$
$$= \frac{(n-n_\ell)(3n_\ell^2-3n_\ell n+n^2)}{n^3n_\ell^3}\xi_{st} - \frac{(n-n_\ell)(n_\ell^2n+3n_\ell^2-n_\ell n^2-3n_\ell n+n^2)}{n_\ell^3n^3}\sigma_{ss}\sigma_{tt} + \tau\sigma_{st}^2,$$
where $\tau = (1/n_\ell-1/n)^4\tau_\ell + (1/n)^4\tau_{-\ell} + 4(1/n_\ell-1/n)^2n_\ell(n-n_\ell)/n^2$. Combining the above calculations results in
$$V\big[(\bar\varepsilon_\ell-\bar\varepsilon)^TD^{-1}(\bar\varepsilon_\ell-\bar\varepsilon)\big] = O\Big(\frac{d^2}{n^3}\Big) + O\Big(\frac1{n^2}\Big)\sum_{s,t}\rho_{st}^2,$$
where $\rho_{st}$ is the $(s,t)$ component of the correlation matrix $R$. The sum is evaluated as
$$\sum_{s,t}\rho_{st}^2 = \mathbf 1_d^T(R\odot R)\mathbf 1_d \le \lambda_{\max}(R)\Big\{\max_{1\le s\le d}\rho_{ss}\Big\}\mathbf 1_d^T\mathbf 1_d \le b_0d$$
by the definition of the parameter space $\Theta$ in (4.3), where $\odot$ is the Hadamard product; that is, if $A$ and $B$ are $m\times n$ matrices, then
$$A\odot B = \begin{pmatrix} a_{11}b_{11} & \cdots & a_{1n}b_{1n}\\ \vdots & \ddots & \vdots\\ a_{m1}b_{m1} & \cdots & a_{mn}b_{mn} \end{pmatrix}.$$
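The Hadamard-product identity and the bound on $\sum_{s,t}\rho_{st}^2$ can be verified numerically for a randomly generated correlation matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 6
L = rng.normal(size=(d, d))
S = L @ L.T                        # random covariance matrix
Dinv = np.diag(1.0 / np.sqrt(np.diag(S)))
R = Dinv @ S @ Dinv                # its correlation matrix

ones = np.ones(d)
sum_rho_sq = float((R ** 2).sum())
# identity: sum_{s,t} rho_st^2 = 1_d^T (R . R) 1_d  (entrywise product)
assert np.isclose(sum_rho_sq, float(ones @ (R * R) @ ones))
# bound: 1_d^T (R . R) 1_d <= lambda_max(R) * max_s rho_ss * d
lam_max = float(np.linalg.eigvalsh(R).max())
assert sum_rho_sq <= lam_max * float(np.diag(R).max()) * d + 1e-9
print(sum_rho_sq, lam_max * d)
```

The bound follows from the Schur product theorem, so the assertion holds for any valid correlation matrix, not just this random instance.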
Therefore, (8.2) can be evaluated as
$$V\big[(\bar\varepsilon_\ell-\bar\varepsilon)^TD^{-1}(\bar\varepsilon_\ell-\bar\varepsilon)\big] = O\Big(\frac{d^2}{n^3}\Big).$$
Using Chebyshev's inequality, for any $\varepsilon>0$, we have
$$P\Bigg(\bigg|\frac{(\bar\varepsilon_\ell-\bar\varepsilon)^TD^{-1}(\bar\varepsilon_\ell-\bar\varepsilon) - E\big[(\bar\varepsilon_\ell-\bar\varepsilon)^TD^{-1}(\bar\varepsilon_\ell-\bar\varepsilon)\big]}{C_d}\bigg| > \varepsilon\Bigg) \le O\Big(\frac{d^2}{n^3C_d^2}\Big) = o(1).$$
Hence, $(\bar\varepsilon_\ell-\bar\varepsilon)^TD^{-1}(\bar\varepsilon_\ell-\bar\varepsilon)$ can be evaluated as
$$(\bar\varepsilon_\ell-\bar\varepsilon)^TD^{-1}(\bar\varepsilon_\ell-\bar\varepsilon) = \Big(\frac1{n_\ell}-\frac1n\Big)d + o_P(C_d).$$
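The limit $(1/n_\ell - 1/n)d$ for this quadratic form can be illustrated by simulation; here the errors are taken to have identity covariance so that $D^{-1} = I$, a simplifying assumption.

```python
import numpy as np

rng = np.random.default_rng(3)
K, d = 3, 20
n_l = np.array([30, 40, 50])
n = int(n_l.sum())

reps = 3000
vals = np.empty(reps)
for r in range(reps):
    eps = [rng.normal(size=(m, d)) for m in n_l]   # Sigma = D = I
    class_means = [e.mean(axis=0) for e in eps]
    grand_mean = np.vstack(eps).mean(axis=0)
    diff = class_means[0] - grand_mean             # class l = 1
    vals[r] = diff @ diff                          # D^{-1} = I here
target = (1.0 / n_l[0] - 1.0 / n) * d
print(vals.mean(), target)  # both close to 0.5
assert abs(vals.mean() - target) < 0.05 * target
```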
Next, we evaluate $V\big[(\bar\varepsilon_\ell-\bar\varepsilon)^TD^{-1}(\bar\varepsilon_k-\bar\varepsilon)\big]$ for $\ell\ne k$. Using Theorems 7.7 and 7.14–7.16 of Schott (1996), we get
$$V\big[(\bar\varepsilon_\ell-\bar\varepsilon)^TD^{-1}(\bar\varepsilon_k-\bar\varepsilon)\big] = \operatorname{tr}\Big\{(D^{-1}\otimes D^{-1})E\big[\operatorname{diag}\big\{(\bar\varepsilon_\ell-\bar\varepsilon)(\bar\varepsilon_k-\bar\varepsilon)^T\otimes(\bar\varepsilon_\ell-\bar\varepsilon)(\bar\varepsilon_k-\bar\varepsilon)^T\big\}\big]\Big\} - \Big\{\operatorname{tr}\big(D^{-1}E\big[\operatorname{diag}\big\{(\bar\varepsilon_\ell-\bar\varepsilon)(\bar\varepsilon_k-\bar\varepsilon)^T\big\}\big]\big)\Big\}^2.$$
We first calculate the $j$th diagonal element of $(\bar\varepsilon_\ell-\bar\varepsilon)(\bar\varepsilon_k-\bar\varepsilon)^T$. By noting that
$$\bar\varepsilon_{\ell j}-\bar\varepsilon_j = \frac1{n_\ell}\sum_{i=1}^{n_\ell}\varepsilon_{\ell ij} - \frac1n\sum_{h=1}^K\sum_{i=1}^{n_h}\varepsilon_{hij} = \Big(\frac1{n_\ell}-\frac1n\Big)\sum_{i=1}^{n_\ell}\varepsilon_{\ell ij} - \frac1n\sum_{i=1}^{n_k}\varepsilon_{kij} - \frac1n\sum_{h\ne\ell,k}\sum_{i=1}^{n_h}\varepsilon_{hij},$$
we have $E\big[(\bar\varepsilon_{\ell j}-\bar\varepsilon_j)(\bar\varepsilon_{kj}-\bar\varepsilon_j)\big] = -\sigma_{jj}/n$. Consequently, we obtain
$$\Big\{\operatorname{tr}\big(D^{-1}E\big[\operatorname{diag}\big\{(\bar\varepsilon_\ell-\bar\varepsilon)(\bar\varepsilon_k-\bar\varepsilon)^T\big\}\big]\big)\Big\}^2 = \frac{d^2}{n^2}.$$
Next, we consider the diagonal matrix
$$(D^{-1}\otimes D^{-1})E\big[\operatorname{diag}\big\{(\bar\varepsilon_\ell-\bar\varepsilon)(\bar\varepsilon_k-\bar\varepsilon)^T\otimes(\bar\varepsilon_\ell-\bar\varepsilon)(\bar\varepsilon_k-\bar\varepsilon)^T\big\}\big] = \operatorname{diag}(u_1,\ldots,u_{d^2}),$$
where
$$u_j = \begin{cases} \dfrac{E\big[(\bar\varepsilon_{\ell s}-\bar\varepsilon_s)^2(\bar\varepsilon_{ks}-\bar\varepsilon_s)^2\big]}{\sigma_{ss}^2} & \text{for } j=(s-1)d+s,\ s\in\{1,\ldots,d\},\\[2ex] \dfrac{E\big[(\bar\varepsilon_{\ell s}-\bar\varepsilon_s)(\bar\varepsilon_{ks}-\bar\varepsilon_s)(\bar\varepsilon_{\ell t}-\bar\varepsilon_t)(\bar\varepsilon_{kt}-\bar\varepsilon_t)\big]}{\sigma_{ss}\sigma_{tt}} & \text{for all other values of } j<d^2,\ \text{and } s\ne t. \end{cases}$$
If $j=(s-1)d+s$, then we have
$$E\big[(\bar\varepsilon_{\ell s}-\bar\varepsilon_s)^2(\bar\varepsilon_{ks}-\bar\varepsilon_s)^2\big] = \frac{n(n_\ell+n_k)-3n_\ell n_k}{n_\ell n_kn^3}\xi_{ss} + \kappa(\ell,k)\sigma_{ss}^2,$$
where $\kappa(\ell,k)$ is the coefficient of $\sigma_{ss}^2$. Note that the order of $\kappa(\ell,k)$ is $O(1/n^2)$, which we state here without giving a detailed proof. On the other hand, for $\ell\ne k$ we have
$$E\big[(\bar\varepsilon_{\ell s}-\bar\varepsilon_s)(\bar\varepsilon_{ks}-\bar\varepsilon_s)(\bar\varepsilon_{\ell t}-\bar\varepsilon_t)(\bar\varepsilon_{kt}-\bar\varepsilon_t)\big] = \frac{n(n_\ell+n_k)-3n_\ell n_k}{n_\ell n_kn^3}\xi_{st} + \Big\{\frac3{n^3} - \frac1{n^2}\Big(\frac1{n_\ell}+\frac1{n_k}\Big) + \frac1{n^2}\Big\}\sigma_{ss}\sigma_{tt} + \tau\sigma_{st}^2,$$
where $\tau = O(1/n^2)$. From the above calculations, we have
$$V\big[(\bar\varepsilon_\ell-\bar\varepsilon)^TD^{-1}(\bar\varepsilon_k-\bar\varepsilon)\big] = O\Big(\frac{d^2}{n^3}\Big) + O\Big(\frac1{n^2}\Big)\sum_{s,t}\rho_{st}^2 = O\Big(\frac{d^2}{n^3}\Big).$$
Chebyshev's inequality now implies that $(\bar\varepsilon_\ell-\bar\varepsilon)^TD^{-1}(\bar\varepsilon_k-\bar\varepsilon) = -d/n + o_P(C_d)$, and consequently,
$$N^{1/2}E_0^T\widehat D^{-1}E_0N^{1/2} = N^{1/2}\big((\bar\varepsilon_\ell-\bar\varepsilon)^TD^{-1}(\bar\varepsilon_k-\bar\varepsilon)\big)_{1\le\ell,k\le K}N^{1/2}(1+o_P(1))$$
$$= N^{1/2}\Big(d\Big(\frac1{n_\ell}-\frac1n\Big)\delta_{\ell k} - \frac dn(1-\delta_{\ell k}) + o_P(C_d)\Big)_{1\le\ell,k\le K}N^{1/2}(1+o_P(1)) = \frac dn\big(I_K - N^{1/2}\mathbf 1_K\mathbf 1_K^TN^{1/2}\big) + \mathbf 1_K\mathbf 1_K^To_P(C_d).$$
The previous calculations can now be summarized and lead to the desired expansion of $\widehat C^T\widehat C/C_d$, namely
$$\frac{\widehat C^T\widehat C}{C_d} = \frac{C^TC}{C_d} + \frac d{nC_d}\big(I_K - N^{1/2}\mathbf 1_K\mathbf 1_K^TN^{1/2}\big) + \mathbf 1_K\mathbf 1_K^To_P(1) = \frac{C^TC}{C_d} + \xi\big(I_K - \Pi^{1/2}\mathbf 1_K\mathbf 1_K^T\Pi^{1/2}\big) + \mathbf 1_K\mathbf 1_K^To_P(1).$$
Proof of Lemma 4.2
From Weyl's inequality (see, e.g., Bhatia (1997)), $\lambda_\alpha$ can be evaluated as
$$\max\Big\{\frac{\lambda^*_{\alpha+1}}{C_d}+\xi,\ \frac{\lambda^*_\alpha}{C_d}\Big\} \le \frac{\lambda_\alpha}{C_d} \le \frac{\lambda^*_\alpha}{C_d}+\xi \qquad(8.4)$$
for $\alpha=1,\ldots,K-1$, and $0\le\lambda_K/C_d\le\lambda^*_K/C_d+\xi = \xi$. In particular, it follows from (8.4) that
$$\frac{\lambda^*_{\alpha+1}}{C_d}+\xi < \frac{\lambda_\alpha}{C_d} \le \frac{\lambda^*_\alpha}{C_d}+\xi$$
by Condition D. Therefore, $\lambda_\alpha/C_d$ is simple.
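Weyl's inequality states in particular that each eigenvalue of a symmetric matrix moves by at most the spectral norm of a symmetric perturbation, which is easy to check numerically; the matrices below are arbitrary examples.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 6
A = rng.normal(size=(n, n)); A = (A + A.T) / 2.0   # symmetric "signal"
E = rng.normal(size=(n, n)); E = (E + E.T) / 2.0   # symmetric perturbation

lam_A = np.sort(np.linalg.eigvalsh(A))[::-1]
lam_AE = np.sort(np.linalg.eigvalsh(A + E))[::-1]
spec_norm_E = float(np.abs(np.linalg.eigvalsh(E)).max())

# Weyl: |lambda_alpha(A + E) - lambda_alpha(A)| <= ||E||_2 for every alpha
assert np.all(np.abs(lam_AE - lam_A) <= spec_norm_E + 1e-9)
print(np.abs(lam_AE - lam_A).max(), spec_norm_E)
```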
Proof of Theorem 4.1
Put $\Gamma_K = [\gamma_1,\ldots,\gamma_K]$, where $\gamma_\ell$ is the eigenvector of $C^TC/C_d + \xi(I_K-\Pi^{1/2}\mathbf 1_K\mathbf 1_K^T\Pi^{1/2})$ belonging to the $\ell$th largest eigenvalue. By Lemma 4.1, we obtain
$$\Gamma_K^T\frac{\widehat C^T\widehat C}{C_d}\Gamma_K = \operatorname{diag}\Big(\frac{\lambda_1}{C_d},\ldots,\frac{\lambda_K}{C_d}\Big)(1+o_P(1)).$$
Let $\widehat H = [\hat h_1\cdots\hat h_K]$, where $\hat h_\ell$ is the eigenvector of $\Gamma_K^T\big(\widehat C^T\widehat C/C_d\big)\Gamma_K$ belonging to the $\ell$th largest eigenvalue. Since all eigenvalues $\lambda_\alpha$ (for $\alpha=1,\ldots,K-1$) are simple by Lemma 4.2, it follows that $\widehat H\stackrel{P}{\longrightarrow}I_K$. From the equation $\Gamma_K^T\big(\widehat C^T\widehat C/C_d\big)\Gamma_K\hat h_\ell = (\hat\lambda_\ell/C_d)\hat h_\ell$ we can see that
$$\Gamma_K^T\frac{\widehat C^T\widehat C}{C_d}\Gamma_K\hat h_\ell = \frac{\hat\lambda_\ell}{C_d}\hat h_\ell \;\Longrightarrow\; \frac{\widehat C^T\widehat C}{C_d}\big(\Gamma_K\hat h_\ell\big) = \frac{\hat\lambda_\ell}{C_d}\big(\Gamma_K\hat h_\ell\big) \;\Longrightarrow\; \frac{\widehat C\widehat C^T}{C_d}\bigg\{\frac{\widehat C\gamma_\ell}{\|\widehat C\gamma_\ell\|}(1+o_P(1))\bigg\} = \frac{\hat\lambda_\ell}{C_d}\bigg\{\frac{\widehat C\gamma_\ell}{\|\widehat C\gamma_\ell\|}(1+o_P(1))\bigg\}. \qquad(8.5)$$
On the other hand,
$$\frac{\widehat C\widehat C^T}{C_d}\hat p_\ell = \frac{\hat\lambda_\ell}{C_d}\hat p_\ell \qquad(8.6)$$
follows from the definition in Subsection 4.1. Now, from (8.5), (8.6) and Lemma 4.2, we conclude that the linear span of the $\hat p_\alpha$ is asymptotically equal to that of the $\widehat C\gamma_\alpha/\|\widehat C\gamma_\alpha\|$. Since eigenvectors have unit length, $\|\hat p_\alpha\| = 1$ and $\operatorname{sgn}(\hat p_{\alpha1}) = \operatorname{sgn}\big((\widehat C\gamma_\alpha/\|\widehat C\gamma_\alpha\|)_1\big)$, where $(\cdot)_1$ denotes the first component of the vector. Therefore, we have
$$\hat p_\alpha = \frac{\widehat C\gamma_\alpha}{\|\widehat C\gamma_\alpha\|}(1+o_P(1)) \;\Longrightarrow\; \hat p_\alpha^T\frac{\widehat C\gamma_\alpha}{\|\widehat C\gamma_\alpha\|} = 1+o_P(1).$$
Proof of Theorem 4.2
From Theorem 4.1 and (4.7), the inner product of $\hat p_\alpha$ and $p_\beta$ is given by
$$\hat p_\alpha^Tp_\beta = \frac{\gamma_\alpha^T\widehat C^TC\gamma_\beta(1+o_P(1))}{\sqrt{\gamma_\alpha^T\widehat C^T\widehat C\gamma_\alpha}\sqrt{\gamma_\beta^TC^TC\gamma_\beta}} = \frac{\gamma_\alpha^T\Pi^{1/2}\widehat M_0^TD^{-1}M_0\Pi^{1/2}\gamma_\beta(1+o_P(1))}{\sqrt{\gamma_\alpha^T\widehat C^T\widehat C\gamma_\alpha}\sqrt{\gamma_\beta^TC^TC\gamma_\beta}}. \qquad(8.7)$$
The numerator of (8.7) can be evaluated as
$$\gamma_\alpha^T\Pi^{1/2}\widehat M_0^TD^{-1}M_0\Pi^{1/2}\gamma_\beta = \gamma_\alpha^T\Pi^{1/2}M_0^TD^{-1}M_0\Pi^{1/2}\gamma_\beta(1+o_P(1)) = \gamma_\alpha^TC^TC\gamma_\beta(1+o_P(1))$$
by Chebyshev's inequality. By Lemma 4.1, $\gamma_\alpha^T\widehat C^T\widehat C\gamma_\alpha$ of (8.7) becomes $\gamma_\alpha^T\widehat C^T\widehat C\gamma_\alpha = \lambda_\alpha(1+o_P(1))$. Notice that $\gamma_\beta^TC^TC\gamma_\beta$ in the denominator of (8.7) can be written as
$$\gamma_\beta^TC^TC\gamma_\beta = \gamma_\beta^T\Big\{\big(C^TC + C_d\xi(I_K-\Pi^{1/2}\mathbf 1_K\mathbf 1_K^T\Pi^{1/2})\big) - C_d\xi\big(I_K-\Pi^{1/2}\mathbf 1_K\mathbf 1_K^T\Pi^{1/2}\big)\Big\}\gamma_\beta = \lambda_\beta - C_d\xi\big(1-\gamma_\beta^T\Pi^{1/2}\mathbf 1_K\mathbf 1_K^T\Pi^{1/2}\gamma_\beta\big).$$
Therefore, we obtain
$$\hat p_\alpha^Tp_\beta = \frac{\kappa_\beta\delta_{\alpha\beta} - \xi(\delta_{\alpha\beta}-\eta_\alpha\eta_\beta)}{\sqrt{\kappa_\alpha}\sqrt{\kappa_\beta-\xi(1-\eta_\beta^2)}}(1+o_P(1)). \qquad(8.8)$$
Proof of Corollary 4.1
From (8.8) in the proof of Theorem 4.2, and the assumptions of Corollary 4.1, it follows that
$$\hat p_\alpha^Tp_\beta = \frac{\kappa_\beta\delta_{\alpha\beta}}{\sqrt{\kappa_\alpha}\sqrt{\kappa_\beta}}(1+o_P(1)) = \begin{cases} 1+o_P(1) & \text{if }\alpha=\beta,\\ o_P(1) & \text{if }\alpha\ne\beta, \end{cases}$$
since $\xi\to0$.
Proof of Theorem 4.3
The inner product of $\hat b^*_\alpha$ and $b^*_\beta$ becomes
$$\hat b^{*T}_\alpha b^*_\beta = \frac{\hat p_\alpha^T\widehat D^{-1/2}D^{-1/2}p_\beta}{\sqrt{\hat p_\alpha^T\widehat D^{-1}\hat p_\alpha}\sqrt{p_\beta^TD^{-1}p_\beta}} = \frac{\gamma_\alpha^T\widehat C^T\widehat D^{-1/2}D^{-1/2}C\gamma_\beta}{\sqrt{\gamma_\alpha^T\widehat C^T\widehat D^{-1}\widehat C\gamma_\alpha}\sqrt{\gamma_\beta^TC^TD^{-1}C\gamma_\beta}}(1+o_P(1)) \qquad(8.9)$$
using Theorem 4.1, (4.7) and (4.11). The numerator of (8.9) can be evaluated as
$$\gamma_\alpha^T\widehat C^T\widehat D^{-1/2}D^{-1/2}C\gamma_\beta = \gamma_\alpha^TC^TD^{-1}C\gamma_\beta(1+o_P(1)).$$
Using (8.1), $\widehat C^T\widehat D^{-1}\widehat C$ of (8.9) is given by
$$\widehat C^T\widehat D^{-1}\widehat C = C^TD^{-1}C + N^{1/2}E_0^T\widehat D^{-2}E_0N^{1/2} + \mathbf 1_K\mathbf 1_K^To(C_d).$$
Therefore, we have
$$\gamma_\alpha^T\widehat C^T\widehat D^{-1}\widehat C\gamma_\alpha \le \gamma_\alpha^TC^TD^{-1}C\gamma_\alpha + \frac1{\sigma_{\min}}\gamma_\alpha^TN^{1/2}E_0^T\widehat D^{-1}E_0N^{1/2}\gamma_\alpha(1+o_P(1)) + o(C_d)$$
$$= \gamma_\alpha^TC^TD^{-1}C\gamma_\alpha\Bigg(1 + C_d\xi\frac1{\sigma_{\min}}\frac{1-\gamma_\alpha^T\Pi^{1/2}\mathbf 1_K\mathbf 1_K^T\Pi^{1/2}\gamma_\alpha}{\gamma_\alpha^TC^TD^{-1}C\gamma_\alpha}(1+o_P(1)) + o\Big(\frac{C_d}{\gamma_\alpha^TC^TD^{-1}C\gamma_\alpha}\Big)\Bigg)$$
$$\le \gamma_\alpha^TC^TD^{-1}C\gamma_\alpha\Bigg(1 + C_d\xi\frac{\sigma_{\max}}{\sigma_{\min}}\frac{1-\gamma_\alpha^T\Pi^{1/2}\mathbf 1_K\mathbf 1_K^T\Pi^{1/2}\gamma_\alpha}{\gamma_\alpha^TC^TC\gamma_\alpha}(1+o_P(1)) + o\Big(\frac{C_d}{\gamma_\alpha^TC^TC\gamma_\alpha}\Big)\Bigg)$$
$$= \gamma_\alpha^TC^TD^{-1}C\gamma_\alpha\Bigg(1 + C_d\xi\frac{\sigma_{\max}}{\sigma_{\min}}\frac{1-\eta_\alpha^2}{\lambda_\alpha - C_d\xi(1-\eta_\alpha^2)}(1+o_P(1)) + o\Big(\frac{C_d}{\lambda_\alpha - C_d\xi(1-\eta_\alpha^2)}\Big)\Bigg)$$
$$= \gamma_\alpha^TC^TD^{-1}C\gamma_\alpha\,\frac{\kappa_\alpha - \xi(1-\eta_\alpha^2)(1-\sigma_{\max}/\sigma_{\min})}{\kappa_\alpha - \xi(1-\eta_\alpha^2)}(1+o_P(1)),$$
where $\sigma_{\max} = \max_{1\le j\le d}\sigma_{jj}$ and $\sigma_{\min} = \min_{1\le j\le d}\sigma_{jj}$. Hence it follows that
$$\hat b^{*T}_\alpha b^*_\beta \ge b^{*T}_\alpha b^*_\beta\frac{\sqrt{\kappa_\alpha-\xi(1-\eta_\alpha^2)}}{\sqrt{\kappa_\alpha-\xi(1-\eta_\alpha^2)(1-\sigma_{\max}/\sigma_{\min})}}(1+o_P(1)). \qquad(8.10)$$
Similarly, we obtain
$$\hat b^{*T}_\alpha b^*_\beta \le b^{*T}_\alpha b^*_\beta\frac{\sqrt{\kappa_\alpha-\xi(1-\eta_\alpha^2)}}{\sqrt{\kappa_\alpha-\xi(1-\eta_\alpha^2)(1-\sigma_{\min}/\sigma_{\max})}}(1+o_P(1)). \qquad(8.11)$$
Proof of Corollary 4.2
From the assumption $d = o(nC_d)$, it follows that (8.10) and (8.11) can be evaluated as
$$\hat b^{*T}_\alpha b^*_\beta \ge b^{*T}_\alpha b^*_\beta\frac{\sqrt{\kappa_\alpha-\xi(1-\eta_\alpha^2)}}{\sqrt{\kappa_\alpha-\xi(1-\eta_\alpha^2)(1-\sigma_{\max}/\sigma_{\min})}}(1+o_P(1)) = b^{*T}_\alpha b^*_\beta(1+o_P(1))$$
and
$$\hat b^{*T}_\alpha b^*_\beta \le b^{*T}_\alpha b^*_\beta\frac{\sqrt{\kappa_\alpha-\xi(1-\eta_\alpha^2)}}{\sqrt{\kappa_\alpha-\xi(1-\eta_\alpha^2)(1-\sigma_{\min}/\sigma_{\max})}}(1+o_P(1)) = b^{*T}_\alpha b^*_\beta(1+o_P(1)),$$
respectively. Therefore, we have $\hat b^{*T}_\alpha b^*_\beta = b^{*T}_\alpha b^*_\beta(1+o_P(1))$.
Evaluation of the Misclassification Rate W(bg, θ)
Suppose that the random vector $X$ belongs to $C_k$. The correct classification rate of $\hat g$ for class $C_k$ is defined as
$$W_k(\hat g,\theta) = P\big(\hat g(X) = k \mid X_{\ell i},\ \ell=1,\ldots,K;\ i=1,\ldots,n_\ell\big) = P\big(\hat g(X) = k \mid \mathcal X\big),$$
where $\mathcal X$ denotes the training data. We have
$$W_k(\hat g,\theta) = P\Bigg(\bigcap_{\alpha\ne k}\bigg\{\omega\in\Omega\ \bigg|\ \Big(X(\omega)-\frac12(\hat\mu_k+\hat\mu_\alpha)\Big)^T\hat w_{k\alpha} > 0\bigg\}\ \Bigg|\ \mathcal X\Bigg) = P\Bigg(\bigcap_{\alpha\ne k}\Big\{\omega\in\Omega\ \Big|\ \hat\delta_{k\alpha}(X(\omega)) > 0\Big\}\ \Bigg|\ \mathcal X\Bigg),$$
where $\hat w_{k\alpha} = \widehat B(\widehat B^T\widehat D\widehat B)^{-1}\widehat B^T(\hat\mu_k-\hat\mu_\alpha)$. We can easily see that
$$\hat\delta_{k\alpha}(X) \sim N\bigg(\Big(\mu_k-\frac12(\hat\mu_k+\hat\mu_\alpha)\Big)^T\hat w_{k\alpha},\ \hat w_{k\alpha}^T\Sigma\hat w_{k\alpha}\bigg),\qquad \alpha\ne k.$$
Therefore, $W_k(\hat g,\theta)$ can be written as
$$W_k(\hat g,\theta) = P\Bigg(\bigcap_{\alpha\ne k}\Big\{\omega\in\Omega\ \Big|\ \hat Z_{k\alpha}(\omega) > -\hat d_{k\alpha}\Big\}\ \Bigg|\ \mathcal X\Bigg),$$
where $\hat Z_{k\alpha} = \big(\hat\delta_{k\alpha}(X)-E\big[\hat\delta_{k\alpha}(X)\big]\big)\big/\sqrt{V\big[\hat\delta_{k\alpha}(X)\big]} \sim N(0,1)$ and
$$\hat d_{k\alpha} = \frac{E\big[\hat\delta_{k\alpha}(X)\big]}{\sqrt{V\big[\hat\delta_{k\alpha}(X)\big]}} = \frac{\big(\mu_k-(\hat\mu_k+\hat\mu_\alpha)/2\big)^T\widehat B(\widehat B^T\widehat D\widehat B)^{-1}\widehat B^T(\hat\mu_k-\hat\mu_\alpha)}{\sqrt{(\hat\mu_k-\hat\mu_\alpha)^T\widehat B(\widehat B^T\widehat D\widehat B)^{-1}\widehat B^T\Sigma\widehat B(\widehat B^T\widehat D\widehat B)^{-1}\widehat B^T(\hat\mu_k-\hat\mu_\alpha)}}. \qquad(8.12)$$
Next, we evaluate the $(i,j)$th element of the covariance matrix of $(\hat Z_{k1},\ldots,\hat Z_{kK})^T$, where $i,j\in\{1,\ldots,K\}\setminus\{k\}$ and $i\ne j$. From $\hat\delta_{k\alpha}(X) - E\big[\hat\delta_{k\alpha}(X)\big] = (X-\mu_k)^T\hat w_{k\alpha}$, $\operatorname{Cov}(\hat Z_{ki},\hat Z_{kj})$ can be written as
$$\operatorname{Cov}(\hat Z_{ki},\hat Z_{kj}) = \frac{\hat w_{ki}^T\Sigma\hat w_{kj}}{\sqrt{\hat w_{ki}^T\Sigma\hat w_{ki}}\sqrt{\hat w_{kj}^T\Sigma\hat w_{kj}}}.$$
Therefore, the covariance matrix of $\widehat Z_k = (\check Z_{k1},\ldots,\check Z_{k(K-1)})^T$ is
$$\widehat\Sigma_k = \widehat W_k^T\Sigma\widehat W_k, \qquad(8.13)$$
where $\check Z_{k\alpha} = I(\alpha<k)\hat Z_{k\alpha} + I(\alpha\ge k)\hat Z_{k(\alpha+1)}$,
$$\widehat W_k = \Bigg[\frac{\check w_{k1}}{\sqrt{\check w_{k1}^T\Sigma\check w_{k1}}}\ \cdots\ \frac{\check w_{k(K-1)}}{\sqrt{\check w_{k(K-1)}^T\Sigma\check w_{k(K-1)}}}\Bigg]$$
and $\check w_{k\alpha} = I(\alpha<k)\hat w_{k\alpha} + I(\alpha\ge k)\hat w_{k(\alpha+1)}$; the check accent marks the re-indexed quantities that skip the index $\alpha = k$. Now consider the region
$$\widehat D_k = \Big\{z\in\mathbb R^{K-1}\ \Big|\ z_\alpha < \check d_{k\alpha},\ \alpha\in\{1,\ldots,K-1\}\Big\},$$
where $\check d_{k\alpha} = I(\alpha<k)\hat d_{k\alpha} + I(\alpha\ge k)\hat d_{k(\alpha+1)}$. Since $-\widehat Z_k$ is also distributed as $N_{K-1}(0,\widehat\Sigma_k)$, the correct classification probability can be obtained as
$$W_k(\hat g,\theta) = P\Bigg(\bigcap_{\alpha=1}^{K-1}\Big\{\omega\in\Omega\ \Big|\ -\check Z_{k\alpha}(\omega) < \check d_{k\alpha}\Big\}\ \Bigg|\ \mathcal X\Bigg) = \int_{\widehat D_k}\frac1{\sqrt{|2\pi\widehat\Sigma_k|}}\exp\Big(-\frac12z^T\widehat\Sigma_k^{-1}z\Big)\,dz = \Phi_{K-1}\big(\widehat D_k;0,\widehat\Sigma_k\big).$$
Therefore, the misclassification rate of $\hat g$ for class $C_k$ becomes
$$1 - W_k(\hat g,\theta) = 1 - \Phi_{K-1}\big(\widehat D_k;0,\widehat\Sigma_k\big).$$
Proof of Theorem 4.4
By Theorem 4.1, $\widehat B$ is given by
$$\widehat B = \widehat D^{-1/2}\widehat P = \widehat D^{-1}\widehat M_0N^{1/2}\Gamma\widehat L^{-1}(1+o_P(1)),$$
where $\widehat L = \operatorname{diag}\big(\|\widehat C\gamma_1\|,\ldots,\|\widehat C\gamma_{K-1}\|\big)$. Using $\widehat D = D(1+o_P(1))$, (8.12) can be evaluated as
$$\hat d_{k\alpha} = \frac{\big(\mu_k-(\hat\mu_k+\hat\mu_\alpha)/2\big)^T\widehat B(\widehat B^T\widehat D\widehat B)^{-1}\widehat B^T(\hat\mu_k-\hat\mu_\alpha)}{\sqrt{(\hat\mu_k-\hat\mu_\alpha)^T\widehat B(\widehat B^T\widehat D\widehat B)^{-1}\widehat B^T\Sigma\widehat B(\widehat B^T\widehat D\widehat B)^{-1}\widehat B^T(\hat\mu_k-\hat\mu_\alpha)}} \ge \frac1{\sqrt{\lambda_{\max}(R)}}\,\frac{I_1N^{1/2}\Gamma\big(\Gamma^TN^{1/2}I_2N^{1/2}\Gamma\big)^{-1}\Gamma^TN^{1/2}I_3^T}{\sqrt{I_3N^{1/2}\Gamma\big(\Gamma^TN^{1/2}I_2N^{1/2}\Gamma\big)^{-1}\Gamma^TN^{1/2}I_3^T}}(1+o_P(1)),$$
where $I_1 = \big(\mu_k-(\hat\mu_k+\hat\mu_\alpha)/2\big)^TD^{-1}\widehat M_0$, $I_2 = \widehat M_0^TD^{-1}\widehat M_0$ and $I_3 = (\hat\mu_k-\hat\mu_\alpha)^TD^{-1}\widehat M_0$. We first calculate $I_3$. Note that $I_3$ can be decomposed as
$$I_3 = (\hat\mu_k-\hat\mu_\alpha)^TD^{-1}\widehat M_0 = \Big[(\hat\mu_k-\hat\mu_\alpha)^TD^{-1}(\hat\mu_1-\hat\mu),\ \ldots,\ (\hat\mu_k-\hat\mu_\alpha)^TD^{-1}(\hat\mu_K-\hat\mu)\Big]. \qquad(8.14)$$
From Condition A, a typical component of (8.14) can be expressed as
$$(\hat\mu_k-\hat\mu_\alpha)^TD^{-1}(\hat\mu_\ell-\hat\mu) = \sum_{h=1}^K\frac{n_h}n\Big[(\mu_k-\mu_\alpha)^TD^{-1}(\mu_\ell-\mu_h) + (\bar\varepsilon_k-\bar\varepsilon_\alpha)^TD^{-1}(\mu_\ell-\mu_h) + (\mu_k-\mu_\alpha)^TD^{-1}(\bar\varepsilon_\ell-\bar\varepsilon_h) + (\bar\varepsilon_k-\bar\varepsilon_\alpha)^TD^{-1}(\bar\varepsilon_\ell-\bar\varepsilon_h)\Big].$$
Then we have
$$(\bar\varepsilon_k-\bar\varepsilon_\alpha)^TD^{-1}(\mu_\ell-\mu_h) = o_P\big((\mu_\ell-\mu_h)^TD^{-1}(\mu_\ell-\mu_h)\big),\qquad (\mu_k-\mu_\alpha)^TD^{-1}(\bar\varepsilon_\ell-\bar\varepsilon_h) = o_P\big((\mu_k-\mu_\alpha)^TD^{-1}(\mu_k-\mu_\alpha)\big)$$
by p. 2625 of Fan and Fan (2008). Next we examine $\sum_{h=1}^K(n_h/n)(\bar\varepsilon_k-\bar\varepsilon_\alpha)^TD^{-1}(\bar\varepsilon_\ell-\bar\varepsilon_h)$, which can be written as
$$\sum_{h=1}^K\frac{n_h}n(\bar\varepsilon_k-\bar\varepsilon_\alpha)^TD^{-1}(\bar\varepsilon_\ell-\bar\varepsilon_h) = \bar\varepsilon_k^TD^{-1}\bar\varepsilon_\ell - \bar\varepsilon_k^TD^{-1}\bar\varepsilon - \bar\varepsilon_\alpha^TD^{-1}\bar\varepsilon_\ell + \bar\varepsilon_\alpha^TD^{-1}\bar\varepsilon.$$
By an argument similar to that given on p. 2627 of Fan and Fan (2008), we obtain
$$\sum_{h=1}^K\frac{n_h}n(\bar\varepsilon_k-\bar\varepsilon_\alpha)^TD^{-1}(\bar\varepsilon_\ell-\bar\varepsilon_h) = \begin{cases} \dfrac d{n_k} + o_P\Big(\dfrac{\sqrt d}n\Big) & \text{if }\ell=k,\\[2ex] -\dfrac d{n_\alpha} + o_P\Big(\dfrac{\sqrt d}n\Big) & \text{if }\ell=\alpha,\\[2ex] o_P\Big(\dfrac{\sqrt d}n\Big) & \text{otherwise}. \end{cases}$$
We also need to evaluate the asymptotic order of $(\mu_k-\mu_\alpha)^TD^{-1}M_0$, which can be written as
$$(\mu_k-\mu_\alpha)^TD^{-1}M_0 = \Bigg[\sum_{\ell=1}^K\pi_\ell(\mu_k-\mu_\alpha)^TD^{-1}(\mu_1-\mu_\ell),\ \ldots,\ \sum_{\ell=1}^K\pi_\ell(\mu_k-\mu_\alpha)^TD^{-1}(\mu_K-\mu_\ell)\Bigg] = \mathbf 1_K^T\Pi F,$$
where $F = [f_1\cdots f_K] = \big((\mu_k-\mu_\alpha)^TD^{-1}(\mu_i-\mu_j)\big)_{1\le i,j\le K}$. Using Condition E and Condition F, the $\ell$th component of $(\mu_k-\mu_\alpha)^TD^{-1}M_0$ has the following form:
$$\mathbf 1_K^T\Pi f_\ell = \begin{cases} C_d\bigg(-\displaystyle\sum_{h\ne k}\sqrt{\pi_h}\frac{\mu_k^TD^{-1}\mu_k}{C_d} - \sqrt{\pi_\alpha}\frac{\mu_\alpha^TD^{-1}\mu_\alpha}{C_d} + \sum_{\beta\ne k}\frac{c_{\beta k}}{C_d}\bigg) & \text{if }\ell=k,\\[2ex] C_d\bigg(\sqrt{\pi_k}\frac{\mu_k^TD^{-1}\mu_k}{C_d} + \displaystyle\sum_{h\ne\alpha}\sqrt{\pi_h}\frac{\mu_\alpha^TD^{-1}\mu_\alpha}{C_d} + \sum_{\beta\ne\alpha}\frac{c_{\beta\alpha}}{C_d}\bigg) & \text{if }\ell=\alpha,\\[2ex] C_d\bigg(\sqrt{\pi_k}\frac{\mu_k^TD^{-1}\mu_k}{C_d} - \sqrt{\pi_\alpha}\frac{\mu_\alpha^TD^{-1}\mu_\alpha}{C_d} + \displaystyle\sum_{\beta\ne\ell}\frac{c_{\beta\ell}}{C_d}\bigg) & \text{otherwise} \end{cases} = O(C_d),$$
where $c_{\beta\ell} = O(C_d\zeta_{\beta\ell})$ and $\zeta_{\beta\ell}\in(0,1)$ for all $\beta,\ell$. Therefore, we have $(\mu_k-\mu_\alpha)^TD^{-1}M_0 = \mathbf 1_K^T\Pi F = O(C_d)\mathbf 1_K^T$. Using the above calculations, we have
$$(\hat\mu_k-\hat\mu_\alpha)^TD^{-1}(\hat\mu_\ell-\hat\mu) = (\mu_k-\mu_\alpha)^TD^{-1}(\mu_\ell-\mu)(1+o_P(1)) + \sum_{h=1}^K\frac{n_h}n(\bar\varepsilon_k-\bar\varepsilon_\alpha)^TD^{-1}(\bar\varepsilon_\ell-\bar\varepsilon_h)$$
$$\quad + o_P\Big(\max_{h\in\{1,\ldots,K\}}\big\{(\mu_\ell-\mu_h)^TD^{-1}(\mu_\ell-\mu_h),\ (\mu_k-\mu_\alpha)^TD^{-1}(\mu_k-\mu_\alpha)\big\}\Big)$$
$$= \begin{cases} \Big\{(\mu_k-\mu_\alpha)^TD^{-1}(\mu_\ell-\mu) + \dfrac d{n_k}\Big\}(1+o_P(1)) & \text{if }\ell=k,\\[2ex] \Big\{(\mu_k-\mu_\alpha)^TD^{-1}(\mu_\ell-\mu) - \dfrac d{n_\alpha}\Big\}(1+o_P(1)) & \text{if }\ell=\alpha,\\[2ex] (\mu_k-\mu_\alpha)^TD^{-1}(\mu_\ell-\mu)(1+o_P(1)) & \text{otherwise}, \end{cases}$$
by Condition D. Thus, it follows that
$$I_3 = \big((\mu_k-\mu_\alpha)^TD^{-1}M_0 + \beta_{k\alpha}\big)(1+o_P(1)),$$
where $\beta_{k\alpha} = (0,\ldots,0,d/n_k,0,\ldots,0,-d/n_\alpha,0,\ldots,0)$. Next, we consider $I_1$. We find that
$$I_1 = \Big(\mu_k-\frac12(\hat\mu_k+\hat\mu_\alpha)\Big)^TD^{-1}\widehat M_0 = -\bar\varepsilon_k^TD^{-1}\widehat M_0 + \frac12(\hat\mu_k-\hat\mu_\alpha)^TD^{-1}\widehat M_0. \qquad(8.15)$$
Similarly, the $\ell$th component of (8.15) becomes
$$-\bar\varepsilon_k^TD^{-1}\widehat M_0 + \frac12(\hat\mu_k-\hat\mu_\alpha)^TD^{-1}\widehat M_0 = \begin{cases} \bigg[\dfrac dn\Big(1-\dfrac n{n_k}\Big) + \dfrac12\Big\{(\mu_k-\mu_\alpha)^TD^{-1}(\mu_\ell-\mu) + \dfrac d{n_k}\Big\}\bigg](1+o_P(1)) & \text{if }\ell=k,\\[2ex] \bigg[\dfrac dn + \dfrac12\Big\{(\mu_k-\mu_\alpha)^TD^{-1}(\mu_\ell-\mu) - \dfrac d{n_\alpha}\Big\}\bigg](1+o_P(1)) & \text{if }\ell=\alpha,\\[2ex] \bigg[\dfrac dn + \dfrac12(\mu_k-\mu_\alpha)^TD^{-1}(\mu_\ell-\mu)\bigg](1+o_P(1)) & \text{otherwise}, \end{cases}$$
$$= \begin{cases} \bigg[\dfrac12(\mu_k-\mu_\alpha)^TD^{-1}(\mu_\ell-\mu) + \dfrac dn\Big(1-\dfrac n{2n_\ell}\Big)\bigg](1+o_P(1)) & \text{if }\ell=k,\alpha,\\[2ex] \bigg[\dfrac12(\mu_k-\mu_\alpha)^TD^{-1}(\mu_\ell-\mu) + \dfrac dn\bigg](1+o_P(1)) & \text{otherwise}. \end{cases}$$
Therefore, we have
$$I_1 = \Big[\frac12(\mu_k-\mu_\alpha)^TD^{-1}M_0 + \frac dn\alpha_{k\alpha}\Big](1+o_P(1)),$$
where
$$\alpha_{k\alpha} = \Big(1,\ldots,1,1-\frac n{2n_k},1,\ldots,1,1-\frac n{2n_\alpha},1,\ldots,1\Big).$$
Finally, we consider $I_2$. It can be written as
$$I_2 = \widehat M_0^TD^{-1}\widehat M_0 = \big((\hat\mu_\alpha-\hat\mu)^TD^{-1}(\hat\mu_\beta-\hat\mu)\big)_{1\le\alpha,\beta\le K}. \qquad(8.16)$$
Each component of (8.16) can be decomposed as
$$(\hat\mu_\alpha-\hat\mu)^TD^{-1}(\hat\mu_\beta-\hat\mu) = \big\{(\mu_\alpha-\mu)^TD^{-1}(\mu_\beta-\mu) + J_1 + J_2\big\}(1+o_P(1)) + J_3, \qquad(8.17)$$
where $J_1 = (\bar\varepsilon_\alpha-\bar\varepsilon)^TD^{-1}(\mu_\beta-\mu)$, $J_2 = (\mu_\alpha-\mu)^TD^{-1}(\bar\varepsilon_\beta-\bar\varepsilon)$ and $J_3 = (\bar\varepsilon_\alpha-\bar\varepsilon)^TD^{-1}(\bar\varepsilon_\beta-\bar\varepsilon)$. From calculations similar to those carried out in the derivation of $I_1$ and $I_3$, we get
$$J_1 = o_P\big((\mu_\beta-\mu)^TD^{-1}(\mu_\beta-\mu)\big),\qquad J_2 = o_P\big((\mu_\alpha-\mu)^TD^{-1}(\mu_\alpha-\mu)\big),$$
$$J_3 = (\bar\varepsilon_\alpha-\bar\varepsilon)^TD^{-1}(\bar\varepsilon_\beta-\bar\varepsilon) = \begin{cases} \dfrac dn\Big(\dfrac n{n_\alpha}-1\Big) + o_P\Big(\dfrac{\sqrt d}n\Big) & \text{if }\alpha=\beta,\\[2ex] -\dfrac dn + o_P\Big(\dfrac{\sqrt d}n\Big) & \text{if }\alpha\ne\beta. \end{cases}$$
Consequently, (8.17) results in
$$(\hat\mu_\alpha-\hat\mu)^TD^{-1}(\hat\mu_\beta-\hat\mu) = \begin{cases} \Big\{(\mu_\alpha-\mu)^TD^{-1}(\mu_\alpha-\mu) + \dfrac dn\Big(\dfrac n{n_\alpha}-1\Big)\Big\}(1+o_P(1)) & \text{if }\alpha=\beta,\\[2ex] \Big\{(\mu_\alpha-\mu)^TD^{-1}(\mu_\beta-\mu) - \dfrac dn\Big\}(1+o_P(1)) & \text{if }\alpha\ne\beta. \end{cases}$$
Therefore, we have
$$I_2 = \Big\{M_0^TD^{-1}M_0 + \frac dn\big(N^{-1}-\mathbf 1_K\mathbf 1_K^T\big)\Big\}(1+o_P(1)).$$
In summary, the components of $\hat d_{k\alpha}$ can be evaluated as
$$I_1N^{1/2} = \Big[\frac12(\mu_k-\mu_\alpha)^TD^{-1}M_0 + \frac dn\alpha_{k\alpha}\Big]N^{1/2}(1+o_P(1)) = \Big[\frac12M_{k\alpha} + \frac dns_{k\alpha}\Pi^{-1/2}\Big](1+o_P(1)) = S_{k\alpha}(1+o_P(1)),$$
$$N^{1/2}I_2N^{1/2} = N^{1/2}\Big[M_0^TD^{-1}M_0 + \frac dn\big(N^{-1}-\mathbf 1_K\mathbf 1_K^T\big)\Big]N^{1/2}(1+o_P(1)) = \Big[C^TC + \frac dn\big(I_K-\Pi^{1/2}\mathbf 1_K\mathbf 1_K^T\Pi^{1/2}\big)\Big](1+o_P(1)),$$
$$I_3N^{1/2} = \big[(\mu_k-\mu_\alpha)^TD^{-1}M_0 + \beta_{k\alpha}\big]N^{1/2}(1+o_P(1)) = \Big[M_{k\alpha} + \frac dnq_{k\alpha}\Pi^{-1/2}\Big](1+o_P(1)) = Q_{k\alpha}(1+o_P(1)),$$
since $N = \Pi(1+o_P(1))$. Therefore, we have
$$\hat d_{k\alpha} \ge \frac{S_{k\alpha}\Gamma\Big[\Gamma^T\big\{C^TC + (d/n)\big(I_K-\Pi^{1/2}\mathbf 1_K\mathbf 1_K^T\Pi^{1/2}\big)\big\}\Gamma\Big]^{-1}\Gamma^TQ_{k\alpha}^T(1+o_P(1))}{\sqrt{\lambda_{\max}(R)}\sqrt{Q_{k\alpha}\Gamma\Big[\Gamma^T\big\{C^TC + (d/n)\big(I_K-\Pi^{1/2}\mathbf 1_K\mathbf 1_K^T\Pi^{1/2}\big)\big\}\Gamma\Big]^{-1}\Gamma^TQ_{k\alpha}^T}}. \qquad(8.18)$$
This completes the proof of Theorem 4.4.
Proof of Corollary 4.3
Using $M_{k\alpha} = \mathbf 1_K^TO(C_d)$ and $C^TC = \mathbf 1_K\mathbf 1_K^TO(C_d)$, $S_{k\alpha}$ of (8.18) becomes
$$S_{k\alpha} = \frac{M_{k\alpha}}2 + C_d\Big(\frac d{nC_d}\Big)s_{k\alpha}\Pi^{-1/2} = \frac{M_{k\alpha}}2 + \mathbf 1_K^To(C_d) = \frac{M_{k\alpha}}2(1+o_P(1)).$$
Similarly, we can obtain $Q_{k\alpha} = M_{k\alpha}(1+o_P(1))$ and $U = C^TC(1+o_P(1))$.
Proof of Theorem 4.6
The right-hand side of (4.17) can be written as
$$1-\Phi\Bigg(\frac{\sqrt{n_1n_2/(dn)}\,\alpha^TD^{-1}\alpha(1+o_P(1)) + (n_1-n_2)\sqrt{d/(nn_1n_2)}}{2\sqrt{\lambda_{\max}(R)}\sqrt{1+n_1n_2\alpha^TD^{-1}\alpha(1+o_P(1))/(dn)}}\Bigg)$$
$$= 1-\Phi\Bigg(\frac{\sqrt{n_1n_2/(dn)}\,\alpha^TD^{-1}\alpha\Big[(1+o_P(1)) + \big\{d/(n_2\alpha^TD^{-1}\alpha)\big\}\big\{(n_1-n_2)/n_1\big\}\Big]}{2\sqrt{\lambda_{\max}(R)}\sqrt{n_1n_2\alpha^TD^{-1}\alpha/(dn)}\sqrt{(n/n_1)\big\{d/(n_2\alpha^TD^{-1}\alpha)\big\} + (1+o_P(1))}}\Bigg)$$
$$= 1-\Phi\Bigg(\frac{\sqrt{n_1n_2/(dn)}\,\alpha^TD^{-1}\alpha\big[(1+o_P(1)) + o(1)O(1)\big]}{2\sqrt{\lambda_{\max}(R)}\sqrt{n_1n_2/(dn)}\sqrt{\alpha^TD^{-1}\alpha}\sqrt{O(1)o(1) + (1+o_P(1))}}\Bigg) = 1-\Phi\bigg(\frac{\sqrt{\alpha^TD^{-1}\alpha}}{2\sqrt{\lambda_{\max}(R)}}(1+o_P(1))\bigg)$$
by the assumption $d = o(nC_d)$. Therefore, we have
$$W_1(\hat g) = \max_{\theta\in\Theta}W_1(\hat g,\theta) = 1-\Phi\bigg(\frac{\sqrt{C_d}}{2\sqrt{b_0}}(1+o_P(1))\bigg).$$
Proof of Corollary 4.4
We derive (4.18) as follows:
$$1-\Phi\Bigg(\frac{\sqrt{n_1n_2/(dn)}\,\alpha^TD^{-1}\alpha(1+o_P(1)) + (n_1-n_2)\sqrt{d/(nn_1n_2)}}{2\sqrt{\lambda_{\max}(R)}\sqrt{1+n_1n_2\alpha^TD^{-1}\alpha(1+o_P(1))/(dn)}}\Bigg)$$
$$= 1-\Phi\Bigg(\frac{\sqrt{\alpha^TD^{-1}\alpha}\Big[(1+o_P(1)) + \big\{d/(n_2C_d)\big\}\big(C_d/\alpha^TD^{-1}\alpha\big)(1-n_2/n_1)\Big]}{2\sqrt{\lambda_{\max}(R)}\sqrt{(n/n_1)\big\{d/(n_2\alpha^TD^{-1}\alpha)\big\} + (1+o_P(1))}}\Bigg)$$
$$= 1-\Phi\Bigg(\frac{\sqrt{\alpha^TD^{-1}\alpha}\Big[(1+o_P(1)) + \big\{d/(n_2C_d)\big\}\big(C_d/\alpha^TD^{-1}\alpha\big)\big\{1-(c_0+o(1))\big\}\Big]}{2\sqrt{\lambda_{\max}(R)}\sqrt{(n/n_1)\big\{d/(n_2\alpha^TD^{-1}\alpha)\big\} + (1+o_P(1))}}\Bigg)$$
$$\ge 1-\Phi\Bigg(\frac{\sqrt{\alpha^TD^{-1}\alpha}\Big[(1+o_P(1)) + \big\{d/(n_2C_d)\big\}\big(C_d/\alpha^TD^{-1}\alpha\big)o(1)\Big]}{2\sqrt{\lambda_{\max}(R)}\sqrt{(n/n_1)\big\{d/(n_2\alpha^TD^{-1}\alpha)\big\} + (1+o_P(1))}}\Bigg)$$
$$= 1-\Phi\Bigg(\frac{\sqrt{\alpha^TD^{-1}\alpha}\big\{(1+o_P(1)) + O(1)O(1)o(1)\big\}}{2\sqrt{\lambda_{\max}(R)}\sqrt{O(1)O(1)O(1) + (1+o_P(1))}}\Bigg) > 1-\Phi\bigg(\frac{\sqrt{\alpha^TD^{-1}\alpha}}{2\sqrt{\lambda_{\max}(R)}}(1+o_P(1))\bigg).$$
9 Conclusion
In this paper, we discussed the asymptotic theory of the multi-class linear discriminant function in a hdlss context. In Section 3, we constructed the linear discriminant function based on naive canonical correlation for the multi-class problem. In Section 4, we derived the asymptotic behavior of the eigenvectors of the naive canonical correlation matrix corresponding to positive eigenvalues. In this asymptotic theory, both the dimension d and the sample size n grow, and provided d does not grow too fast, we showed that all eigenvectors and discriminant directions are hdlss consistent. Under suitable conditions, we were able to derive an upper bound for the worst-case misclassification rate in the multi-class setting. In Section 5, we proposed a feature selection method for hdlss data, called the nacc approach, which uses a discriminant direction. Further, for the general multi-class setting, we proposed and discussed two methods for feature selection, called m-nacc and m-fair, which extend their respective two-class analogues. If the variance is large relative to the difference between the means, as illustrated in Subsection 6.4, nacc and m-nacc performed better than fair and m-fair, respectively. Applications to real data sets demonstrated that nacc and m-nacc perform well.
In recent years, other discriminant methods for hdlss data have been studied by many authors: Marron et al. (2007) proposed Distance-Weighted Discrimination (DWD), which improves the performance of the Support Vector Machine (SVM) in the hdlss setting; Fan et al. (2012) proposed the Regularized Optimal Affine Discriminant (ROAD), which adds an L1-constraint to Fisher's criterion (2.3); and Ahn et al. (2012) proposed a hierarchical clustering algorithm based on the MDP distance of Ahn and Marron (2010).
Other possible research directions include extensions of our theoretical results to the kernel method in linear discrimination described in Mika et al. (1999).
Our approach exploits the naive Bayes rule and replaces $\widehat\Sigma^{-1}$ by the diagonal matrix $\widehat D^{-1}$. On the other hand, replacing $\widehat D^{-1}$ by a certain type of band matrix could also yield efficient linear discriminant functions in a hdlss setting. Such discriminant functions are of interest in practice, especially when relevant correlation information between the observations is lost in the replacement of $\widehat\Sigma^{-1}$ by the diagonal matrix $\widehat D^{-1}$. Theoretical research on $k_0$-banded matrices has been carried out by Bickel and Levina (2008), and their results are expected to apply to linear discriminant functions in hdlss settings. Furthermore, several issues concerning discriminant functions based on an invertible $k_0$-banded matrix remain to be explored: the asymptotic behavior of the misclassification rate, selection criteria for $k_0$, and algorithms for preprocessing the correlation structure of hdlss data before discrimination.
Acknowledgements
First and foremost, I would like to thank my supervisor Prof. Kanta Naito for all of his help and lots of encouragement to me during the bachelor’s, the master’s and doctoral courses. I would also like to thank Prof. Inge Koch for her many valuable comments and suggestions. Prof. Toshihiro Nakanishi and Prof. Daishi Kuroiwa gave me helpful comments about the draft of this thesis.
This work was supported by a Grant-in-Aid for Japan Society for the Promotion of Science (JSPS) Fellows. I am grateful to the JSPS for making this study possible through their financial support.
Finally, I would like to thank my father and mother for their understanding, support and encouragement during this study.
References
Ahn, J., Lee, M. H. and Yoon, Y. J. (2012). Clustering high dimension, low sample size data using the maximal data piling distance,Statistica Sinica, Vol. 22, p. 443.
Ahn, J. and Marron, J. S. (2010). The maximal data piling direction for discrimination, Biometrika, Vol. 97, pp. 254–259.
Ahn, J., Marron, J. S., Muller, K. M. and Chi, Y.-Y. (2007). The high-dimension, low-sample-size geometric representation holds under mild conditions,Biometrika, Vol. 94, pp. 760–766.
Aoshima, M. and Yata, K. (2011). Two-stage procedures for high-dimensional data, Sequential Analysis, Vol. 30, pp. 356–399.
Bhatia, R. (1997). Matrix Analysis: Springer, New York.
Bickel, P. J. and Levina, E. (2004). Some theory for Fisher's linear discriminant function, "naive Bayes" and some alternatives when there are many more variables than observations, Bernoulli, Vol. 10, pp. 989–1010.
Bickel, P. J. and Levina, E. (2008). Regularized estimation of large covariance matrices, The Annals of Statistics, Vol. 36, pp. 199–227.
Bishop, C. M. and Nasrabadi, N. M. (2006). Pattern Recognition and Machine Learning: Springer, New York.
Devroye, L., Györfi, L. and Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition: Springer.
Ding, C. and Peng, H. (2005). Minimum redundancy feature selection from microarray gene expression data,Journal of Bioinformatics and Computational Biology, Vol. 3, pp.
185–205.
Duda, R. O., Hart, P. E. and Stork, D. G. (2012). Pattern classification: John Wiley &
Sons.
Dudoit, S., Fridlyand, J. and Speed, T. P. (2002). Comparison of discrimination methods for the classification of tumors using gene expression data, Journal of the American Statistical Association, Vol. 97, pp. 77–87.
Fan, J. and Fan, Y. (2008). High-dimensional classification using features annealed inde-pendence rules,The Annals of Statistics, Vol. 36, pp. 2605–2637.
Fan, J., Feng, Y. and Tong, X. (2012). A road to classification in high dimensional space:
the regularized optimal affine discriminant, Journal of the Royal Statistical Society:
Series B (Statistical Methodology), Vol. 74, pp. 745–771.
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems, Annals of Eugenics, Vol. 7, pp. 179–188.
Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., Bloomfield, C. D. and Lander, E. S. (1999). Molecular classification of cancer: class discovery and class prediction by gene expression monitoring, Science, Vol. 286, pp. 531–537.
Gordon, G. J., Jensen, R. V., Hsiao, L., Gullans, S. R., Blumenstock, J. E., Ramaswamy, S., Richards, W. G., Sugarbaker, D. J. and Bueno, R. (2002). Translation of microarray data into clinically relevant cancer diagnostic tests using gene expression ratios in lung cancer and mesothelioma,Cancer Research, Vol. 62, pp. 4963–4967.
Hastie, T., Tibshirani, R. and Friedman, J. (2001). The Elements of Statistical Learning:
Springer.
Johnstone, I. M. (2001). On the distribution of the largest principal component, The Annals of Statistics, Vol. 29, pp. 295–327.
Jung, S. and Marron, J. S. (2009). PCA consistency in high dimension, low sample size context,The Annals of Statistics, Vol. 37, pp. 4104–4130.
Khan, J., Wei, J., Ringn´er, M., Saal, L. H., Ladanyi, M., Westermann, F., Berthold, F., Schwab, M., Antonescu, C. R., Peterson, C. and Meltzer, P. S. (2001). Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks,Nat. Med., Vol. 7, pp. 673–679.
Koch, I. and Naito, K. (2010). Prediction of multivariate responses with a selected number of principal components,Computational Statistics &Data Analysis, Vol. 54, pp. 1791–
1807.
Mardia, K. V., Kent, J. and Bibby, J. (1979).Multivariate Analysis: Academic Press.
Marron, J. S., Todd, M. J. and Ahn, J. (2007). Distance-weighted discrimination,Journal of the American Statistical Association, Vol. 102, pp. 1267–1271.
Mika, S., R¨atsch, G., Weston, J. and Sch¨olkopf, B. (1999). Fisher discriminant analysis with kernels,IEEE Neural Networks for Signal Processing Workshop, pp. 41–48.
Peng, H., Long, F. and Ding, C. (2005). Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 27, pp. 1226–1238.
Rao, C. R. (1948). The utilization of multiple measurements in problems of biological classification, Journal of the Royal Statistical Society. Series B (Methodological), Vol.
10, pp. 159–203.
Saeys, Y., Inza, I. and Larranaga, P. (2007). A review of feature selection techniques in bioinformatics,Bioinformatics, Vol. 23, pp. 2507–2517.
Schott, J. (1996). Matrix Analysis for Statistics: New York: Wiley.
Singh, D., Febbo, P. G., Ross, K., Jackson, D. G., Manola, J., Ladd, C., Tamayo, P., Renshaw, A. A., D’Amico, A. V., Richie, J. P., Lander, E. S., Loda, M., Kantoff, P. W.,