GAUSSIAN MIXTURE MODELS AND VC DIMENSIONS

(1)

GAUSSIAN MIXTURE MODELS AND VC DIMENSIONS

YOHJI AKAMA

Abstract. We will establish the VC dimension of the class of translated ellip-soids in d-dimensional Euclidean space is between (d/2+1)2and (6.53486· · · )(d2+ 3d). By this, (d/2 + 1)2_{turns out to be a lower bound of the VC dimension of} the concept class that N -component Gaussian mixture models induce through maximum likelihood estimate.

1. Introduction

A Gaussian Mixture Model (GMM) [5] is often used for probability density estimation. It is a probability density function which is linear combination of N Gaussian distributions where N is a predeﬁned positive integer: The parameters of a GMM are usually estimated by the EM algorithm, which assumes that the number N of the components of the mixture is known a priori, but such information is often unavailable in practice.

For selecting N of GMMs, a typical model selection problem, [7] proposed to apply Structural Risk Minimization (SRM) principle [6], a principle to select the number N of components such that the so-called generalization error of maximum likelihood estimate is small. The generalization error depends on the VC dimen-sion of the concept classCN,d that N -component GMMs induce in the sense of [6, Theorem 5.4]. Here d is the dimension of the ambient space data come from. In [7], in analyzing the SRM for GMMs, the VC dimension ofCN,d is claimed to be O(N d log d).

In this report, we show that the VC dimension of CN,d is at least (d/2 + 1)2. For the purpose, we note CN,d contains C1,d, the class of translated ellipsoids of Rd_{. So the VC dimension of}_C

N,d is greater than or equal to that of C1,d. Then we prove the VC dimension of C1,d is greater than or equal to (d/2 + 1)2. The result is derived from the lower bound of VC dimension of the class that principal component analysis induces [1]. Actually, we establish that the VC dimension of C1,dis between (d/2 + 1)2and (6.53486· · · )(d2+ 3d).

In the next section, the lower bounds of the VC dimension ofC1,d is given. In Section 3 we provides an upper bounds ofC1,d, by Milnor-Thom bound [3] as in [1]. Deﬁnition 1.1. For sets X ⊆ Rd and Y ⊆ X, we say that a set B ⊆ Rd cuts Y out of X if Y = X∩ B. A class C of subsets of Rd is said to shatter a set X⊆ Rd if every Y ⊆ X is cut out of X by some B ∈ C. The VC dimension of C, denoted by VCdim(C), is deﬁned to be the maximum n (or ∞ if no such maximum exists) for which some subset ofRd of cardinality n is shattered byC.

Date: June 18, 2009.

1

人工知能学会研究会資料 SIG-DMSM-A901-08 (7/8)

(2)

2 YOHJI AKAMA

2. The Lower Bounds

Deﬁnition 2.1 ([1]). Let d and k < d be nonnegative integers. For a k-dimensional

aﬃne subspace H ofRd _{and a nonnegative real number r, the set of points x}_{∈ R}d such that the distance from x to H is less than or equal to r is called the band with centre H and radius r. DeﬁneCd

k to be the class of all such bands. For example, Cd

0, C12, C31 and C23 are the classes of balls, usual bands, cylinders and slabs, respectively.

Proposition 2.2 ([1]). For any nonnegative integer d and k < d, (k+1)(d−k+1) ≤

VCdim(C_kd)≤ (8.7403 · · · )(k + 1)(d − k + 1).

Deﬁnition 2.3. Let d be nonnegative integer. Let Γd be the class of translated closed ellipsoids whose principal axes have distinct positive lengths.

For example Γ3be the class of{x ∈ R3|t(x− m)Udiag(s1, s2, s3)tU (x−m) ≤ 1} such that s1> s2> s3> 0, U = (u1u2u3)∈ O(3), the group of orthogonal matrix of size 3, and m∈ R3 are arbitrary. The principal axes areRu1,Ru2,Ru3and their lengths are s−1/2₁ , s−1/2₂ , s−1/2₃ respectively.

Fact 2.4. VCdim(Γd) = VCdim(C1,d).

Lemma 2.5. For any nonnegative integers d, if S_k_≥0Cd

k shatters a set X ⊂ R d then Γd does so.

Proof. Suppose a ﬁnite set X ⊂ Rd _{is shattered by} _Cd

k. Let a subset Y be cut out of X by some band of C_kd such that the centre is the translation by some m ∈ Rd _{of some k-dimensional linear subspace W and the radius is r} _{≥ 0. A} translated ellipsoid E deﬁned below cuts Y out of X: the center is m, the principal axes contains an orthonormal basis of the orthogonal linear space W⊥, the lengths are approximately r, while the other other principal axes are suﬃciently long. E belongs to Γd because the number of principal axes of positive length is d which is

greater than the dimension d− k of W⊥. ¤

Combined with Proposition 2.2 and Fact 2.4

Theorem 2.6. For all nonnegative integer d, VCdim(C1,d)≥ (d/2 + 1)2. 3. The Upper Bounds

Lemma 3.1. For all nonnegative integer d, let

Vd:= O(d)× Rd× (R>0)d, (1)

a smooth manifold of dimension d(d + 3)/2.

Deﬁnition 3.2. For nonnegative integer d, for any (A, m, s1, . . . , sd)∈ Vd, deﬁne a set

ec(A, m, s1, . . . , sd)

:={x ∈ Rd|t(x− m)A diag(s1, . . . , sd)tA(x− m) ≤ 1}.

ec(A, m, s1, . . . , sd₋₀) is the translation by m ∈ Rd of the ellipsoid where so-called principal axes are A_·iR (1 ≤ i ≤ d) and the corresponding lengths are s−1_i (1≤ i ≤ d).

(3)

GAUSSIAN MIXTURE MODELS AND VC DIMENSIONS 3

For all σ :{1, . . . , d} → {−1, 1}, deﬁne a function fσ onVd as follows: fσ((A, m, s1, . . . , sd)) = (A diag(σ(1), . . . , σ(d)), m, s1, . . . , sd) .

Then fσ is a function from Vd to itself, and the set of fσ such that σ is any function from {1, . . . , d} to {−1, 1} forms a ﬁnite smooth transformation group for Vd. Because if fσ ﬁxes some element of Vd then fσ is the identity function, we see Vd/{fσ | σ : {1, . . . , d} → {−1, 1}} is a smooth manifold of dimension dimVd= d(d + 3)/2.

Because si’s are distinct, ec induces a bijection from the d(d + 3)/2-dimensional smooth manifoldVd/{fσ | σ : {1, . . . , d} → {−1, 1}} to Γd, the class of translated closed ellipsoids whose d principal axes have distinct positive lengths.

Lemma 3.3. Γd is a smooth manifold of dimension d(d + 3)/2.

For the Euclidean norm of vector a we write ∥a∥, for the boundary of a set U ⊆ Rd _{we write ∂U . By applying Sard’s Theorem [4] to a smooth manifold} ©

(E, y1, . . . , yg)∈ Γd× (Rd)g | yi ∈ ∂E for all i ∈ S ª

of dimension d(d + 3)/2 + (d− 1)|S| + d(g − |S|), we can prove

Lemma 3.4. For all x1, . . . , xg∈ Rd and ε > 0, there exist y1, . . . , yg∈ Rd with ∥x1− y1∥, . . . , ∥xg− yg∥ < ε such that no set in Γd has more than d(d + 3)/2 of the points y1, . . . , yg on its boundary.

Proposition 3.5 (Milnor [3]). By a real aﬃne variety inRm_{, we mean a system V} of polynomial equations f1(x1, . . . , xm) =· · · = fp(x1, . . . , xm) = 0, for some poly-nomials f1(x1, . . . , xm), . . . , fp(x1, . . . , xm) with real coeﬃcients. We write [V ] := {(x1, . . . , xm)∈ Rm| f1(x1, . . . , xm) =· · · = fp(x1, . . . , xm) = 0}. If each polyno-mial fihas degree less than or equal to δ, then [V ] has at most δ(2δ−1)m−1 arcwise connected components.

Lemma 3.6. For all nonnegative integer d and for all ﬁnite sets Z ⊆ Rd_{, the} number of the arcwise connected components of Γd(Z) ={ E ∈ Γd| Z ⊂ ∂E } is at most 6· 11d2+2d−1_{< 11}d2+2d_.

Theorem 3.7. For all nonnegative integer d, VCdim(Γd)≤ (6.53486 · · · )(d2+ 3d). Proof. Assume that Γd shatters a set X⊆ Rd. Then for all Y ⊆ X there is E ∈ Γd such that

(∗) E◦∩ X ⊆ Y ⊆ E,

where E◦denotes the interior of E. For each Y ⊆ X, let EY be one of E ∈ Γd that satisﬁes (∗) and the cardinality of ∂E ∩ X is maximum.

Claim 3.8. ¯¯{Y ⊆ X | ∂EY ∩ X = Z }¯¯ ≤2|Z|11d

2_+2d

for each Z⊆ X.

By Lemma 3.4, we can perturb X if necessary and assume that ∂EY ∩ X is of size at most d(d+3)/2, which is less than or equal to L := (d2_{+3d)/2. Furthermore} d2_{+ 2d}_{≤ L. Therefore, the Claim implies that}

2|X|≤ X Z⊆X |Z|≤L 2|Z|11d2+2d ≤ L X j=0 µ |X| j ¶ 2j11d2+2d≤ 121L µ e|X| L ¶L ,

where e is the base of natural logarithm and the last inequality is by [2, Proposition A2.1]. Taking Lth roots of both sides and putting u =−|X| ln 2/L, we get e−u≤

(4)

4 YOHJI AKAMA

−242eu/ ln 2, where the equality is attained at u = −13.0697 · · · , thus |X| = −uL/ ln 2 ≤ (13.0697 · · · )L = (6.53486 · · · )(d2_{+ 3d).} _¤ Corollary 3.9. For all nonnegative integer d, VCdim(C1,d) = VCdim(Γd) = Θ(d2_{+ 3d).}

References

[1] Yohji Akama, Kei Irie, Akitoshi Kawamura, and Yasutaka Uwano. VC dimensions of statistical inferences. SIG-FPAI-801 of JSAI on June 2008, 2009. Submitted.

[2] Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. J. Assoc. Comput. Mach., 36(4):929–965, 1989. [3] John W. Milnor. On the Betti numbers of real varieties. Proc. Amer. Math. Soc., 15:275–280,

1964.

[4] John W. Milnor. Topology from the diﬀerentiable viewpoint. Princeton Landmarks in Mathe-matics. Princeton University Press, Princeton, NJ, 1997. Based on notes by David W. Weaver, Revised reprint of the 1965 original.

[5] D. M. Titterington, A. F. M. Smith, and U. E. Makov. Statistical analysis of ﬁnite mixture

distributions. Wiley Series in Probability and Mathematical Statistics: Applied Probability

and Statistics. John Wiley & Sons Ltd., Chichester, 1985.

[6] Vladimir N. Vapnik. Statistical learning theory. Adaptive and Learning Systems for Signal Processing, Communications, and Control. John Wiley & Sons Inc., New York, 1998. A Wiley-Interscience Publication.

[7] Li-Wei Wang and Ju-Fu Feng. Learning Gaussian mixture models by structural risk mini-mization. In Proceedings of the Fourth International Confernece on Machine Learning and

Cybernetics, pages 18–21. IEEE, Aug 2005.

Mathematical Institute, Tohoku University, Aramaki, Aoba-ku, Sendai, Miyagi, 980-8578, Japan