GAUSSIAN MIXTURE MODELS AND VC DIMENSIONS
YOHJI AKAMA
Abstract. We will establish the VC dimension of the class of translated ellip-soids in d-dimensional Euclidean space is between (d/2+1)2and (6.53486· · · )(d2+ 3d). By this, (d/2 + 1)2turns out to be a lower bound of the VC dimension of the concept class that N -component Gaussian mixture models induce through maximum likelihood estimate.
1. Introduction
A Gaussian Mixture Model (GMM) [5] is often used for probability density estimation. It is a probability density function which is linear combination of N Gaussian distributions where N is a predefined positive integer: The parameters of a GMM are usually estimated by the EM algorithm, which assumes that the number N of the components of the mixture is known a priori, but such information is often unavailable in practice.
For selecting N of GMMs, a typical model selection problem, [7] proposed to apply Structural Risk Minimization (SRM) principle [6], a principle to select the number N of components such that the so-called generalization error of maximum likelihood estimate is small. The generalization error depends on the VC dimen-sion of the concept classCN,d that N -component GMMs induce in the sense of [6, Theorem 5.4]. Here d is the dimension of the ambient space data come from. In [7], in analyzing the SRM for GMMs, the VC dimension ofCN,d is claimed to be O(N d log d).
In this report, we show that the VC dimension of CN,d is at least (d/2 + 1)2. For the purpose, we note CN,d contains C1,d, the class of translated ellipsoids of Rd. So the VC dimension ofC
N,d is greater than or equal to that of C1,d. Then we prove the VC dimension of C1,d is greater than or equal to (d/2 + 1)2. The result is derived from the lower bound of VC dimension of the class that principal component analysis induces [1]. Actually, we establish that the VC dimension of C1,dis between (d/2 + 1)2and (6.53486· · · )(d2+ 3d).
In the next section, the lower bounds of the VC dimension ofC1,d is given. In Section 3 we provides an upper bounds ofC1,d, by Milnor-Thom bound [3] as in [1]. Definition 1.1. For sets X ⊆ Rd and Y ⊆ X, we say that a set B ⊆ Rd cuts Y out of X if Y = X∩ B. A class C of subsets of Rd is said to shatter a set X⊆ Rd if every Y ⊆ X is cut out of X by some B ∈ C. The VC dimension of C, denoted by VCdim(C), is defined to be the maximum n (or ∞ if no such maximum exists) for which some subset ofRd of cardinality n is shattered byC.
Date: June 18, 2009.
1
人工知能学会研究会資料 SIG-DMSM-A901-08 (7/8)
2 YOHJI AKAMA
2. The Lower Bounds
Definition 2.1 ([1]). Let d and k < d be nonnegative integers. For a k-dimensional
affine subspace H ofRd and a nonnegative real number r, the set of points x∈ Rd such that the distance from x to H is less than or equal to r is called the band with centre H and radius r. DefineCd
k to be the class of all such bands. For example, Cd
0, C12, C31 and C23 are the classes of balls, usual bands, cylinders and slabs, respectively.
Proposition 2.2 ([1]). For any nonnegative integer d and k < d, (k+1)(d−k+1) ≤
VCdim(Ckd)≤ (8.7403 · · · )(k + 1)(d − k + 1).
Definition 2.3. Let d be nonnegative integer. Let Γd be the class of translated closed ellipsoids whose principal axes have distinct positive lengths.
For example Γ3be the class of{x ∈ R3|t(x− m)Udiag(s1, s2, s3)tU (x−m) ≤ 1} such that s1> s2> s3> 0, U = (u1u2u3)∈ O(3), the group of orthogonal matrix of size 3, and m∈ R3 are arbitrary. The principal axes areRu1,Ru2,Ru3and their lengths are s−1/21 , s−1/22 , s−1/23 respectively.
Fact 2.4. VCdim(Γd) = VCdim(C1,d).
Lemma 2.5. For any nonnegative integers d, if Sk≥0Cd
k shatters a set X ⊂ R d then Γd does so.
Proof. Suppose a finite set X ⊂ Rd is shattered by Cd
k. Let a subset Y be cut out of X by some band of Ckd such that the centre is the translation by some m ∈ Rd of some k-dimensional linear subspace W and the radius is r ≥ 0. A translated ellipsoid E defined below cuts Y out of X: the center is m, the principal axes contains an orthonormal basis of the orthogonal linear space W⊥, the lengths are approximately r, while the other other principal axes are sufficiently long. E belongs to Γd because the number of principal axes of positive length is d which is
greater than the dimension d− k of W⊥. ¤
Combined with Proposition 2.2 and Fact 2.4
Theorem 2.6. For all nonnegative integer d, VCdim(C1,d)≥ (d/2 + 1)2. 3. The Upper Bounds
Lemma 3.1. For all nonnegative integer d, let
Vd:= O(d)× Rd× (R>0)d, (1)
a smooth manifold of dimension d(d + 3)/2.
Definition 3.2. For nonnegative integer d, for any (A, m, s1, . . . , sd)∈ Vd, define a set
ec(A, m, s1, . . . , sd)
:={x ∈ Rd|t(x− m)A diag(s1, . . . , sd)tA(x− m) ≤ 1}.
ec(A, m, s1, . . . , sd−0) is the translation by m ∈ Rd of the ellipsoid where so-called principal axes are A·iR (1 ≤ i ≤ d) and the corresponding lengths are s−1i (1≤ i ≤ d).
GAUSSIAN MIXTURE MODELS AND VC DIMENSIONS 3
For all σ :{1, . . . , d} → {−1, 1}, define a function fσ onVd as follows: fσ((A, m, s1, . . . , sd)) = (A diag(σ(1), . . . , σ(d)), m, s1, . . . , sd) .
Then fσ is a function from Vd to itself, and the set of fσ such that σ is any function from {1, . . . , d} to {−1, 1} forms a finite smooth transformation group for Vd. Because if fσ fixes some element of Vd then fσ is the identity function, we see Vd/{fσ | σ : {1, . . . , d} → {−1, 1}} is a smooth manifold of dimension dimVd= d(d + 3)/2.
Because si’s are distinct, ec induces a bijection from the d(d + 3)/2-dimensional smooth manifoldVd/{fσ | σ : {1, . . . , d} → {−1, 1}} to Γd, the class of translated closed ellipsoids whose d principal axes have distinct positive lengths.
Lemma 3.3. Γd is a smooth manifold of dimension d(d + 3)/2.
For the Euclidean norm of vector a we write ∥a∥, for the boundary of a set U ⊆ Rd we write ∂U . By applying Sard’s Theorem [4] to a smooth manifold ©
(E, y1, . . . , yg)∈ Γd× (Rd)g | yi ∈ ∂E for all i ∈ S ª
of dimension d(d + 3)/2 + (d− 1)|S| + d(g − |S|), we can prove
Lemma 3.4. For all x1, . . . , xg∈ Rd and ε > 0, there exist y1, . . . , yg∈ Rd with ∥x1− y1∥, . . . , ∥xg− yg∥ < ε such that no set in Γd has more than d(d + 3)/2 of the points y1, . . . , yg on its boundary.
Proposition 3.5 (Milnor [3]). By a real affine variety inRm, we mean a system V of polynomial equations f1(x1, . . . , xm) =· · · = fp(x1, . . . , xm) = 0, for some poly-nomials f1(x1, . . . , xm), . . . , fp(x1, . . . , xm) with real coefficients. We write [V ] := {(x1, . . . , xm)∈ Rm| f1(x1, . . . , xm) =· · · = fp(x1, . . . , xm) = 0}. If each polyno-mial fihas degree less than or equal to δ, then [V ] has at most δ(2δ−1)m−1 arcwise connected components.
Lemma 3.6. For all nonnegative integer d and for all finite sets Z ⊆ Rd, the number of the arcwise connected components of Γd(Z) ={ E ∈ Γd| Z ⊂ ∂E } is at most 6· 11d2+2d−1< 11d2+2d.
Theorem 3.7. For all nonnegative integer d, VCdim(Γd)≤ (6.53486 · · · )(d2+ 3d). Proof. Assume that Γd shatters a set X⊆ Rd. Then for all Y ⊆ X there is E ∈ Γd such that
(∗) E◦∩ X ⊆ Y ⊆ E,
where E◦denotes the interior of E. For each Y ⊆ X, let EY be one of E ∈ Γd that satisfies (∗) and the cardinality of ∂E ∩ X is maximum.
Claim 3.8. ¯¯{Y ⊆ X | ∂EY ∩ X = Z }¯¯ ≤2|Z|11d
2+2d
for each Z⊆ X.
By Lemma 3.4, we can perturb X if necessary and assume that ∂EY ∩ X is of size at most d(d+3)/2, which is less than or equal to L := (d2+3d)/2. Furthermore d2+ 2d≤ L. Therefore, the Claim implies that
2|X|≤ X Z⊆X |Z|≤L 2|Z|11d2+2d ≤ L X j=0 µ |X| j ¶ 2j11d2+2d≤ 121L µ e|X| L ¶L ,
where e is the base of natural logarithm and the last inequality is by [2, Proposition A2.1]. Taking Lth roots of both sides and putting u =−|X| ln 2/L, we get e−u≤
4 YOHJI AKAMA
−242eu/ ln 2, where the equality is attained at u = −13.0697 · · · , thus |X| = −uL/ ln 2 ≤ (13.0697 · · · )L = (6.53486 · · · )(d2+ 3d). ¤ Corollary 3.9. For all nonnegative integer d, VCdim(C1,d) = VCdim(Γd) = Θ(d2+ 3d).
References
[1] Yohji Akama, Kei Irie, Akitoshi Kawamura, and Yasutaka Uwano. VC dimensions of statistical inferences. SIG-FPAI-801 of JSAI on June 2008, 2009. Submitted.
[2] Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. J. Assoc. Comput. Mach., 36(4):929–965, 1989. [3] John W. Milnor. On the Betti numbers of real varieties. Proc. Amer. Math. Soc., 15:275–280,
1964.
[4] John W. Milnor. Topology from the differentiable viewpoint. Princeton Landmarks in Mathe-matics. Princeton University Press, Princeton, NJ, 1997. Based on notes by David W. Weaver, Revised reprint of the 1965 original.
[5] D. M. Titterington, A. F. M. Smith, and U. E. Makov. Statistical analysis of finite mixture
distributions. Wiley Series in Probability and Mathematical Statistics: Applied Probability
and Statistics. John Wiley & Sons Ltd., Chichester, 1985.
[6] Vladimir N. Vapnik. Statistical learning theory. Adaptive and Learning Systems for Signal Processing, Communications, and Control. John Wiley & Sons Inc., New York, 1998. A Wiley-Interscience Publication.
[7] Li-Wei Wang and Ju-Fu Feng. Learning Gaussian mixture models by structural risk mini-mization. In Proceedings of the Fourth International Confernece on Machine Learning and
Cybernetics, pages 18–21. IEEE, Aug 2005.
Mathematical Institute, Tohoku University, Aramaki, Aoba-ku, Sendai, Miyagi, 980-8578, Japan