Section 92 Schedule for 2017 Web Information Extraction and Retrieval Ch92

(1)

Mixture of Gaussians

Sargur Srihari

srihari@cedar.buffalo.edu

(2)

Mixtures of Gaussians

• Also Called as Finite Mixture Models (FMMs)

∑

=

Σ

=

K

k

k k

k

^N ^x

p

1

)

,

|

(

)

x

( π µ

Goal: Determine the three sets of parameters using maximum likelihood

(3)

Gaussian Mixture Model

• A linear superposition of Gaussian components

• Provides a richer class of density models than

the single Gaussian

• GMM are formulated in terms of discrete latent

variables

–Provides deeper insight –Motivates EM algorithm

•  Which gives a maximum likelihood solution to no. of components and their means/covariances

(4)

GMM Formulation

• Linear superposition of K Gaussians:

• Use 1-of-K representation for latent variable _z

–Let z = z₁,..,z_K whose elements are

–There are K possible states of z corresponding to K components

• Define joint distribution p(x,z)=p(x|z)p(z)

∑

=

Σ

=

K

k

k k

k

^N ^x

p

1

)

,

|

(

)

x

( π µ

marginal conditional

z_k ∈{0,1} and ^z_k _{= 1}

k

∑

(5)

p(z)

• We need to associate a probability with each

component _k

• Denote where parameters

satisfy

• Because z uses 1-of-K it follows that

–since components are mutually exclusive and hence are independent

1

(z)

^k

K

z k k

p π

=

= _∏

p(z

_k

_{= 1) =} ^π

_k

0 _≤ _π

k ^{≤ 1 and} ^πk _{= 1} k

∑

With one component p(z₁_{) =} ^π₁^z¹

With two components p(z₁, z₂_{) =} ^π₁^z¹^π₂^z²

(6)

Conditional distribution _p(x|z)

• For a particular component (value of z)

• Thus p(x|z) can be written in the form

• Thus marginal distribution of x is obtained by summing over all possible states of z

• This is the standard form of a Gaussian mixture

p(x | z) = ^{N x |}( ^µk^,Σk)^z^k

k=1 K

∏

p(x) = p(z) p(x | z) = π_k^z^k

k=1 K

∏

^{N x |}⁽ ^µ^k^,^Σ^k⁾^z^k

z

∑

⁼ ^π^k^{N x |}⁽ ^µ^k^,^Σ^k⁾

k=1 K

∑

z

∑

p(x | z_k = 1) = N (x | µ_k_{, Σ}_k ⁾

Due to the exponent z_kall product terms except for one equal one

(7)

7

Latent Variable

• If we have observations x₁,..,x_N

• Because marginal distribution is in the form

– It follows that for every observed data point x_n there is a corresponding latent vector z_n, i.e., its sub-class

• Thus we have found a formulation of Gaussian mixture involving an explicit latent variable

– We are now able to work with joint distribution p(x,z) instead of marginal p(x)

• Leads to significant simplification through introduction of expectation maximization

p(x) ₌ p(x,z)

z

∑

(8)

Another conditional probability (Responsibility)

• In EM p(z |x) plays a role

• The probability p(z_k=1|x) is denoted

• From Bayes theorem

• View as prior probability of component k as the posterior probability

it is also the responsibility that component k takes for explaining the observation x

γ(z_k) ≡ p(z_k = 1 | x) = ^p(z^k = 1) p(x | z_k = 1) p(z_j = 1) p(x | z _j = 1)

j=1 K

∑

= ^π^k^N(^{x |}^µ^k^,^Σ^k⁾ π_jN(x |µ_k^,Σ_j)

j=1 K

∑

γ ^(z_k ⁾

γ (z_k _{) = p(z}_k _{= 1| x)}

p(z_k _{= 1) =} ^π_k

(9)

Plan of Discussion

• Next we look at

1. How to get data from a mixture model synthetically and then

2. How to estimate the parameters from the data

(10)

Synthesizing data from mixture

• Use ancestral sampling

– Start with lowest numbered node and draw a sample,

•  Generate sample of z, called z^{^}

– move to successor node and draw a sample given the parent value, etc.

•  Then generate a value for x from conditional p(x|z^{^})

• Samples from p(x,z) are plotted

according to value of x and colored with value of z

• Samples from marginal p(x)

obtained by ignoring values of z

500 points from three Gaussians

Complete Data set

Incomplete Data set

(11)

11

Illustration of responsibilities

• Evaluate for every data point

–Posterior probability of each component

• Responsibility is

associated with data point x

_n

• Color using proportion of red, blue and

green ink

–If for a data point it is colored red

–If for another point it has equal blue and green and will

appear as cyan

γ ^(z_nk ⁾

γ ^(z_n₁_{) = 1}

γ ^(z_n₂_{) =} γ ^(z_n₃_{) = 0.5}

(12)

Maximum Likelihood for GMM

• We wish to model data set {x₁,..x_N} using a mixture of Gaussians (N items each of dimension D)

• Represent by _{N x D} matrix X

–_N^th row is given by x_n^T

• Represent N latent variables with N x K matrix Z with rows z_n^T

• Bayesian Network for p(x,z) is

We are given N i.i.d. samples {xn} with unknown latent points {zn} with parameters πn.

The samples have parameters: mean µ and covariance Σ

Goal is to estimate the three sets of parameters

X ₌ x₁ x₂ xⁿ

"

#

$ $

$

%

& ' ' ' '

Z = z1

z2

zn

"

#

$

%

& ' ' ' '

(13)

Likelihood Function for GMM

13

p(_{x) =} p(z) p(x | z) = ^π_k^N(x |^µk^,Σk)

k=1 K

∑

z

∑

Therefore Likelihood function is

p(X |^π,_µ_{,Σ) =}

k=1

∑

K ^π_kN(xn ^|^µk^,Σk⁾

⎧ ⎨

⎩

⎫ ⎬

n=1 ⎭

∏

N

Therefore log-likelihood function is

ln p(X |^π,_µ_{,Σ) =} ln

k=1

∑

⎧ ⎨

⎩

⎫ ⎬

n=1 ⎭

∑

N

Which we wish to maximize

A more difficult problem than for a single Gaussian

Mixture density function is

Summation is for

Marginalizing the joint

Product is over the N i.i.d. samples

(14)

Maximization of Log-Likelihood

• Goal is to estimate the three sets of

parameters

• Before proceeding with the m.l.e. briefly

mention two technical issues:

–Problem of singularities

–Identifiability of mixtures ₁₄

ln p(X | ^π , _µ _{, Σ) =} ln

k=1

∑

K

^π

k

^{N (x}

n

^| ^µ

k

^{, Σ}

k

⁾

⎧ ⎨

⎩

⎫ ⎬

⎭

n=1

∑

N

π _, _µ _{, Σ}

(15)

15

• Consider Gaussian mixture

– components with covariance matrices

• Data point that falls on a mean will contribute to the likelihood function

• As term goes to infinity

• Therefore maximization of log-likelihood is not well-posed

– Interestingly, this does not happen in the case of a single Gaussian

•  Multiplicative factors go to zero

– Does not happen in the Bayesian approach

• Problem is avoided using heuristics

– Resetting mean or covariance

2

1/2

1 1

(x | x , )

n n j (2 )

j

N σ I

π σ

=

One component assigns finite values and other to large value

Multiplicative values Take it to zero

σ j ^{→ 0}

Σk ₌ ^σk 2_I

µ_j _{= x}_n

(16)

Problem of Identifiability

A K-component mixture will have a total of K! equivalent solutions

– Corresponding to K! ways of assigning K sets of parameters to K components

•  E.g., for K=3 K!=6: 123, 132, 213, 231, 312, 321

– For any given point in the space of parameter values there will be a further K!-1 additional points all giving exactly same distribution

• However any of the equivalent solutions is as good as the other

A

B

C B

A C Two ways of labeling three Gaussian subclasses

A density p(x |θ) is identifiable if θ ≠θ ' then there is an x for which p(x |θ) ≠ p(x |θ')

(17)

EM for Gaussian Mixtures

• EM is a method for finding maximum likelihood solutions for models with latent variables

• Begin with log-likelihood function

• We wish to find that maximize this quantity

• Take derivatives in turn w.r.t

– Means and set to zero

– covariance matrices and set to zero – mixing coefficients and set to zero

ln p(X |^π,_µ_{,Σ) =} ln

k=1

∑

⎧ ⎨

⎩

⎫ ⎬

n=1 ⎭

∑

N

π _, _µ _{, Σ}

µ_k

Σ

^k

π

k

(18)

18

EM for GMM: Derivative wrt

• Begin with log-likelihood function

• Take derivative w.r.t the means and set to zero

– Making use of exponential form of Gaussian – Use formulas:

– We get

the posterior probabilities

ln p(X |^π,_µ_{,Σ) =} ln

k=1

∑

⎧ ⎨

⎩

⎫ ⎬

n=1 ⎭

∑

N

0 = ^π^k^N(xⁿ ^{| µ}^k^,Σ^k⁾ π _j_N(x_n _{| µ}_j_,Σ_j₎

∑

j n=1

N

∑

(x_n ^{− µ}_k

k

∑

−1 ⁾

d

dx ^{ln u =} u_'

u

and ^d dx

e^u _{= e}^uu_'

Inverse of covariance matrix

µ

_k

µ

k

γ ^(z ⁾

(19)

M.L.E. solution for Means

• Multiplying by (assuming non-

singularity)

• Where we have defined

–Which is the effective number of points assigned to cluster k

1

1 ( )x

N

k nk n

k n

z

µ N γ

=

_∑

Mean of k^th Gaussian component is the weighted mean of all the points in the data set:

where data point x_n is weighted by the posterior probability that

component k was responsible for generating x_n

N_k ₌ _γ_(z_nk₎

n=1 N

∑

Σ

^k

(20)

M.L.E. solution for Covariance

• Set derivative wrt to zero

–Making use of mle solution for covariance matrix of single Gaussian

–Similar to result for a single Gaussian for the data set but each data point weighted by the corresponding posterior probability

–Denominator is effective no of points in component

Σ_k = ¹

N_k ^γ(z^nk^)(xⁿ ^{− µ}^k^)(xⁿ ^{− µ}^k⁾

T n=1

N

∑

Σ

^k

(21)

M.L.E. solution for Mixing Coefficients

• Maximize w.r.t.

–Must take into account that mixing coefficients sum to one

–Achieved using Lagrange multiplier and maximizing

–Setting derivative wrt to zero and solving gives

k k

N

π

= N

ln p(X |_{π,µ,Σ) + λ} _π_k ₋₁

k=1

∑

K

⎛

⎝ ⎜ ^⎞

⎠ ⎟

ln p(X | ^π , _µ , Σ) ^π

_k

π

k

(22)

Summary of m.l.e. expressions

• GMM maximum likelihood estimates

22 1

1 ( )x

N

k nk n

k n

z

µ N γ

=

_∑

Σ_k = ¹

N_k ^γ(z^nk^)(xⁿ ^{− µ}^k^)(xⁿ ^{− µ}^k⁾

T n=1

N

∑

k k

N

π

= N ^Nk ₌ ^γ^(znk⁾ n=1

N

∑

Parameters (means)

Parameters(covariance matrices)

Parameters

(Mixing Coefficients)

All three are in terms of responsibliities

(23)

EM Formulation

• The results for are not closed form

solutions for the parameters

–Since the responsibilities depend on those parameters in a complex way

• Results suggest an iterative solution

• An instance of EM algorithm for the

particular case of GMM

µ_k^{, Σ}_k^,π_k γ ^(z_nk ⁾

(24)

Informal EM for GMM

• First choose initial values for means, covariances and mixing coefficients

• Alternate between following two updates

– Called E step and M step

• In E step use current value of parameters to evaluate posterior probabilities, or

responsibilities

• In the M step use these posterior probabilities to to re-estimate means, covariances and mixing coefficients

(25)

EM using Old Faithful

Data points and Initial mixture model

Initial E step Determine responsibilities

After first M step

Re-evaluate Parameters

After 2 cycles After 5 cycles After 20 cycles

(26)

Comparison with K-Means

K-means result E-M result

(27)

Animation of EM for Old Faithful Data

• http://en.wikipedia.org/wiki/

File:Em_old_faithful.gif

#initial parameter estimates (chosen to be deliberately bad) theta <- list( tau=c(0.5,0.5),

mu1=c(2.8,75), mu2=c(3.6,58),

sigma1=matrix(c(0.8,7,7,70),ncol=2), sigma2=matrix(c(0.8,7,7,70),ncol=2) ) Code in R

(28)

Practical Issues with EM

• Takes many more iterations than K-means

• Each cycle requires significantly more

comparison

• Common to run K-means first in order to

find suitable initialization

• Covariance matrices can be initialized to

covariances of clusters found by K-means

• EM is not guaranteed to find global

maximum of log likelihood function

(29)

Summary of EM for GMM

• Given a Gaussian mixture model

• Goal is to maximize the likelihood function

w.r.t. the parameters (means, covariances

and mixing coefficients)

Step1: Initialize the means , covariances

and mixing coefficients and evaluate

initial value of log-likelihood

µ

k

Σ

^k

π

k

(30)

EM continued

•_{Step 2:} E step: Evaluate responsibilities using current parameter values

•_{Step 3:} M Step: Re-estimate parameters using current responsibilities

1

(x | µ , ) ( )=

(x | µ , ))

k k k

k _K

j j j

j

z N

N γ π

π

=

Σ

∑ Σ

where

Σ_k^new = ¹

N_k ^γ^(z^nk^)(xⁿ ⁻^µ^k

new)(x_n ₋_µ_k^new)^T

n=1 N

∑

µ_k^new = ¹

N_k ^γ^(z^nk^)xⁿ

n=1 N

∑

N_k ₌ _γ_(z_nk₎

n=1 N

∑

πk

new = ^N^k N

(31)

EM Continued

• Step 4: Evaluate the log likelihood

–And check for convergence of either parameters or log likelihood

• If convergence not satisfied return to Step 2

ln p(X |^π,_µ_{,Σ) =} ln

k=1

∑

⎧ ⎨

⎩

⎫ ⎬

n=1 ⎭

∑

N

Section 92 Schedule for 2017 Web Information Extraction and Retrieval Ch92

Mixture of Gaussians

Sargur Srihari

srihari@cedar.buffalo.edu

Mixtures of Gaussians

• Also Called as Finite Mixture Models (FMMs)

∑

Σ

=

N x

p

)

,

|

(

)

x

( π µ

Gaussian Mixture Model

• A linear superposition of Gaussian components

• Provides a richer class of density models than

the single Gaussian

• GMM are formulated in terms of discrete latent

variables

GMM Formulation

• Linear superposition of K Gaussians:

• Use 1-of-K representation for latent variable z

• Define joint distribution p(x,z)=p(x|z)p(z)

∑

Σ

=

N x

p

)

,

|

(

)

x

( π µ

∑

p(z)

• We need to associate a probability with each

component k

• Denote where parameters

satisfy

• Because z uses 1-of-K it follows that

(z)

p π

= ∏

p(z

= 1) = π

∑

Conditional distribution p(x|z)

∏

∏

∑

∑

∑

Latent Variable

∑

Another conditional probability (Responsibility)

Plan of Discussion

• Next we look at

Synthesizing data from mixture

Illustration of responsibilities

• Evaluate for every data point

• Responsibility is

associated with data point x

• Color using proportion of red, blue and

green ink

Maximum Likelihood for GMM

Likelihood Function for GMM

∑

∑

∑

∏

∑

∑

Maximization of Log-Likelihood

^N ^x

• Use 1-of-K representation for latent variable _z

^N ^x

component _k

= _∏

_{= 1) =} ^π

Conditional distribution _p(x|z)

ln p(X | ^π , _µ _{, Σ) =} ln

^π

^{N (x}

^| ^µ

^{, Σ}

⁾

π _, _µ _{, Σ}

π _, _µ _{, Σ}

_∑

ln p(X | ^π , _µ , Σ) ^π

_∑