Introduction to Automatic Speech Recognition

(1)

Speech and Language Processing

Lecture 2

Maximum likelihood estimation and EM algorithm

Information and Communications Engineering Course

(2)

Lecture Plan (Shinozaki’s part)

1. 10/20 (remote)

Speech recognition based on GMM, HMM, and N-gram

2. 10/27 (remote)

Maximum likelihood estimation and EM algorithm

3. 11/3 (remote)

Bayesian network and Bayesian inference 4. 11/6 (@TAIST)

Variational inference and sampling 5. 11/7 (@TAIST)

Neural network based acoustic and language models 6. 11/7 (@TAIST)

Weighted finite state transducer (WFST) and speech decoding

(3)

Today’s Topic

• Answers for the previous exercises

• Brief review of probability theory

• Maximum likelihood estimation

(4)

(5)

Exercise 1.1 (Answer)

Suppose W is a vowel and O is a MFCC feature vector.

Suppose that P_AM(O|W) is an acoustic model and P

LM(W) is a

language model. Obtain a vowel �� that maximizes P(W|O)

when the acoustic and language model log likelihoods are given as in the following table.

{ }

{

P

(

W

O

)

}

W

o e u i a W

|

max

arg

ˆ

, , , ,

∈

=

Vowel W a i u e o

log(P(O|W)) -13.4 -10.5 -30.1 -15.2 -17.0

log(P(W)) -1.61 -2.30 -1.61 -1.39 -1.39

Log(P(O|V))

(6)

Exercise 1.2(Answer)

• The following table defines a Bi-gram P(Word|Context)

C＼W today is sunny End

Start 0.6 0.1 0.2 0.1 today 0.1 0.5 0.3 0.1 is 0.1 0.1 0.7 0.1 sunny 0.1 0.1 0.2 0.6

Start today is sunny End

1.0x0.6x0.5x0.7x0.6=0.126

0.6 0.5 0.7 0.6

*P(Start)=1.0 Example ：

(7)

Exercise 1.2 (Cont.) (Answer)

• Based on the bigram definition of the previous slide,

compute the probability of the following sentences

1) P(“Start today sunny today sunny End”)

= 0.6*0.3*0.1*0.3*0.6 = 0.00324

2) P(“Start today today sunny sunny End”)

(8)

(9)

Probability Space

• Sample space (Ω)

• Set of all possible outcomes of an experiment

• Probability function (f(x))

• A function that maps each outcome to a probability

• Event (E)

• Subset of the sample space

( )

x ∈

[ ]

x∈Ω

f 0,1 for all

( )

=1

∑

_x_∈_Ω f x

( )

∑

= f x

E P( )

(10)

Random Variable

• A function that maps an outcome of an experiment

to a value

• Notation:

“P[X=x] = p” means “the probability of a random variable X takes a value x is p”

Example

Random Variable

1 2 3 4 5 6

X 1 2 3 4 5 6

Y 1 0 1 0 1 0

Z 0 0 1 1 1 1

X: The value of a die

(11)

Joint Probability

• Probability that more than one events jointly occur

Example

:

X

Dice 1

:

Y

Dice 2

:

)

,

(

X

i

Y

j

P

=

Probability that the value of X is i and

(12)

Conditional Probability

• Probability of an event given that another event

has occurred

Example

:

X

Color of the first ball

Randomly picks up two balls sequentially from a box containing 4 blue and 6 green balls

:

Y

Color of the second ball

9

3 )

|

(

9

4 )

|

(

=

blue

X

blue

Y

P

geen

X

blue

Y

(13)

Two Principal Rules

• Sum rule

• Summing joint probability P(X,Y) for all possible values of Y gives probability of P(X)

• P(X) is called the marginal probability

• Product rule

• Multiplying probability P(X) and joint probability P(Y|X)

is equal to joint probability P(X,Y)

∑

=

j

j i

i

P

X

x

Y

y

x

X

P

(

)

(

,

)

(

)

|

(

)

,

(

X

x

_i

Y

y

_j

P

Y

y

_j

X

x

_i

P

X

x

_i

(14)

Bayes’ Theorem

• From the product rule, we obtain:

i i i j j i i

j for x y

x X P y Y P y Y x X P x X y Y P , ) ( ) ( ) | ( ) | ( ∀ = = = = = = =

If we denote the distributions as P(X) etc., we have:

) ( ) ( ) | ( ) | ( X P Y P Y X P X Y P =

Using the sum rule, it can be expressed as:

(15)

Independence

• If the joint distribution of two variables X and Y

factorizes into the product of the marginals, then X

and Y are said to be “independent”

(

X Y

) ( ) ( ) ( )

P Y P X P Y P

Y X

P( , ) = | =

(

X Y

) ( ) ( ) ( )

P Y P X P Y P

Y X

P( , ) = | ≠

X and Y are independent

(16)

Probability Densities

• If the probability of a real-valued variable x falling

in the interval (x, x+δx) is given by p(x)δx when

δx→0, p(x) is called the probability density of x

x

δx

p(x)δx p(x)

p(x)

p(x)δx is probability

( )

≥ 0 and

∫

∞

( )

=1

∞

− p x dx

(17)

The Sum and The Product Rules For Continuous Variable

∑

=

j

j i

i

P

X

x

Y

y

x

X

P

(

)

(

,

)

(

)

|

(

)

,

(

X

x

_i

Y

y

_j

P

Y

y

_j

X

x

_i

P

X

x

_i

P

=

dy

y

x

p

x

p

(

)

=

∫

(

,

)

( ) ( )

y

x

p

x

p

y

x

(18)

Expectation

• Expectation of a function f(x) under a probability

distribution p(x) is denoted by E[f]

[ ]

( )

[ ]

f

p

x

f

( )

x

dx

E

x

f

x

p

f

E

x

∫

∑

=

)

(

)

(

(x is discrete)

(19)

Mean and Variance

• Mean

• Synonym of the expectation E[f(x)]

• Variance

• A measure of how much variability there is in f(x)

around its mean value E[f(x)]

• In particular, the variance of the variable x itself is:

[ ]

[

(

( )

[ ]

( )

)

2

]

[ ]

( )

2

[ ]

( )

2

var f ≡ E f x − E f x = E f x − E f x

[ ]

[

(

[ ]

)

2

]

(20)

Covariance

• Covariance

• The extent to which x and y vary together

[ ]

[

(

[ ]

)

(

[ ]

)

]

[ ] [ ] [ ]

xy

E

x

E

y

E

y

E

y

x

E

x

E

y

x

y x

−

=

−

≡

, ,

,

cov

(21)

Entropy

• Amount of randomness in the random variable

[ ]

x

E

[

( )

p

( )

x

]

p

( )

x

p

( )

x

H

x

log

=

−

∑

−

=

x 0 1

p(x) 0.5 0.5 Example

[ ]

693 . 0 ) 5 . 0 log( 5 . 0 ) 5 . 0 log( 5 . 0 = − − = x H

x 0 1

p(x) 0.1 0.9

(22)

Relative Entropy

• A measure of dissimilarity of two distributions p and q

• Also called as kullback-Leibler (KL) divergence

• KL(p||q) is nonnegative.

KL(p||q)=0 if and only if p(x)=q(x)

( )

_{( )}

( )

_{( )}

dx

x p

x q x

p

x q

x p E

q p

KL _p

  

 −

=

   

 

  

 =

∫

log log ||

(23)

(24)

Maximum Likelihood (ML) Method

• Assume that we have a set of samples D={x₁, x₂, … x_n}

drawn from a distribution p(x|θ) with unknown

parameters θ, and we want to estimate θ

• Maximum likelihood method estimates the parameters

by maximizing likelihood p(D|θ)

( )

_∏

( )

=

n

i

x

p

D

p

1

|

max

arg

|

max

arg

ˆ

θ

θ θ

(25)

Bernoulli Distribution

• Probability distribution of a binary random variable

which takes value 1 with probability μ and value 0

with probability 1-μ

x 0 1

Bern(x) 1-μ μ

( )

_x

( )

x

(26)

ML Estimation for Bernoulli Distribution

• Parameter θ in this case is : μ

• Training sample �_� = 0 �� 1

( ) ( ) ( ) ( ) ( ) ( )       ₊ ₋ ₋ =     ₋ = − = =

∑

∏

− = − i i i i i x x n i x x x x D p i i i i µ µ µ µ µ µ µ µ µ µ µ µ 1 log 1 log max arg 1 log max arg 1 max arg | max arg ˆ 1 1 1

( ) ( ) ( )

∑

= =       ₊ ₋ ₋ ∂∂ i i i i i i x n x x 1 0 1 log 1 log µ µ µ µ

Taking log does not

change the problem and makes the equation a bit easier

(27)

Example

• You tossed a winded coin 100 times, and got 62

heads and 38 tails. Estimate the probability μ of

getting head with the coin by the ML method

(

1

62

0

38 )

0 .

62

100

1

100

1 =

×

+

×

=

∑

i

x

(28)

Categorical Distribution

• As a generalization of the Bernoulli Distribution,

lets consider a discrete random variable X that

takes K values

28

X 1 2 … K

1-of-K

<x₁,x₂,…,x_K> <1,0,..,0> <0,1,..,0> … <0,0,..,1>

p(X) μ₁ μ₂ … μ_K

( )

_∏

=

K

k

x k

k

p

1

|

μ

µ

(29)

ML for Categorical Distribution

• Parameter θ in this case is : μ ={μ_1, μ_2,…_,μ_K}

• Training sample �_� is a vector of 1-of-K representation.

When �_� represents �-th value, �_�_,_�=1, and �_�_,_�=0 for

� ≠ �

( )

1 max arg max arg | max arg ˆ 1 1 1 1 , = = = =

∑

∏

∏∏

= = = = K k k K k m k n i K k x k k k i D p

µ

μ μ μ μ μ

m_kis the number of the occurrence of k-th value, where n is the number of samples

∑

₌ = n i k i k x m 1 ,

(30)

Solution

      _      ₋ + =

∑

= = K k k k K k k m 1 1 1 log max arg

ˆ

µ

λ

µ

(31)

Exercise 2.1

• Show the derivation process of obtaining

( )



  



 ₋

+

=

∑

= =

K

k

k k

K

k

m L

1 1

1 log

,λ µ λ µ

μ

n m_k

k =

µ

for the categorical distribution by maximizing

(32)

ML Estimation for Gaussian Distribution

(

)

( )      ₋ ₋ = 2 2 2 2 2 1 exp 2 1 , | µ σ πσ σ µ x x N

(

)

∫

−∞∞ | , =1

2

dx x

N µ σ

Parameter θ in this case is : {μ, σ} Training sample �_� is a real value

( )

_∑

(

( )

)

∏

₌ = ₌ = n i i n i

i N x

x N 1 1 | log max arg | max arg

ˆ θ θ

θ

θ θ

Gaussian distribution:

(33)

Exercise 2.2

(34)

ML Estimation for GMM with Known Index

• Let’s consider 2-mix GMM (component index m is 1 or 2)

• A training sample �_� =<�_�, �_�> is a pair of an observation �_� and an index of Gaussian component �_�, where i is a sample index

( )

(

)

(

)

( )

log( ( | , )) log( ( | , )), 1.0 log , | log , | log , , , , , 2 1 2 | 2 2 1 | 1 1 1 1 1 2 2 2 1 1 1 = + + + = = =

∑

∏

= = = = = w w o N o N w o N w o N w w w L i i i i i i i i i m i i m i i n i m n i m m i m n i m m i m σ µ σ µ σ µ σ µ σ µ σ µ

(

1 1 1 2 2 2

)

, , , , , , , , , , max arg 2 2 2 1 1 1 w w L w

w µ σ µ σ µ σ

σ µ

( )

( ) ( ) ( ) ( )

∑

= = = 2 | 2 2 , 1 | 1 1 , 1 , , | log max arg , | log max arg log max arg 2 2 1 1 2 1 i i i m i i m i i n i m w w o N o N w σ µ σ µ σ µ σ µ

(35)

ML Estimation for HMM with Known Path

• Both observation and state sequences are given

• Transition probability：Transition probability from state i to j is obtained

by dividing the number of transitions from state i to j by the number of transition from state i

• Emission probability：ML estimate of the emission distribution based on

the observations assigned to the state

s

₁

s

₂ s₃

s₀

HMM: Λ

Example： (Observation is a binary value taking ‘a’ or ‘b’)

When O=(a,b,a,b,b) and K=(s₀,s₁,s₁,s₂,s₂,s₂,s₃) Transition probability

p(s₁→s₁) =1/2, p(s₁→s₂)=1/2 p(s₂→s₂) =2/3, p(s₂→s₃)=1/3 Emission probability

(36)

Exercise 2.3

• Given a training data D with n training samples

D={x₁, x₂,…,x_n}, obtain ML estimation for GMM

with M mixtures.

You can assume the variance is 1 for simplicity.

(37)

???

Exercise 2.3 (Cont.)

=

∂

m

L

µ

(

)

    

  

    

  

  

 ₋

−

=

∑

= =

M m

m i

m N

i

X w

L

1

2

1 2

exp 2

1

log µ

(38)

(39)

Hidden Variable

• The mixture weight �_�of GMM can be regarded as a

probability �(�) since it is non-negative and sum to one

• Then, the GMM can be seen as a marginal probability of

� �, � = � � �(�|�_�, σ_�)

• In general, when a probability model is defined as a

marginal probability, the summed-out variable is not

seen from the outside, and it is called a hidden variable

• The mixture weight of GMM is a hidden variable

( )

_∑

(

)

_∑

( ) (

)

=

= =

= M

m

m m

M m

m m

mN x P m N x

w x

GMM

1 1

, |

,

(40)

ML Estimation for Models with Hidden Variables

• The summation Σ for the marginalization is often

problematic for optimization

(

)

[

]

(

)

(

)

_



 _Θ

=

  

 _Θ

=

Θ =

Θ

∑ ∑

∑

Θ Θ

Θ

i h

i i

i

h x P x P D P

| , log

max arg

| log

max arg

| max

arg ˆ

(41)

Jensen Lower Bound of Likelihood

(

)

(

)

( ) (

_{( )}

)

( )

(

_{( )}

)

H q

H X P H

q

H q

H X P H

q

H X P X

P

H

H H

Θ ≥

Θ =

Θ

∑

| ,

log

| ,

log

| ,

log |

log

For arbitrary �(�)

By Jensen’s inequality

( )

P

(

X H Θ

)

≡

Θ

∑

, |

This inequality holds for arbitrary � and arbitrary Θ Let � be an observed variable, H be a hidden variable,

(42)

Exercise 2.4

• Assume you have an initial model parameter Θ₀.

Prove that if you take � � = �₀ � = �(�|�, Θ₀),

then the lower bound �(�₀, Θ₀) is equal to the log likelihood ��(�|Θ₀)

(

| 0

)

(

| , 0

)

, 0

)

(43)

Maximization of the Lower Bound

• Assume we have an initial model parameter Θ₀ .

(

)

(

Θ Θ

)

= Θ

Θmax | , ,

arg ₀

1 J P H X

(

| 0

)

log

(

| 1

)

log P X Θ ≤ P X Θ

(

| 0

)

(

| , 0

)

, 0

)

log P X Θ = J P H X Θ Θ

Because: _J

(

_P

(

_H _| _X _,Θ

)

_,Θ

)

≤ _log _P

(

_X _| Θ

)

_for ∀Θ

0

(

)

(

P H | X,Θ0 ,Θ0

)

≤ J

(

P

(

H | X,Θ0

)

,Θ1

)

J

By maximizing the lower bound J with respective to Θ, we can find Θ₁

(44)

Relation Between the Likelihood and the Lower Bound

( )

(

)

( )

Θ ≡

(

Θ

)

Θ

)

Θ ≡

Θ

, ,

| | log

0 X

H P J g

X P f

1

Θ 0

Θ

ｆ

g

0 1 = Θ

Θ

ｆ

(45)

Q-function

(

Θ,Θ₀

)

≡

∑

P

(

H | X ,Θ₀

)

log P

(

X , H | Θ

)

Q H Let ( ) ( ) ( ) ₍( )₎ ( ) ( ) ( ) ( )

( 0)

0 0 0 0 0 0 , max arg | , log , | | , log , | max arg , | | , log , | max arg , , | max arg Θ Θ =     

 _Θ _Θ ₋ _Θ _Θ

= ΘΘ Θ = Θ Θ Θ Θ Θ Θ

∑

Q H X P X H P H X P X H P X H P H X P X H P X H P J H H H

Finding the argmax of J is equal to finding the argmax of

(46)

Expectation Maximization (EM) Algorithm

1. Prepare an initial parameter (or parameter set) Θ₀

2. Given a parameter Θ_t , obtain a Q-function

Q(Θ, Θt), which is an expectation of the log joint

probability log�(�, �|Θ) with � � �, Θ_t

[E-step]

3. Maximizing the Q-function Q(Θ, Θ_t) and obtain an

updated parameter Θt+1

[M-step]

(47)

The Process of the EM Algorithm

(

)



 _Θ

=

Θ

∑

0

Θ Initial model parameters.

May be just a random number.

( )

[

0

]

1 = arg max Θ,Θ

Θ

Θ Q

Update the parameters

( )

[

1

]

2 = arg max Θ,Θ

Θ

Θ Q

(

)

[

2

]

3 = arg max Θ,Θ Θ

Θ Q

(48)

EM for GMM

• Let’s consider 2-mix GMM

• Assume a training data D with n training samples

D={o₁, o₂,…,o_n} , and an initial parameter set

Θ₀={μ₁, σ₁, w₁, μ₂, σ₂, w₂} are given

• The posterior probability P(m_i|D,Θ₀) of the component index m for the i-th training sample is :

( ) ( ) ( ) ( )

(

)

( )

∑

₌ = ₌ = = 2 1 2 1 0 0 0 0 , | , | | , | , , | , | m m m i m m m i m m i i i i i i o N w o N w o m P o m P o m P D m

P i i i

σ µ σ µ Θ Θ Θ Θ

(Θ Θ )=

∑

( Θ ) ( Θ)=

∑∑

( Θ ) ( Θ)

= = > =< | , log , | | , log , | , 1 2 1 0 , , 0 0 2 1 m o P o m P M D P D M P Q _i n i m i m m m

(49)

Exercise 2.5

• Consider the 2-mix GMM of the previous page.

Let γ_� � = �(�|��, Θ₀). Obtains the followings.

(

)

(

)

(

0

)

1

0 1

, max

arg ˆ

, max

arg ˆ

, max

arg ˆ

1 1 1

Θ Θ =

Q w

Q Q

w

σ µ

(50)

EM Estimation for HMM with Unknown Path

50

s₁ s₂ s₃

[p(a),p(b)] = S1: [0.7, 0.3] S2: [0.6, 0.4]

s₀

1.0

0.6 _0.2

0.4 _0.8

Initial HMM: Λ When O=(a,b,b), possible paths are:

K1(s₀,s₁,s₁,s₂,s₃) and K2(s₀,s₁,s₂,s₂,s₃)

a b b a b b

007168 . 0 ) | , ( 016128 . 0 ) | , ( 2 1 = Λ = Λ K P K P O O ( )

( ) ( ( )) 0.7, ( | , ) 0.3 007168 . 0 016128 . 0 016128 . 0 | , | , | | , ) , |

( 1 1 ₂

1 ≈ Λ ≈

+ = Λ Λ = ΛΛ = Λ

∑P K O P K O O K P O P O K P O K P K Posterior probability

Expectations of transitions

n(s₁→s₁)=1*0.7+0*0.3=0.7 n(s₁→s₂)=1*0.7+1*0.3=1 n(s₂→s₂)=0*0.7+1*0.3=0.3 n(s₂→s₃)=1*0.7+1*0.3=1

E-step (expectation step)

M-step (maximization step)

Expectations of emissions

n(a|s₁) =1*0.7+1*0.3=1 n(b|s₁) =1*0.7+0*0.3=0.7 n(a|s₂) =0*0.7+0*0.3=0 n(b|s₂) =1*0.7+2*0.3=1.3

New transition probabilities

p(s₁→s₁)=0.7/(0.7+1)=0.41 p(s₁→s₂)=1/(0.7+1)=0.59

p(s₂→s₂)=0.3/(0.3+1)=0.23 p(s₂→s₃)=1/(0.3+1)=0.77

New emission probabilites

p(a|s₁)=1/(1+0.7)=0.59 p(b|s₁)=0.7/(1+0.7)=0.41

p(a|s₂)=0/(0+1.3)=0.00 p(b|s₂)=1.3/(0+1.3)=1.00

(51)

(52)

Method of Lagrange Multiplier

Maximize f(X) subject to g(X)=0

Maximize f(X)-λg(X)

with respect to X and λ,

where λ is a new parameter

(53)

Jensen’s Inequality

• If f(x) is a concave function, the following equation

holds for arbitrary probability distribution of i

( ) ( )

( )













≤

∑

i

i i

i

x

i

p

f

x

f

i

p

Example：

( )1 0.6 ( ) (2 0.4 1 0.6 2)

4 .

0 f x + f x ≤ f x + x

Weighted average of function value

f(x)