Speech and Language Processing
Lecture 2
Maximum likelihood estimation and EM algorithm
Information and Communications Engineering Course
Lecture Plan (Shinozaki’s part)
1. 10/20 (remote)
Speech recognition based on GMM, HMM, and N-gram
2. 10/27 (remote)
Maximum likelihood estimation and EM algorithm
3. 11/3 (remote)
Bayesian network and Bayesian inference 4. 11/6 (@TAIST)
Variational inference and sampling 5. 11/7 (@TAIST)
Neural network based acoustic and language models 6. 11/7 (@TAIST)
Weighted finite state transducer (WFST) and speech decoding
Today’s Topic
• Answers for the previous exercises
• Brief review of probability theory
• Maximum likelihood estimation
Exercise 1.1 (Answer)
Suppose W is a vowel and O is a MFCC feature vector.
Suppose that PAM(O|W) is an acoustic model and P
LM(W) is a
language model. Obtain a vowel �� that maximizes P(W|O)
when the acoustic and language model log likelihoods are given as in the following table.
{ }
{
P
(
W
O
)
}
W
o e u i a W
|
max
arg
ˆ
, , , ,
∈
=
Vowel W a i u e o
log(P(O|W)) -13.4 -10.5 -30.1 -15.2 -17.0
log(P(W)) -1.61 -2.30 -1.61 -1.39 -1.39
Log(P(O|V))
Exercise 1.2(Answer)
• The following table defines a Bi-gram P(Word|Context)
C\W today is sunny End
Start 0.6 0.1 0.2 0.1 today 0.1 0.5 0.3 0.1 is 0.1 0.1 0.7 0.1 sunny 0.1 0.1 0.2 0.6
Start today is sunny End
1.0x0.6x0.5x0.7x0.6=0.126
0.6 0.5 0.7 0.6
*P(Start)=1.0 Example :
Exercise 1.2 (Cont.) (Answer)
• Based on the bigram definition of the previous slide,
compute the probability of the following sentences
1) P(“Start today sunny today sunny End”)
= 0.6*0.3*0.1*0.3*0.6 = 0.00324
2) P(“Start today today sunny sunny End”)
Probability Space
• Sample space (Ω)
• Set of all possible outcomes of an experiment
• Probability function (f(x))
• A function that maps each outcome to a probability
• Event (E)
• Subset of the sample space
( )
x ∈[ ]
x∈Ωf 0,1 for all
( )
=1∑
x∈Ω f x( )
∑
= f x
E P( )
Random Variable
• A function that maps an outcome of an experiment
to a value
• Notation:
“P[X=x] = p” means “the probability of a random variable X takes a value x is p”
Example
Random Variable
1 2 3 4 5 6
X 1 2 3 4 5 6
Y 1 0 1 0 1 0
Z 0 0 1 1 1 1
X: The value of a die
Joint Probability
• Probability that more than one events jointly occur
Example
:
X
Dice 1:
Y
Dice 2:
)
,
(
X
i
Y
j
P
=
=
Probability that the value of X is i andConditional Probability
• Probability of an event given that another event
has occurred
Example
:
X
Color of the first ballRandomly picks up two balls sequentially from a box containing 4 blue and 6 green balls
:
Y
Color of the second ball9
3
)
|
(
9
4
)
|
(
=
=
=
=
=
=
blue
X
blue
Y
P
geen
X
blue
Y
Two Principal Rules
• Sum rule
• Summing joint probability P(X,Y) for all possible values of Y gives probability of P(X)
• P(X) is called the marginal probability
• Product rule
• Multiplying probability P(X) and joint probability P(Y|X)
is equal to joint probability P(X,Y)
∑
=
=
=
=
j
j i
i
P
X
x
Y
y
x
X
P
(
)
(
,
)
)
(
)
|
(
)
,
(
X
x
iY
y
jP
Y
y
jX
x
iP
X
x
iBayes’ Theorem
• From the product rule, we obtain:
i i i j j i i
j for x y
x X P y Y P y Y x X P x X y Y P , ) ( ) ( ) | ( ) | ( ∀ = = = = = = =
If we denote the distributions as P(X) etc., we have:
) ( ) ( ) | ( ) | ( X P Y P Y X P X Y P =
Using the sum rule, it can be expressed as:
Independence
• If the joint distribution of two variables X and Y
factorizes into the product of the marginals, then X
and Y are said to be “independent”
(
X Y) ( ) ( ) ( )
P Y P X P Y PY X
P( , ) = | =
(
X Y) ( ) ( ) ( )
P Y P X P Y PY X
P( , ) = | ≠
X and Y are independent
Probability Densities
• If the probability of a real-valued variable x falling
in the interval (x, x+δx) is given by p(x)δx when
δx→0, p(x) is called the probability density of x
x
δx
p(x)δx p(x)
p(x)
p(x)δx is probability
( )
≥ 0 and∫
∞( )
=1∞
− p x dx
The Sum and The Product Rules For Continuous Variable
∑
=
=
=
=
j
j i
i
P
X
x
Y
y
x
X
P
(
)
(
,
)
)
(
)
|
(
)
,
(
X
x
iY
y
jP
Y
y
jX
x
iP
X
x
iP
=
=
=
=
=
=
dy
y
x
p
x
p
(
)
=
∫
(
,
)
( ) ( )
y
x
p
x
p
y
x
Expectation
• Expectation of a function f(x) under a probability
distribution p(x) is denoted by E[f]
[ ]
( )
[ ]
f
p
x
f
( )
x
dx
E
x
f
x
p
f
E
x
∫
∑
=
=
)
(
)
(
(x is discrete)Mean and Variance
• Mean
• Synonym of the expectation E[f(x)]
• Variance
• A measure of how much variability there is in f(x)
around its mean value E[f(x)]
• In particular, the variance of the variable x itself is:
[ ]
[
(
( )
[ ]
( )
)
2]
[ ]
( )
2[ ]
( )
2var f ≡ E f x − E f x = E f x − E f x
[ ]
[
(
[ ]
)
2]
Covariance
• Covariance
• The extent to which x and y vary together
[ ]
[
(
[ ]
)
(
[ ]
)
]
[ ] [ ] [ ]
xy
E
x
E
y
E
y
E
y
x
E
x
E
y
x
y x
y x
−
=
−
−
≡
, ,
,
cov
Entropy
• Amount of randomness in the random variable
[ ]
x
E
[
( )
p
( )
x
]
p
( )
x
p
( )
x
H
x
log
log
=
−
∑
−
=
x 0 1
p(x) 0.5 0.5 Example
[ ]
693 . 0 ) 5 . 0 log( 5 . 0 ) 5 . 0 log( 5 . 0 = − − = x Hx 0 1
p(x) 0.1 0.9
Relative Entropy
• A measure of dissimilarity of two distributions p and q
• Also called as kullback-Leibler (KL) divergence
• KL(p||q) is nonnegative.
KL(p||q)=0 if and only if p(x)=q(x)
( )
( )
( )
( )
( )
( )
dxx p
x q x
p
x q
x p E
q p
KL p
−
=
=
∫
log log ||Maximum Likelihood (ML) Method
• Assume that we have a set of samples D={x1, x2, … xn}
drawn from a distribution p(x|θ) with unknown
parameters θ, and we want to estimate θ
• Maximum likelihood method estimates the parameters
by maximizing likelihood p(D|θ)
( )
∏
( )
=
=
=
ni
i
x
p
D
p
1
|
max
arg
|
max
arg
ˆ
θ
θ
θ
θ θ
Bernoulli Distribution
• Probability distribution of a binary random variable
which takes value 1 with probability μ and value 0
with probability 1-μ
x 0 1
Bern(x) 1-μ μ
( )
x( )
xx
ML Estimation for Bernoulli Distribution
• Parameter θ in this case is : μ
• Training sample �� = 0 �� 1
( ) ( ) ( ) ( ) ( ) ( ) + − − = − = − = =
∑
∑
∏
∏
− = − i i i i i x x n i x x x x D p i i i i µ µ µ µ µ µ µ µ µ µ µ µ 1 log 1 log max arg 1 log max arg 1 max arg | max arg ˆ 1 1 1( ) ( ) ( )
∑
∑
∑
= = + − − ∂∂ i i i i i i x n x x 1 0 1 log 1 log µ µ µ µTaking log does not
change the problem and makes the equation a bit easier
Example
• You tossed a winded coin 100 times, and got 62
heads and 38 tails. Estimate the probability μ of
getting head with the coin by the ML method
(
1
62
0
38
)
0
.
62
100
1
100
1
=
×
+
×
=
=
∑
i
i
x
Categorical Distribution
• As a generalization of the Bernoulli Distribution,
lets consider a discrete random variable X that
takes K values
28
X 1 2 … K
1-of-K
<x1,x2,…,xK> <1,0,..,0> <0,1,..,0> … <0,0,..,1>
p(X) μ1 μ2 … μK
( )
∏
=
=
Kk
x k
k
p
1
|
μ
µ
ML for Categorical Distribution
• Parameter θ in this case is : μ ={μ1, μ2, …, μK}
• Training sample �� is a vector of 1-of-K representation.
When �� represents �-th value, ��,�=1, and ��,�=0 for
� ≠ �
( )
1 max arg max arg | max arg ˆ 1 1 1 1 , = = = =∑
∏
∏∏
= = = = K k k K k m k n i K k x k k k i D pµ
µ
µ
μ μ μ μ μmkis the number of the occurrence of k-th value, where n is the number of samples
∑
= = n i k i k x m 1 ,Solution
− + =∑
∑
= = K k k k K k k m 1 1 1 log max argˆ
µ
λ
µ
Exercise 2.1
• Show the derivation process of obtaining
( )
−
+
=
∑
∑
= =
K
k
k k
K
k
k
m L
1 1
1 log
,λ µ λ µ
μ
n mk
k =
µ
for the categorical distribution by maximizing
ML Estimation for Gaussian Distribution
(
)
( ) − − = 2 2 2 2 2 1 exp 2 1 , | µ σ πσ σ µ x x N(
)
∫
−∞∞ | , =12
dx x
N µ σ
Parameter θ in this case is : {μ, σ} Training sample �� is a real value
( )
∑
(
( )
)
∏
= = = = n i i n ii N x
x N 1 1 | log max arg | max arg
ˆ θ θ
θ
θ θ
Gaussian distribution:
Exercise 2.2
ML Estimation for GMM with Known Index
• Let’s consider 2-mix GMM (component index m is 1 or 2)
• A training sample �� =<��, ��> is a pair of an observation �� and an index of Gaussian component ��, where i is a sample index
( )
(
)
(
(
)
)
( )
log( ( | , )) log( ( | , )), 1.0 log , | log , | log , , , , , 2 1 2 | 2 2 1 | 1 1 1 1 1 2 2 2 1 1 1 = + + + = = =∑
∑
∑
∑
∏
= = = = = w w o N o N w o N w o N w w w L i i i i i i i i i m i i m i i n i m n i m m i m n i m m i m σ µ σ µ σ µ σ µ σ µ σ µ(
1 1 1 2 2 2)
, , , , , , , , , , max arg 2 2 2 1 1 1 w w L w
w µ σ µ σ µ σ
σ µ
( )
( ) ( ) ( ) ( )∑
∑
∑
= = = 2 | 2 2 , 1 | 1 1 , 1 , , | log max arg , | log max arg log max arg 2 2 1 1 2 1 i i i m i i m i i n i m w w o N o N w σ µ σ µ σ µ σ µML Estimation for HMM with Known Path
• Both observation and state sequences are given
• Transition probability:Transition probability from state i to j is obtained
by dividing the number of transitions from state i to j by the number of transition from state i
• Emission probability:ML estimate of the emission distribution based on
the observations assigned to the state
s
1s
2 s3s0
HMM: Λ
Example: (Observation is a binary value taking ‘a’ or ‘b’)
When O=(a,b,a,b,b) and K=(s0,s1,s1,s2,s2,s2,s3) Transition probability
p(s1→s1) =1/2, p(s1→s2)=1/2 p(s2→s2) =2/3, p(s2→s3)=1/3 Emission probability
Exercise 2.3
• Given a training data D with n training samples
D={x1, x2,…,xn}, obtain ML estimation for GMM
with M mixtures.
You can assume the variance is 1 for simplicity.
???
Exercise 2.3 (Cont.)
=
∂
∂
mL
µ
(
)
−
−
=
∑
∑
= =
M m
m i
m N
i
X w
L
1
2
1 2
exp 2
1
log µ
Hidden Variable
• The mixture weight ��of GMM can be regarded as a
probability �(�) since it is non-negative and sum to one
• Then, the GMM can be seen as a marginal probability of
� �, � = � � �(�|��, σ�)
• In general, when a probability model is defined as a
marginal probability, the summed-out variable is not
seen from the outside, and it is called a hidden variable
• The mixture weight of GMM is a hidden variable
( )
∑
(
)
∑
( ) (
)
=
= =
= M
m
m m
M m
m m
mN x P m N x
w x
GMM
1 1
, |
,
ML Estimation for Models with Hidden Variables
• The summation Σ for the marginalization is often
problematic for optimization
(
)
[
]
(
)
(
)
Θ
=
Θ
=
Θ =
Θ
∑ ∑
∑
Θ Θ
Θ
i h
i i
i
h x P x P D P
| , log
max arg
| log
max arg
| max
arg ˆ
Jensen Lower Bound of Likelihood
(
)
(
)
( ) (
( )
)
( )
(
( )
)
H q
H X P H
q
H q
H X P H
q
H X P X
P
H
H H
Θ ≥
Θ =
Θ =
Θ
∑
∑
∑
| ,
log
| ,
log
| ,
log |
log
For arbitrary �(�)
By Jensen’s inequality
( )
( )
P(
X H Θ)
≡
Θ
∑
, |This inequality holds for arbitrary � and arbitrary Θ Let � be an observed variable, H be a hidden variable,
Exercise 2.4
• Assume you have an initial model parameter Θ0 .
Prove that if you take � � = �0 � = �(�|�, Θ0),
then the lower bound �(�0, Θ0) is equal to the log likelihood ����(�|Θ0)
(
| 0)
(
(
| , 0)
, 0)
Maximization of the Lower Bound
• Assume we have an initial model parameter Θ0 .
(
)
(
Θ Θ)
= Θ
Θmax | , ,
arg 0
1 J P H X
(
| 0)
log(
| 1)
log P X Θ ≤ P X Θ
(
| 0)
(
(
| , 0)
, 0)
log P X Θ = J P H X Θ Θ
Because: J
(
P(
H | X ,Θ)
,Θ)
≤ log P(
X | Θ)
for ∀Θ0
(
)
(
P H | X,Θ0 ,Θ0)
≤ J(
P(
H | X,Θ0)
,Θ1)
J
By maximizing the lower bound J with respective to Θ, we can find Θ1
Relation Between the Likelihood and the Lower Bound
( )
(
)
( )
Θ ≡(
(
Θ)
Θ)
Θ ≡
Θ
, ,
| | log
0 X
H P J g
X P f
1
Θ 0
Θ
f
g
0 1 = Θ
Θ
f
Q-function
(
Θ,Θ0)
≡∑
P(
H | X ,Θ0)
log P(
X , H | Θ)
Q H Let ( ) ( ) ( ) (( )) ( ) ( ) ( ) ( )
( 0)
0 0 0 0 0 0 , max arg | , log , | | , log , | max arg , | | , log , | max arg , , | max arg Θ Θ =
Θ Θ − Θ Θ
= ΘΘ Θ = Θ Θ Θ Θ Θ Θ
∑
∑
∑
Q H X P X H P H X P X H P X H P H X P X H P X H P J H H HFinding the argmax of J is equal to finding the argmax of
Expectation Maximization (EM) Algorithm
1. Prepare an initial parameter (or parameter set) Θ0
2. Given a parameter Θt , obtain a Q-function
Q(Θ, Θt), which is an expectation of the log joint
probability log�(�, �|Θ) with � � �, Θt
[E-step]
3. Maximizing the Q-function Q(Θ, Θt) and obtain an
updated parameter Θt+1
[M-step]
The Process of the EM Algorithm
(
)
Θ
=
Θ
∑
0
Θ Initial model parameters.
May be just a random number.
( )
[
0]
1 = arg max Θ,Θ
Θ
Θ Q
Update the parameters
( )
[
1]
2 = arg max Θ,Θ
Θ
Θ Q
Update the parameters
(
)
[
2]
3 = arg max Θ,Θ Θ
Θ Q
Update the parameters
EM for GMM
• Let’s consider 2-mix GMM
• Assume a training data D with n training samples
D={o1, o2,…,on} , and an initial parameter set
Θ0={μ1, σ1, w1, μ2, σ2, w2} are given
• The posterior probability P(mi|D,Θ0) of the component index m for the i-th training sample is :
( ) ( ) ( ) ( )
(
)
( )∑
∑
= = = = = 2 1 2 1 0 0 0 0 , | , | | , | , , | , | m m m i m m m i m m i i i i i i o N w o N w o m P o m P o m P D mP i i i
σ µ σ µ Θ Θ Θ Θ
(Θ Θ )=
∑
( Θ ) ( Θ)=∑∑
( Θ ) ( Θ)= = > =< | , log , | | , log , | , 1 2 1 0 , , 0 0 2 1 m o P o m P M D P D M P Q i n i m i m m m
Exercise 2.5
• Consider the 2-mix GMM of the previous page.
Let γ� � = �(�|��, Θ0). Obtains the followings.
(
)
(
)
(
0)
1
0 1
0 1
, max
arg ˆ
, max
arg ˆ
, max
arg ˆ
1 1 1
Θ Θ =
Θ Θ =
Θ Θ =
Q w
Q Q
w
σ µ
EM Estimation for HMM with Unknown Path
50
s1 s2 s3
[p(a),p(b)] = S1: [0.7, 0.3] S2: [0.6, 0.4]
s0
1.0
0.6 0.2
0.4 0.8
Initial HMM: Λ When O=(a,b,b), possible paths are:
K1(s0,s1,s1,s2,s3) and K2(s0,s1,s2,s2,s3)
a b b a b b
007168 . 0 ) | , ( 016128 . 0 ) | , ( 2 1 = Λ = Λ K P K P O O ( )
( ) ( ( )) 0.7, ( | , ) 0.3 007168 . 0 016128 . 0 016128 . 0 | , | , | | , ) , |
( 1 1 2
1 ≈ Λ ≈
+ = Λ Λ = ΛΛ = Λ
∑P K O P K O O K P O P O K P O K P K Posterior probability
Expectations of transitions
n(s1→s1)=1*0.7+0*0.3=0.7 n(s1→s2)=1*0.7+1*0.3=1 n(s2→s2)=0*0.7+1*0.3=0.3 n(s2→s3)=1*0.7+1*0.3=1
E-step (expectation step)
M-step (maximization step)
Expectations of emissions
n(a|s1) =1*0.7+1*0.3=1 n(b|s1) =1*0.7+0*0.3=0.7 n(a|s2) =0*0.7+0*0.3=0 n(b|s2) =1*0.7+2*0.3=1.3
New transition probabilities
p(s1→s1)=0.7/(0.7+1)=0.41 p(s1→s2)=1/(0.7+1)=0.59
p(s2→s2)=0.3/(0.3+1)=0.23 p(s2→s3)=1/(0.3+1)=0.77
New emission probabilites
p(a|s1)=1/(1+0.7)=0.59 p(b|s1)=0.7/(1+0.7)=0.41
p(a|s2)=0/(0+1.3)=0.00 p(b|s2)=1.3/(0+1.3)=1.00
Method of Lagrange Multiplier
Maximize f(X) subject to g(X)=0
Maximize f(X)-λg(X)
with respect to X and λ,
where λ is a new parameter
Jensen’s Inequality
• If f(x) is a concave function, the following equation
holds for arbitrary probability distribution of i
( ) ( )
( )
≤
∑
∑
ii i
i
x
i
p
f
x
f
i
p
Example:
( )1 0.6 ( ) (2 0.4 1 0.6 2)
4 .
0 f x + f x ≤ f x + x
Weighted average of function value
f(x)