Introduction to Automatic Speech Recognition

(1)

Speech and Language Processing

Lecture 2

Maximum likelihood estimation and EM algorithm

Information and Communications Engineering Course

(2)

Lecture Plan (Shinozaki’s part)

1. 10/20 (remote)

Speech recognition based on GMM, HMM, and N-gram

2. 10/27 (remote)

Maximum likelihood estimation and EM algorithm

3. 11/3 (remote)

Bayesian network and Bayesian inference

4. 11/6 (@TAIST)

Variational inference and sampling

5. 11/7 (@TAIST)

Neural network based acoustic and language models

6. 11/7 (@TAIST)

Weighted finite state transducer (WFST) and speech decoding

(3)

Today’s Topic

• Answers for the previous exercises

• Brief review of probability theory

• Maximum likelihood estimation

(4)
(5)

Exercise 1.1 (Answer)

Suppose W is a vowel and O is an MFCC feature vector.

Suppose that P_AM(O|W) is an acoustic model and P_LM(W) is a language model. Obtain the vowel Ŵ that maximizes P(W|O) when the acoustic and language model log likelihoods are given as in the following table.

$$\hat{W} = \arg\max_{W \in \{a,\, i,\, u,\, e,\, o\}} P(W \mid O)$$

Vowel W        a       i       u       e       o
log(P(O|W))  -13.4   -10.5   -30.1   -15.2   -17.0
log(P(W))    -1.61   -2.30   -1.61   -1.39   -1.39

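Since P(W|O) ∝ P(O|W) P(W), the answer can be found by comparing log P(O|W) + log P(W) for each vowel. The following Python sketch (not part of the original slides; the variable names are illustrative) evaluates the table above:

```python
# Log likelihoods from the table above
log_p_o_given_w = {"a": -13.4, "i": -10.5, "u": -30.1, "e": -15.2, "o": -17.0}
log_p_w = {"a": -1.61, "i": -2.30, "u": -1.61, "e": -1.39, "o": -1.39}

# P(W|O) is proportional to P(O|W) P(W), so compare
# log P(O|W) + log P(W) and take the argmax over the vowels.
scores = {w: log_p_o_given_w[w] + log_p_w[w] for w in log_p_o_given_w}
w_hat = max(scores, key=scores.get)
print(scores)  # {'a': -15.01, 'i': -12.8, 'u': -31.71, 'e': -16.59, 'o': -18.39}
print(w_hat)   # 'i'
```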

(6)

Exercise 1.2 (Answer)

• The following table defines a bigram P(Word|Context)

Context\Word   today   is    sunny   End
Start           0.6    0.1    0.2    0.1
today           0.1    0.5    0.3    0.1
is              0.1    0.1    0.7    0.1
sunny           0.1    0.1    0.2    0.6

Example: P(“Start today is sunny End”)
= P(Start) × P(today|Start) × P(is|today) × P(sunny|is) × P(End|sunny)
= 1.0 × 0.6 × 0.5 × 0.7 × 0.6 = 0.126

*P(Start) = 1.0

(7)

Exercise 1.2 (Cont.) (Answer)

• Based on the bigram definition of the previous slide,

compute the probability of the following sentences

1) P(“Start today sunny today sunny End”)

= 0.6*0.3*0.1*0.3*0.6 = 0.00324

2) P(“Start today today sunny sunny End”)

= 0.6*0.1*0.3*0.2*0.6 = 0.00216
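A short Python sketch (not from the slides; the table layout and helper name are illustrative) that evaluates sentence probabilities under the bigram table of the previous slide:

```python
bigram = {
    "Start": {"today": 0.6, "is": 0.1, "sunny": 0.2, "End": 0.1},
    "today": {"today": 0.1, "is": 0.5, "sunny": 0.3, "End": 0.1},
    "is":    {"today": 0.1, "is": 0.1, "sunny": 0.7, "End": 0.1},
    "sunny": {"today": 0.1, "is": 0.1, "sunny": 0.2, "End": 0.6},
}

def sentence_prob(words):
    """Multiply P(w_t | w_{t-1}) over consecutive word pairs; P(Start) = 1.0."""
    p = 1.0
    for prev, cur in zip(words[:-1], words[1:]):
        p *= bigram[prev][cur]
    return p

print(sentence_prob("Start today sunny today sunny End".split()))  # ≈ 0.00324
print(sentence_prob("Start today today sunny sunny End".split()))  # ≈ 0.00216
```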

(8)
(9)

Probability Space

• Sample space (Ω)

• Set of all possible outcomes of an experiment

• Probability function (f(x))

• A function that maps each outcome to a probability

• Event (E)

• Subset of the sample space

$$f(x) \in [0, 1]\ \text{for all } x \in \Omega, \qquad \sum_{x \in \Omega} f(x) = 1, \qquad P(E) = \sum_{x \in E} f(x)$$

(10)

Random Variable

• A function that maps an outcome of an experiment

to a value

• Notation:

“P[X=x] = p” means “the probability that the random variable X takes the value x is p”

Example

Random variable \ Outcome (die)   1  2  3  4  5  6
X                                 1  2  3  4  5  6
Y                                 1  0  1  0  1  0
Z                                 0  0  1  1  1  1

X: The value of a die

(11)

Joint Probability

• Probability that two or more events occur jointly

Example:

X: the value of Dice 1
Y: the value of Dice 2

P(X = i, Y = j): the probability that the value of X is i and the value of Y is j

(12)

Conditional Probability

• Probability of an event given that another event

has occurred

Example: randomly pick two balls sequentially from a box containing 4 blue and 6 green balls.

X: the color of the first ball
Y: the color of the second ball

P(Y = blue | X = blue) = 3/9
P(Y = blue | X = green) = 4/9
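As a sanity check, the two conditional probabilities can be reproduced by enumerating all ordered draws. This is a minimal sketch, not part of the slides; the ball encoding is an assumption:

```python
from itertools import permutations

box = ["B"] * 4 + ["G"] * 6                # 4 blue, 6 green balls
pairs = list(permutations(range(10), 2))   # ordered draws without replacement

def cond_prob(second, first):
    """Relative frequency of the second ball's color given the first ball's color."""
    sel = [(i, j) for i, j in pairs if box[i] == first]
    hit = [(i, j) for i, j in sel if box[j] == second]
    return len(hit) / len(sel)

print(cond_prob("B", "B"))  # 3/9 ≈ 0.333
print(cond_prob("B", "G"))  # 4/9 ≈ 0.444
```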

(13)

Two Principal Rules

• Sum rule

• Summing the joint probability P(X,Y) over all possible values of Y gives the probability P(X)

P(X) is called the marginal probability

• Product rule

• Multiplying the probability P(X) by the conditional probability P(Y|X) gives the joint probability P(X,Y)

Sum rule:

$$P(X = x_i) = \sum_j P(X = x_i,\, Y = y_j)$$

Product rule:

$$P(X = x_i,\, Y = y_j) = P(Y = y_j \mid X = x_i)\, P(X = x_i)$$
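A small numerical check of both rules on a toy joint distribution (the numbers are made up for illustration, not from the slides):

```python
import numpy as np

p_xy = np.array([[0.10, 0.20, 0.10],   # rows: values of X
                 [0.05, 0.25, 0.30]])  # cols: values of Y

# Sum rule: the marginal P(X) is obtained by summing P(X, Y) over Y.
p_x = p_xy.sum(axis=1)                 # [0.40, 0.60]

# Product rule: P(X, Y) = P(Y | X) P(X).
p_y_given_x = p_xy / p_x[:, None]      # conditional distribution for each row
print(np.allclose(p_y_given_x * p_x[:, None], p_xy))  # True
```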

(14)

Bayes’ Theorem

• From the product rule, we obtain:

$$P(Y = y_j \mid X = x_i)\, P(X = x_i) = P(X = x_i \mid Y = y_j)\, P(Y = y_j) \quad \text{for all } x_i,\, y_j$$

If we denote the distributions as P(X) etc., we have:

$$P(Y \mid X) = \frac{P(X \mid Y)\, P(Y)}{P(X)}$$

Using the sum rule, it can be expressed as:

$$P(Y \mid X) = \frac{P(X \mid Y)\, P(Y)}{\sum_Y P(X \mid Y)\, P(Y)}$$

(15)

Independence

• If the joint distribution of two variables X and Y

factorizes into the product of the marginals, then X

and Y are said to be “independent”

If

$$P(X, Y) = P(X \mid Y)\, P(Y) = P(X)\, P(Y)$$

then X and Y are independent.

If

$$P(X, Y) = P(X \mid Y)\, P(Y) \ne P(X)\, P(Y)$$

then X and Y are not independent.

(16)

Probability Densities

• If the probability of a real-valued variable x falling

in the interval (x, x+δx) is given by p(x)δx when

δx→0, p(x) is called the probability density of x

(Figure: density p(x); the area p(x)δx over a small interval (x, x+δx) is a probability.)

$$p(x) \ge 0 \quad \text{and} \quad \int p(x)\, dx = 1$$

(17)

The Sum and Product Rules for Continuous Variables

Discrete:

$$P(X = x_i) = \sum_j P(X = x_i,\, Y = y_j), \qquad P(X = x_i,\, Y = y_j) = P(Y = y_j \mid X = x_i)\, P(X = x_i)$$

Continuous:

$$p(x) = \int p(x, y)\, dy, \qquad p(x, y) = p(y \mid x)\, p(x)$$

(18)

Expectation

• Expectation of a function f(x) under a probability

distribution p(x) is denoted by E[f]

$$E[f] = \sum_x p(x)\, f(x) \quad \text{(x is discrete)}$$

$$E[f] = \int p(x)\, f(x)\, dx \quad \text{(x is continuous)}$$
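For instance, with a toy discrete distribution (not from the slides), the expectation is just the probability-weighted sum:

```python
# E[f] = sum_x p(x) f(x) for a discrete distribution p and f(x) = x**2
p = {1: 0.2, 2: 0.5, 3: 0.3}
f = lambda x: x ** 2
e_f = sum(p_x * f(x) for x, p_x in p.items())
print(e_f)  # 0.2*1 + 0.5*4 + 0.3*9 = 4.9
```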

(19)

Mean and Variance

• Mean

• Synonym of the expectation E[f(x)]

• Variance

• A measure of how much variability there is in f(x) around its mean value E[f(x)]:

$$\mathrm{var}[f] = E\left[\big(f(x) - E[f(x)]\big)^2\right] = E\left[f(x)^2\right] - E[f(x)]^2$$

• In particular, the variance of the variable x itself is:

$$\mathrm{var}[x] = E\left[\big(x - E[x]\big)^2\right]$$

(20)

Covariance

• Covariance

• The extent to which x and y vary together

$$\mathrm{cov}[x, y] = E_{x,y}\left[\big(x - E[x]\big)\big(y - E[y]\big)\right] = E_{x,y}[xy] - E[x]\, E[y]$$

(21)

Entropy

• Amount of randomness in the random variable

$$H[x] = -E\left[\log p(x)\right] = -\sum_x p(x)\log p(x)$$

Example

x      0    1
p(x)   0.5  0.5

$$H[x] = -0.5\log(0.5) - 0.5\log(0.5) = 0.693$$

x      0    1
p(x)   0.1  0.9
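The entropies of the two example distributions can be computed directly from the definition; this sketch (not from the slides) uses the natural logarithm, matching the 0.693 value above:

```python
import math

def entropy(p):
    """H[x] = -sum_x p(x) log p(x), skipping zero-probability outcomes."""
    return -sum(px * math.log(px) for px in p if px > 0)

print(entropy([0.5, 0.5]))  # ≈ 0.693 (the example above)
print(entropy([0.1, 0.9]))  # ≈ 0.325 (less random, so lower entropy)
```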

(22)

Relative Entropy

• A measure of dissimilarity of two distributions p and q

• Also called the Kullback-Leibler (KL) divergence

• KL(p||q) is nonnegative.

KL(p||q)=0 if and only if p(x)=q(x)

$$KL(p\,\|\,q) = E_p\left[\log\frac{p(x)}{q(x)}\right] = -\int p(x)\log\frac{q(x)}{p(x)}\, dx$$
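For discrete distributions the integral becomes a sum. The following sketch (toy distributions, not from the slides) illustrates that KL(p||q) is zero for identical distributions and positive otherwise:

```python
import math

def kl(p, q):
    """KL(p||q) = sum_x p(x) log(p(x)/q(x)) for discrete distributions."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.5]
q = [0.1, 0.9]
print(kl(p, p))  # 0.0
print(kl(p, q))  # ≈ 0.511 (> 0, and not symmetric: kl(q, p) differs)
```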

(23)
(24)

Maximum Likelihood (ML) Method

• Assume that we have a set of samples D={x1, x2, … xn}

drawn from a distribution p(x|θ) with unknown

parameters θ, and we want to estimate θ

• The maximum likelihood method estimates the parameters by maximizing the likelihood p(D|θ)

$$\hat{\theta} = \arg\max_\theta p(D \mid \theta) = \arg\max_\theta \prod_{i=1}^{n} p(x_i \mid \theta)$$

(25)

Bernoulli Distribution

• Probability distribution of a binary random variable

which takes value 1 with probability μ and value 0

with probability 1-μ

x         0      1
Bern(x)   1-μ    μ

$$\mathrm{Bern}(x \mid \mu) = \mu^{x}\,(1-\mu)^{1-x}$$

(26)

ML Estimation for Bernoulli Distribution

• Parameter θ in this case is: μ

• A training sample x_i takes the value 0 or 1

$$\hat{\mu} = \arg\max_\mu p(D \mid \mu)
= \arg\max_\mu \prod_{i=1}^{n}\mu^{x_i}(1-\mu)^{1-x_i}
= \arg\max_\mu \sum_i \log\left(\mu^{x_i}(1-\mu)^{1-x_i}\right)
= \arg\max_\mu \sum_i \left[x_i\log\mu + (1-x_i)\log(1-\mu)\right]$$

$$\frac{\partial}{\partial\mu}\sum_i\left[x_i\log\mu + (1-x_i)\log(1-\mu)\right] = 0
\;\;\Rightarrow\;\; \hat{\mu} = \frac{1}{n}\sum_i x_i$$

Taking the log does not change the maximizer and makes the equation a bit easier to handle.

(27)

Example

• You tossed a biased coin 100 times and got 62 heads and 38 tails. Estimate the probability μ of getting heads with this coin by the ML method.

$$\hat{\mu} = \frac{1}{100}\sum_{i=1}^{100} x_i = \frac{1}{100}\left(1 \times 62 + 0 \times 38\right) = 0.62$$
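Equivalently, in code the ML estimate is just the sample mean of the 0/1 outcomes (a minimal sketch, not from the slides):

```python
import numpy as np

tosses = np.array([1] * 62 + [0] * 38)  # 62 heads (1), 38 tails (0)
mu_hat = tosses.mean()                  # ML estimate of the Bernoulli parameter
print(mu_hat)                           # 0.62
```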

(28)

Categorical Distribution

• As a generalization of the Bernoulli distribution, let's consider a discrete random variable X that takes K values

X                     1           2        …        K
1-of-K <x1,…,xK>   <1,0,…,0>   <0,1,…,0>   …   <0,0,…,1>
p(X)                  μ1          μ2       …        μK

$$p(\mathbf{x} \mid \boldsymbol{\mu}) = \prod_{k=1}^{K}\mu_k^{x_k}$$

(29)

ML for Categorical Distribution

• Parameter θ in this case is: μ = {μ1, μ2, …, μK}

• A training sample x_i is a vector in 1-of-K representation. When x_i represents the k-th value, x_{i,k} = 1 and x_{i,k'} = 0 for k' ≠ k

$$\hat{\boldsymbol{\mu}} = \arg\max_{\boldsymbol{\mu}} p(D \mid \boldsymbol{\mu})
= \arg\max_{\boldsymbol{\mu}} \prod_{i=1}^{n}\prod_{k=1}^{K}\mu_k^{x_{i,k}}
= \arg\max_{\boldsymbol{\mu}} \prod_{k=1}^{K}\mu_k^{m_k},
\qquad \text{subject to } \sum_{k=1}^{K}\mu_k = 1$$

where $m_k = \sum_{i=1}^{n} x_{i,k}$ is the number of occurrences of the k-th value and n is the number of samples.

(30)

Solution

$$\hat{\boldsymbol{\mu}} = \arg\max_{\boldsymbol{\mu}}\left[\sum_{k=1}^{K} m_k\log\mu_k + \lambda\left(1 - \sum_{k=1}^{K}\mu_k\right)\right]$$

(31)

Exercise 2.1

• Show the derivation process of obtaining

$$\mu_k = \frac{m_k}{n}$$

for the categorical distribution by maximizing

$$L(\boldsymbol{\mu}, \lambda) = \sum_{k=1}^{K} m_k\log\mu_k + \lambda\left(1 - \sum_{k=1}^{K}\mu_k\right)$$

(32)

ML Estimation for Gaussian Distribution

Gaussian distribution:

$$N(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right),
\qquad \int_{-\infty}^{\infty} N(x \mid \mu, \sigma^2)\, dx = 1$$

Parameter θ in this case is: {μ, σ}. A training sample x_i is a real value.

$$\hat{\theta} = \arg\max_\theta \prod_{i=1}^{n} N(x_i \mid \theta)
= \arg\max_\theta \sum_{i=1}^{n}\log N(x_i \mid \theta)$$
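For reference, the ML estimates have the well-known closed form: the sample mean and the (biased, divide-by-n) sample variance. A minimal sketch with toy data (not from the slides):

```python
import numpy as np

x = np.array([2.1, 1.7, 2.5, 1.9, 2.3])   # toy real-valued training samples
mu_hat = x.mean()                          # ML estimate of the mean
sigma2_hat = ((x - mu_hat) ** 2).mean()    # ML estimate of the variance (divides by n)
print(mu_hat, sigma2_hat)
```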

(33)

Exercise 2.2

(34)

ML Estimation for GMM with Known Index

• Let’s consider 2-mix GMM (component index m is 1 or 2)

• A training sample x_i = <o_i, m_i> is a pair of an observation o_i and the index m_i of the Gaussian component it belongs to, where i is the sample index

$$L(w_1,\mu_1,\sigma_1,w_2,\mu_2,\sigma_2)
= \sum_{i=1}^{n}\log\big(w_{m_i}\, N(o_i\mid\mu_{m_i},\sigma_{m_i})\big)
= \sum_{i=1}^{n}\log w_{m_i}
+ \sum_{i:\,m_i=1}\log N(o_i\mid\mu_1,\sigma_1)
+ \sum_{i:\,m_i=2}\log N(o_i\mid\mu_2,\sigma_2),
\qquad w_1 + w_2 = 1.0$$

$$(\hat{w}_1,\hat{\mu}_1,\hat{\sigma}_1,\hat{w}_2,\hat{\mu}_2,\hat{\sigma}_2)
= \arg\max_{w_1,\mu_1,\sigma_1,w_2,\mu_2,\sigma_2} L(w_1,\mu_1,\sigma_1,w_2,\mu_2,\sigma_2)$$

Since the terms do not share parameters, the maximization decomposes:

$$\hat{w} = \arg\max_{w}\sum_{i=1}^{n}\log w_{m_i},\qquad
(\hat{\mu}_1,\hat{\sigma}_1) = \arg\max_{\mu_1,\sigma_1}\sum_{i:\,m_i=1}\log N(o_i\mid\mu_1,\sigma_1),\qquad
(\hat{\mu}_2,\hat{\sigma}_2) = \arg\max_{\mu_2,\sigma_2}\sum_{i:\,m_i=2}\log N(o_i\mid\mu_2,\sigma_2)$$

(35)

ML Estimation for HMM with Known Path

• Both observation and state sequences are given

• Transition probability: the transition probability from state i to j is obtained by dividing the number of transitions from state i to j by the number of transitions from state i

• Emission probability: the ML estimate of the emission distribution is based on the observations assigned to the state

(Figure: a left-to-right HMM Λ with states s0, s1, s2, s3.)

Example: (Observation is a binary value taking ‘a’ or ‘b’)

When O=(a,b,a,b,b) and K=(s0,s1,s1,s2,s2,s2,s3):

Transition probability
p(s1→s1) = 1/2, p(s1→s2) = 1/2
p(s2→s2) = 2/3, p(s2→s3) = 1/3

Emission probability
p(a|s1) = 1/2, p(b|s1) = 1/2
p(a|s2) = 1/3, p(b|s2) = 2/3
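The counts in the example can be reproduced with a few lines of Python (a sketch, not from the slides; s0 and s3 are treated as non-emitting):

```python
from collections import Counter

O = ["a", "b", "a", "b", "b"]
K = ["s0", "s1", "s1", "s2", "s2", "s2", "s3"]

trans = Counter(zip(K[:-1], K[1:]))   # transition counts
emit = Counter(zip(K[1:-1], O))       # emitting states are K[1:-1]

# Normalize counts into probabilities
from_state = Counter(s for s, _ in trans.elements())
p_trans = {(s, t): c / from_state[s] for (s, t), c in trans.items()}
state_total = Counter(s for s, _ in emit.elements())
p_emit = {(s, o): c / state_total[s] for (s, o), c in emit.items()}

print(p_trans)  # p(s1->s1)=0.5, p(s1->s2)=0.5, p(s2->s2)=2/3, p(s2->s3)=1/3, p(s0->s1)=1.0
print(p_emit)   # p(a|s1)=0.5, p(b|s1)=0.5, p(a|s2)=1/3, p(b|s2)=2/3
```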

(36)

Exercise 2.3

• Given training data D with n training samples D={x1, x2,…,xn}, obtain the ML estimate for a GMM with M mixtures.

You can assume the variance is 1 for simplicity.

(37)

Exercise 2.3 (Cont.)

$$L = \sum_{i=1}^{N} \log\left(\sum_{m=1}^{M} w_m\,\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{(X_i - \mu_m)^2}{2}\right)\right)$$

$$\frac{\partial L}{\partial \mu_m} = \;???$$

(38)
(39)

Hidden Variable

• The mixture weight w_m of a GMM can be regarded as a probability P(m), since the weights are non-negative and sum to one

• Then the GMM can be seen as a marginal probability of P(x, m) = P(m) N(x | μ_m, σ_m)

• In general, when a probability model is defined as a marginal probability, the summed-out variable is not seen from the outside, and it is called a hidden variable

• The mixture component index m of a GMM is a hidden variable

$$P_{GMM}(x) = \sum_{m=1}^{M} w_m\, N(x\mid\mu_m,\sigma_m) = \sum_{m=1}^{M} P(m)\, N(x\mid\mu_m,\sigma_m)$$

(40)

ML Estimation for Models with Hidden Variables

• The summation Σ for the marginalization is often

problematic for optimization

$$\hat{\Theta} = \arg\max_\Theta P(D\mid\Theta)
= \arg\max_\Theta \sum_i \log P(x_i\mid\Theta)
= \arg\max_\Theta \sum_i \log\sum_{h} P(x_i, h\mid\Theta)$$

(41)

Jensen Lower Bound of Likelihood

Let X be an observed variable and H a hidden variable, with

$$P(X\mid\Theta) = \sum_{H} P(X, H\mid\Theta)$$

For an arbitrary distribution q(H), by Jensen's inequality:

$$\log P(X\mid\Theta)
= \log\sum_{H} P(X, H\mid\Theta)
= \log\sum_{H} q(H)\,\frac{P(X, H\mid\Theta)}{q(H)}
\ge \sum_{H} q(H)\log\frac{P(X, H\mid\Theta)}{q(H)}$$

This inequality holds for arbitrary q and arbitrary Θ.

(42)

Exercise 2.4

• Assume you have an initial model parameter Θ0. Prove that if you take q(H) = q0(H) = P(H | X, Θ0), then the lower bound J(q0, Θ0) is equal to the log likelihood log P(X | Θ0):

$$\log P(X\mid\Theta_0) = J\big(P(H\mid X,\Theta_0),\,\Theta_0\big)$$

(43)

Maximization of the Lower Bound

• Assume we have an initial model parameter Θ0 .

$$\Theta_1 = \arg\max_{\Theta} J\big(P(H\mid X,\Theta_0),\,\Theta\big)$$

Then

$$\log P(X\mid\Theta_0) \le \log P(X\mid\Theta_1)$$

Because

$$\log P(X\mid\Theta_0) = J\big(P(H\mid X,\Theta_0),\,\Theta_0\big)
\le J\big(P(H\mid X,\Theta_0),\,\Theta_1\big)
\le \log P(X\mid\Theta_1),$$

using $J\big(P(H\mid X,\Theta_0),\,\Theta\big) \le \log P(X\mid\Theta)$ for ∀Θ.

By maximizing the lower bound J with respect to Θ, we can find Θ1.

(44)

Relation Between the Likelihood and the Lower Bound

$$f(\Theta) \equiv \log P(X\mid\Theta), \qquad g(\Theta) \equiv J\big(P(H\mid X,\Theta_0),\,\Theta\big)$$

(Figure: g(Θ) is a lower bound of f(Θ) that touches it at Θ = Θ0; Θ1 is the maximizer of g.)

(45)

Q-function

Let

$$Q(\Theta, \Theta_0) \equiv \sum_{H} P(H\mid X,\Theta_0)\log P(X, H\mid\Theta)$$

Then

$$\arg\max_\Theta J\big(P(H\mid X,\Theta_0),\,\Theta\big)
= \arg\max_\Theta \sum_{H} P(H\mid X,\Theta_0)\log\frac{P(X, H\mid\Theta)}{P(H\mid X,\Theta_0)}
= \arg\max_\Theta \sum_{H} P(H\mid X,\Theta_0)\log P(X, H\mid\Theta)
= \arg\max_\Theta Q(\Theta, \Theta_0)$$

Finding the argmax of J is equal to finding the argmax of the Q-function.

(46)

Expectation Maximization (EM) Algorithm

1. Prepare an initial parameter (or parameter set) Θ0

2. Given a parameter Θt, obtain the Q-function Q(Θ, Θt), which is the expectation of the log joint probability log P(X, H | Θ) with respect to P(H | X, Θt)

[E-step]

3. Maximize the Q-function Q(Θ, Θt) and obtain an updated parameter Θt+1

[M-step]

(47)

The Process of the EM Algorithm

Θ = Θ0 : Initial model parameters. May be just a random number.

Θ1 = arg max_Θ Q(Θ, Θ0) : Update the parameters.

Θ2 = arg max_Θ Q(Θ, Θ1) : Update the parameters.

Θ3 = arg max_Θ Q(Θ, Θ2) : Update the parameters.

(48)

EM for GMM

• Let’s consider 2-mix GMM

• Assume a training data D with n training samples

D={o1, o2,…,on} , and an initial parameter set

Θ0={μ1, σ1, w1, μ2, σ2, w2} are given

• The posterior probability P(mi | D, Θ0) of the component index mi for the i-th training sample is:

$$P(m_i\mid D,\Theta_0) = P(m_i\mid o_i,\Theta_0)
= \frac{P(m_i, o_i\mid\Theta_0)}{P(o_i\mid\Theta_0)}
= \frac{w_{m_i}\, N(o_i\mid\mu_{m_i},\sigma_{m_i})}{\sum_{m=1}^{2} w_m\, N(o_i\mid\mu_m,\sigma_m)}$$

• The Q-function is:

$$Q(\Theta,\Theta_0) = \sum_{M=<m_1,\dots,m_n>} P(M\mid D,\Theta_0)\log P(D, M\mid\Theta)
= \sum_{i=1}^{n}\sum_{m_i=1}^{2} P(m_i\mid o_i,\Theta_0)\log P(o_i, m_i\mid\Theta)$$
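Putting the E-step and M-step together, one EM iteration for the 2-mix GMM might look like the following sketch (toy data, initialization, and variable names are assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
o = np.concatenate([rng.normal(-2, 1, 100), rng.normal(3, 1, 100)])  # toy data

w = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
sigma = np.array([1.0, 1.0])

def gauss(x, m, s):
    return np.exp(-(x - m) ** 2 / (2 * s ** 2)) / np.sqrt(2 * np.pi * s ** 2)

for _ in range(10):
    # E-step: P(m | o_i, Theta) is proportional to w_m N(o_i | mu_m, sigma_m)
    resp = np.stack([w[m] * gauss(o, mu[m], sigma[m]) for m in range(2)])
    resp /= resp.sum(axis=0, keepdims=True)

    # M-step: weighted ML re-estimation of the parameters
    nm = resp.sum(axis=1)
    w = nm / len(o)
    mu = (resp * o).sum(axis=1) / nm
    sigma = np.sqrt((resp * (o - mu[:, None]) ** 2).sum(axis=1) / nm)

print(w, mu, sigma)  # should approach the generating values
```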

(49)

Exercise 2.5

• Consider the 2-mix GMM of the previous page. Let γi(m) = P(m | oi, Θ0). Obtain the following:

$$\hat{\mu}_1 = \arg\max_{\mu_1} Q(\Theta,\Theta_0),\qquad
\hat{\sigma}_1 = \arg\max_{\sigma_1} Q(\Theta,\Theta_0),\qquad
\hat{w}_1 = \arg\max_{w_1} Q(\Theta,\Theta_0)$$

(50)

EM Estimation for HMM with Unknown Path

Initial HMM Λ (states s0, s1, s2, s3):
Transition probabilities: p(s0→s1)=1.0, p(s1→s1)=0.6, p(s1→s2)=0.4, p(s2→s2)=0.2, p(s2→s3)=0.8
Emission probabilities [p(a), p(b)]: s1: [0.7, 0.3], s2: [0.6, 0.4]

When O=(a,b,b), the possible paths are K1=(s0,s1,s1,s2,s3) and K2=(s0,s1,s2,s2,s3).

E-step (expectation step)

P(O, K1 | Λ) = 0.016128,  P(O, K2 | Λ) = 0.007168

Posterior probability:
P(K1 | O, Λ) = P(O, K1 | Λ) / P(O | Λ) = 0.016128 / (0.016128 + 0.007168) ≈ 0.7,  P(K2 | O, Λ) ≈ 0.3

Expectations of transitions:
n(s1→s1) = 1×0.7 + 0×0.3 = 0.7,  n(s1→s2) = 1×0.7 + 1×0.3 = 1
n(s2→s2) = 0×0.7 + 1×0.3 = 0.3,  n(s2→s3) = 1×0.7 + 1×0.3 = 1

Expectations of emissions:
n(a|s1) = 1×0.7 + 1×0.3 = 1,  n(b|s1) = 1×0.7 + 0×0.3 = 0.7
n(a|s2) = 0×0.7 + 0×0.3 = 0,  n(b|s2) = 1×0.7 + 2×0.3 = 1.3

M-step (maximization step)

New transition probabilities:
p(s1→s1) = 0.7/(0.7+1) = 0.41,  p(s1→s2) = 1/(0.7+1) = 0.59
p(s2→s2) = 0.3/(0.3+1) = 0.23,  p(s2→s3) = 1/(0.3+1) = 0.77

New emission probabilities:
p(a|s1) = 1/(1+0.7) = 0.59,  p(b|s1) = 0.7/(1+0.7) = 0.41
p(a|s2) = 0/(0+1.3) = 0.00,  p(b|s2) = 1.3/(0+1.3) = 1.00
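The E-step numbers above can be reproduced by multiplying the transition and emission probabilities along each path (a minimal sketch, not from the slides):

```python
trans = {("s0", "s1"): 1.0, ("s1", "s1"): 0.6, ("s1", "s2"): 0.4,
         ("s2", "s2"): 0.2, ("s2", "s3"): 0.8}
emit = {"s1": {"a": 0.7, "b": 0.3}, "s2": {"a": 0.6, "b": 0.4}}

def joint(O, K):
    """P(O, K | Lambda): product of transition and emission probabilities along the path."""
    p = 1.0
    for prev, cur in zip(K[:-1], K[1:]):
        p *= trans[(prev, cur)]
    for s, o in zip(K[1:-1], O):   # s0 and s3 do not emit
        p *= emit[s][o]
    return p

O = ["a", "b", "b"]
K1 = ["s0", "s1", "s1", "s2", "s3"]
K2 = ["s0", "s1", "s2", "s2", "s3"]
p1, p2 = joint(O, K1), joint(O, K2)
print(p1, p2)          # 0.016128, 0.007168
print(p1 / (p1 + p2))  # ≈ 0.69 (posterior of K1, rounded to 0.7 on the slide)
```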

(51)
(52)

Method of Lagrange Multiplier

Maximize f(X) subject to g(X)=0

Maximize f(X)-λg(X)

with respect to X and λ,

where λ is a new parameter

(53)

Jensen’s Inequality

• If f(x) is a concave function, the following inequality holds for an arbitrary probability distribution p_i over i

$$f\left(\sum_i p_i\, x_i\right) \ge \sum_i p_i\, f(x_i)$$

Example:

$$0.4\, f(x_1) + 0.6\, f(x_2) \le f(0.4\, x_1 + 0.6\, x_2)$$

(Figure: for a concave f(x), the weighted average of the function values lies below f evaluated at the weighted average.)
