Speech and Language Processing
Lecture 4
Variational inference and sampling
Information and Communications Engineering Course
Takahiro Shinozaki
Lecture Plan (Shinozaki’s part)
1. 10/20 (remote): Speech recognition based on GMM, HMM, and N-gram
2. 10/27 (remote): Maximum likelihood estimation and EM algorithm
3. 11/3 (remote): Bayesian network and Bayesian inference
4. 11/6 (@TAIST): Variational inference and sampling
5. 11/7 (@TAIST): Neural network based acoustic and language models
6. 11/7 (@TAIST)
Today’s Topic
• Answers for the previous exercises
• Variational inference
Exercise 3.1
• Is the directed graph a DAG?

[Figure: Graph A and Graph B]
Exercise 3.2
• Represent the following joint probability by a BN

  P(A, B, C, D, E) = P(A) P(B|A) P(C|B) P(D|B) P(E|D)

[Figure: partially drawn BN with nodes A, B, C]
Exercise 3.3
• Fill the blanks so that the following HMM and the BN become equivalent

[Figure: left-to-right HMM with states a and b, self-loops, and Gaussian
output distributions N(X|μa, σa) and N(X|μb, σb); equivalent BN with state
nodes S1, S2, S3, …, ST and observation nodes X1, X2, X3, …, XT]

Initial state probability:  P(a) = 0.2,  P(b) = 0.8

Transition probability P(St|St-1):
              St = a    St = b
  St-1 = a    ( 0.3 )   ( 0.7 )
  St-1 = b    ( 0.4 )     0.6
Exercise 3.4
• Assume a probabilistic model P(x|μ), a training sample x1, and a prior distribution of a parameter P(μ) are given as follows:

  P(μ) = (1/√(2π)) exp( −μ²/2 )        (Gaussian distribution with mean 0 and variance 1)
  P(x|μ) = (1/√(2π)) exp( −(x−μ)²/2 )  (Gaussian distribution with mean μ and variance 1)

1) Estimate the posterior distribution

  P(μ|x1) = P(x1|μ) P(μ) / P(x1)

2) Estimate the predictive distribution

  P(x) = ∫ P(x|μ) P(μ|x1) dμ

Note:  ∫−∞..∞ exp( −c x² ) dx = √(π/c)
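Both answers can be checked numerically. A minimal Python sketch (not part of the original slides): the sample value x1 = 1.0 and the integration grid are arbitrary choices; completing the square gives the posterior N(μ | x1/2, 1/2) and the predictive N(x | x1/2, 3/2), which the grid approximation is compared against.

```python
import math

def gauss(x, mean, var):
    # Gaussian density N(x | mean, var)
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

x1 = 1.0  # arbitrary training sample

# Unnormalized posterior P(x1|mu) * P(mu), normalized numerically on a grid
step = 0.001
grid = [i * step - 10.0 for i in range(20001)]
unnorm = [gauss(x1, mu, 1.0) * gauss(mu, 0.0, 1.0) for mu in grid]
z = sum(unnorm) * step  # approximates P(x1)
posterior = [u / z for u in unnorm]

# Completing the square gives P(mu|x1) = N(mu | x1/2, 1/2); compare pointwise
for mu, p in zip(grid, posterior):
    assert abs(p - gauss(mu, x1 / 2.0, 0.5)) < 1e-6

# Predictive distribution: integral of P(x|mu) P(mu|x1) dmu = N(x | x1/2, 3/2)
x = 0.3
pred = sum(gauss(x, mu, 1.0) * p for mu, p in zip(grid, posterior)) * step
assert abs(pred - gauss(x, x1 / 2.0, 1.5)) < 1e-6
```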
Variational Bayes
• Evaluating the posterior distribution is often not feasible:

  p(Λ|D) = p(D|Λ) p(Λ) / p(D),   p(D) = ∫ p(D|Λ) p(Λ) dΛ   ← Too complex

• Let’s approximate it with a simpler distribution:

  p(Λ|D) ≈ q(Λ),   where  q(Λ) ≡ q(Λ|D)   (for simplicity of notation)

• The minimum of the KL divergence is found using the variational method:

  q* = argmin_q KL[ q(Λ) || p(Λ|D) ]
KL and Lower Bound
• Minimizing the KL divergence is equal to maximizing the lower bound

  KL[ q(Λ) || p(Λ|D) ]
    = −∫ q(Λ) log ( p(Λ|D) / q(Λ) ) dΛ
    = −∫ q(Λ) log ( p(Λ,D) / (q(Λ) p(D)) ) dΛ
    = −∫ q(Λ) log ( p(Λ,D) / q(Λ) ) dΛ + log p(D)

Therefore

  log p(D) = L[q] + KL[ q(Λ) || p(Λ|D) ]

  where  L[q] ≡ ∫ q(Λ) log ( p(Λ,D) / q(Λ) ) dΛ

  log p(D): model evidence (constant for q),   L[q]: lower bound
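The decomposition log p(D) = L[q] + KL can be verified exactly on a toy discrete model. A minimal Python sketch (not from the slides); the joint table and the approximating q below are arbitrary numbers:

```python
import math

# Toy discrete model: hidden variable Lambda in {0, 1}, observation D fixed.
# p_joint[i] = p(Lambda = i, D); the numbers are arbitrary.
p_joint = [0.12, 0.28]
p_D = sum(p_joint)                    # model evidence p(D)
post = [p / p_D for p in p_joint]     # true posterior p(Lambda|D)

q = [0.7, 0.3]                        # an arbitrary approximating distribution

# Lower bound L[q] = sum_Lambda q(Lambda) log( p(Lambda,D) / q(Lambda) )
L = sum(qi * math.log(pj / qi) for qi, pj in zip(q, p_joint))
# KL[ q || p(Lambda|D) ]
KL = sum(qi * math.log(qi / pi) for qi, pi in zip(q, post))

# log p(D) = L[q] + KL holds for any q
assert abs(math.log(p_D) - (L + KL)) < 1e-12
assert KL >= 0.0  # so L[q] is indeed a lower bound on log p(D)
```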
Approximation by Factorization
• Assume (groups of) hidden variables are conditionally independent
• It is called a mean field approximation by analogy to physics
• No restriction on the functional forms

  p(Λ1, Λ2, …, ΛM | D) ≈ q(Λ1, Λ2, …, ΛM) = ∏_{i=1..M} q_i(Λi)
Posterior Inference with Mean Field Approximation
• Suppose a probability model consists of two (groups of) hidden variables H, Z and a (group of) observation variable D
• We approximate p(H,Z|D) as q(H,Z|D) = q(H)q(Z)

  argmax over q(Z), q(H) of  L[ q(Z)q(H) ]   subject to   ∫ q(Z) dZ = 1,   ∫ q(H) dH = 1

With Lagrange multipliers λZ and λH:

  F[ q(Z)q(H) ] = L[ q(Z)q(H) ] − λZ( ∫ q(Z) dZ − 1 ) − λH( ∫ q(H) dH − 1 )

Maximization of F[q]

  F[ q(Z)q(H) ] = ∫∫ q(Z) q(H) log ( p(Z,H,D) / (q(Z)q(H)) ) dZ dH
                  − λZ( ∫ q(Z) dZ − 1 ) − λH( ∫ q(H) dH − 1 )

Setting the functional derivatives with respect to q(Z) and q(H) to zero (variational method; see the appendix for the derivation) gives

  q(Z) = CZ exp( ∫ q(H) log p(Z,H,D) dH )
  q(H) = CH exp( ∫ q(Z) log p(Z,H,D) dZ )

  CZ, CH: normalization constants

The maximum is obtained by alternately updating q(Z) and q(H)
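The alternating updates above can be demonstrated on a toy discrete model, where the integrals become sums. A minimal Python sketch (not from the slides); the joint table and initialization are arbitrary, and the check is that each update can only increase the lower bound L[q]:

```python
import math

# Toy model: hidden variables Z, H in {0, 1}, observation D fixed.
# joint[z][h] = p(Z=z, H=h, D); the numbers are arbitrary.
joint = [[0.10, 0.25], [0.30, 0.05]]

def lower_bound(qZ, qH):
    # L = sum_{Z,H} q(Z)q(H) log( p(Z,H,D) / (q(Z)q(H)) )
    return sum(qZ[z] * qH[h] * math.log(joint[z][h] / (qZ[z] * qH[h]))
               for z in range(2) for h in range(2))

def normalize(v):
    s = sum(v)
    return [x / s for x in v]

qZ, qH = [0.5, 0.5], [0.5, 0.5]  # arbitrary initialization
bounds = [lower_bound(qZ, qH)]
for _ in range(20):
    # q(Z) = C_Z exp( sum_H q(H) log p(Z,H,D) )
    qZ = normalize([math.exp(sum(qH[h] * math.log(joint[z][h]) for h in range(2)))
                    for z in range(2)])
    # q(H) = C_H exp( sum_Z q(Z) log p(Z,H,D) )
    qH = normalize([math.exp(sum(qZ[z] * math.log(joint[z][h]) for z in range(2)))
                    for h in range(2)])
    bounds.append(lower_bound(qZ, qH))

# Each alternating update can only increase the lower bound (coordinate ascent)
assert all(b2 >= b1 - 1e-12 for b1, b2 in zip(bounds, bounds[1:]))
```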
Variational GMM

  p(D, Z, π, θ) = P(π) P(θ) P(D|Z,θ) P(Z|π)
                = P(π) P(θ) ∏t P(Xt|Mt,θ) P(Mt|π)

[Figure: BN with parameter nodes π and Θ, mixture component nodes
M1, M2, …, MT (= Z), and observation nodes X1, X2, …, XT (= D)]

Mean field approximation:

  p(Z, π, θ | D) ≈ q(Z|D) q(π, θ|D)

  q(Z) ∝ exp( ∫∫ q(π,θ) log p(Z, π, θ, D) dπ dθ )
  q(π,θ) ∝ exp( Σ_Z q(Z) log p(Z, π, θ, D) )

Cont.

  q(π,θ) ∝ exp( Σ_Z q(Z) log p(Z, π, θ, D) )
         ∝ P(π) exp( Σt Σ_Mt q(Mt) log P(Mt|π) ) ⋅ P(θ) exp( Σt Σ_Mt q(Mt) log P(Xt|Mt,θ) )
         = q(π) q(θ)

  q(Z) = ∏t q(Mt),
  q(Mt) ∝ exp( ∫ q(π) log P(Mt|π) dπ ) ⋅ exp( ∫ q(θ) log P(Xt|Mt,θ) dθ )

Cf. (compare with the above results)

  P(π|Z) ∝ P(π) P(Z|π) = P(π) ∏t P(Mt|π) = P(π) exp( Σt log P(Mt|π) )
  P(θ|Z,D) ∝ P(θ) P(D|Z,θ) = P(θ) exp( Σt log P(Xt|Mt,θ) )
  P(Mt, Xt) = P(Mt) P(Xt|Mt) = exp( log P(Mt) ) ⋅ exp( log P(Xt|Mt) )
Pseudo Random Generator
• On a digital computer, everything is deterministically calculated and there is no randomness
• However, sometimes we want random numbers
• Most programming languages have a pseudo random generator function

Python 2.6
> import random
> random.random()
0.89388901900395423
> random.random()
0.98563591571989639
> random.random()
Sampling From a Uniform Distribution
• Random numbers distributed uniformly over some region

Example: histogram of samples obtained from a uniform distribution over (0, 1) (to make the graph, scilab was used)

histplot(10,rand(1:1000),

[Figure: histograms of 10 samples and 1000 samples; vertical axis: number of occurrences]
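A Python equivalent of the scilab histogram above; a minimal sketch (not from the slides), with an arbitrary seed for reproducibility:

```python
import random

random.seed(0)  # fixed seed so the run is reproducible
n = 1000
samples = [random.random() for _ in range(n)]  # uniform over (0, 1)

# 10-bin histogram over (0, 1), as in the histplot call
bins = [0] * 10
for s in samples:
    bins[int(s * 10)] += 1

assert sum(bins) == n
# With 1000 samples, every bin should be reasonably close to n/10 = 100
assert all(50 < b < 150 for b in bins)
```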
Sampling From a Gaussian Distribution
• Standard normal (Gaussian) distribution has a mean 0.0 and a variance 1.0

Example:

[Figure: histograms of 100 samples and 10000 samples; vertical axis: number of occurrences]
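One common way to draw standard normal samples from uniform ones is the Box-Muller transform (not stated on the slide; it is itself an example of transforming random variables). A minimal Python sketch with an arbitrary seed:

```python
import math
import random

random.seed(0)

def box_muller():
    # Transform two uniform samples into one standard normal sample
    u1, u2 = random.random(), random.random()
    return math.sqrt(-2.0 * math.log(1.0 - u1)) * math.cos(2.0 * math.pi * u2)

n = 10000
samples = [box_muller() for _ in range(n)]

mean = sum(samples) / n
var = sum((s - mean) ** 2 for s in samples) / n
# Sample mean and variance should be near 0.0 and 1.0
assert abs(mean) < 0.05
assert abs(var - 1.0) < 0.1
```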
Transform of Random Variable
• Let x be a random variable and f be a function y = f(x). When x follows p(x), y follows the following distribution q(y):

  q(y) = p(x) |dx/dy|
Example
• When p(x) and y = f(x) are given as follows, obtain the distribution q(y)

  p(x) = 1,  x ∈ (0, 1),   y = 3x + 2

  x = (y − 2)/3,   dx/dy = 1/3

  q(y) = p(x) |dx/dy| = 1/3,   y ∈ (2, 5)

[Figure: p(x) = 1 on (0, 1); q(y) = 1/3 on (2, 5)]
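The result can be checked by simulation; a minimal Python sketch (not from the slides), with an arbitrary seed. If q(y) = 1/3 on (2, 5), the mean is 3.5 and each third of the interval receives about 1/3 of the samples:

```python
import random

random.seed(0)
n = 10000
ys = [3.0 * random.random() + 2.0 for _ in range(n)]  # y = 3x + 2, x ~ U(0,1)

assert all(2.0 <= y < 5.0 for y in ys)
mean = sum(ys) / n
assert abs(mean - 3.5) < 0.05
frac = sum(1 for y in ys if y < 3.0) / n  # mass of the first third of (2, 5)
assert abs(frac - 1.0 / 3.0) < 0.03
```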
Exercise 4.1
• When p(x) and y = f(x) are given as follows, obtain the distribution q(y)

  p(x) = 1,  x ∈ (0, 1),   y = −log(1 − x)

  x = 1 − exp(−y),   dx/dy = exp(−y)
  x = 0 → y = 0,   x = 1 → y = ∞

  q(y) = p(x) |dx/dy| = exp(−y)

[Figure: histogram of x (uniform) and histogram of y (exponential)]
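This transform is the classic inverse-CDF way of drawing exponential samples from uniform ones. A minimal Python sketch (not from the slides), with an arbitrary seed; q(y) = exp(−y) has mean 1 and CDF 1 − exp(−t):

```python
import math
import random

random.seed(0)
n = 10000
xs = [random.random() for _ in range(n)]      # x ~ U(0, 1)
ys = [-math.log(1.0 - x) for x in xs]         # y = -log(1 - x)

mean = sum(ys) / n
assert abs(mean - 1.0) < 0.05                 # exponential mean is 1
frac = sum(1 for y in ys if y <= 1.0) / n
assert abs(frac - (1.0 - math.exp(-1.0))) < 0.03  # P(y <= 1) = 1 - e^-1
```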
Exercise 4.2
• When p(x) and y = f(x) are given as follows, obtain the distribution q(y)

  p(x) = (1/√(2π)) exp( −x²/2 ) = N(x|0, 1),   x ∈ (−∞, ∞),   y = 3x + 4

  x = (y − 4)/3,   dx/dy = 1/3

  q(y) = p(x) |dx/dy| = (1/(3√(2π))) exp( −(y − 4)²/(2·3²) ) = N(y|4, 3²)

[Figure: histogram of x and histogram of y]
Sampling from Complex Distributions
• Distributions that can be obtained by a transformation from a simple distribution (such as the uniform distribution) are limited
• We need sampling methods that do not require the integral and inverse of a function, and that can be applied to more complex distributions
  • Ancestral sampling
  • Rejection sampling

Ancestral Sampling
• Assumptions:
  • We want samples from a joint distribution p(x1, x2, …, xM)
  • The joint distribution is given as a Bayesian network
  • Sampling from the conditional distributions is easy
• Algorithm:
  • Sample from the parent nodes to the child nodes in order
  • For the child nodes, use the already sampled parent values in the conditional part

[Figure: BN with nodes A, B, C, D, E and distributions P(A), P(B|A), P(C|B)]
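The algorithm above can be sketched for a small chain-structured BN. A minimal Python sketch (not from the slides); the network A → B → C and all CPT numbers are made up for illustration:

```python
import random

random.seed(0)

# Toy Bayesian network A -> B -> C with made-up binary CPTs
P_A = 0.3                          # P(A = 1)
P_B_given_A = {1: 0.9, 0: 0.2}     # P(B = 1 | A = a)
P_C_given_B = {1: 0.8, 0: 0.1}     # P(C = 1 | B = b)

def bernoulli(p):
    return 1 if random.random() < p else 0

def ancestral_sample():
    # Sample parents first, then children conditioned on the sampled values
    a = bernoulli(P_A)
    b = bernoulli(P_B_given_A[a])
    c = bernoulli(P_C_given_B[b])
    return a, b, c

n = 20000
samples = [ancestral_sample() for _ in range(n)]

# Check the marginal P(B=1) = 0.9*0.3 + 0.2*0.7 = 0.41 against the sample frequency
b_freq = sum(b for _, b, _ in samples) / n
assert abs(b_freq - 0.41) < 0.02
```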
Rejection Sampling
• Assumptions:
  • We want samples from a distribution p(x). The normalization constant Z may be unknown:

    p(x) = (1/Z) p̃(x)

  • We have a distribution q(x) from which we can easily derive samples. We refer to q as a proposal distribution

Procedure of Rejection Sampling
Algorithm:
1. Choose a constant k so that the following holds:  k q(x) ≥ p̃(x)
2. Derive a sample s from q(x)
3. Derive a sample u from a uniform distribution ranging [0, kq(s)]
4. If u > p̃(s), then reject the sample. Otherwise, adopt it

[Figure: kq(x) enveloping p̃(x); u is drawn uniformly over [0, kq(s)] at x = s]
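The four steps above can be sketched directly. A minimal Python example (not from the slides): the unnormalized target p̃(x) = exp(−x²/2) (a standard normal without its constant), a uniform proposal over (−5, 5) with density 1/10, and k = 10 so that kq(x) = 1 ≥ p̃(x) everywhere are all illustrative choices:

```python
import math
import random

random.seed(0)

def p_tilde(x):
    # Unnormalized target: standard normal without its constant Z
    return math.exp(-x * x / 2.0)

# Proposal q(x): uniform over (-5, 5), density 1/10. With k = 10,
# k*q(x) = 1 >= p_tilde(x) everywhere, since p_tilde's maximum is 1.
k, q_density, lo, hi = 10.0, 0.1, -5.0, 5.0

def rejection_sample():
    while True:
        s = random.uniform(lo, hi)              # step 2: sample from q
        u = random.uniform(0.0, k * q_density)  # step 3: uniform over [0, kq(s)]
        if u <= p_tilde(s):                     # step 4: adopt; otherwise retry
            return s

n = 5000
samples = [rejection_sample() for _ in range(n)]
mean = sum(samples) / n
var = sum((s - mean) ** 2 for s in samples) / n
# The adopted samples follow the normalized target, here the standard normal
assert abs(mean) < 0.06
assert abs(var - 1.0) < 0.1
```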
Markov Chain Monte Carlo
• General and powerful framework for sampling
  • Scales well with the dimensionality of the sample space
• Maintains a state that forms a Markov chain. The set of the states follows the desired distribution
  • Metropolis algorithm
  • Metropolis-Hastings algorithm
Metropolis Algorithm
• Assumptions:
  • We want samples from a distribution p(X) = (1/Z) p̃(X)
  • The normalization constant Z may be unknown
• Initialization:
  1. Prepare a symmetric proposal distribution q(xA|xB) that satisfies q(xA|xB) = q(xB|xA)
  2. Prepare an initial state x0
Exercise 4.3
• Show that N(xA|xB, 1) = N(xB|xA, 1), where N(x|m, v) is the Gaussian distribution with mean m and variance v

  N(xA|xB, 1) = (1/√(2π)) exp( −(xA − xB)²/2 )
              = (1/√(2π)) exp( −(xB − xA)²/2 )
              = N(xB|xA, 1)
Procedure of Metropolis Algorithm
• Algorithm:
  1. Get a candidate sample x* from the proposal distribution q(x|xt) based on the current state xt
  2. Accept the candidate with probability A(x*, xt) or discard it
  3. If the candidate is accepted, save it as the next state xt+1. If it is discarded, then set xt+1 equal to xt
  4. Goto step 1

  A(x*, xt) = min( 1, p̃(x*) / p̃(xt) )

[Figure: a random walk of states x0, x1, x2, x3, x4, …, xt over the target distribution]
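The procedure can be sketched with the symmetric Gaussian proposal from Exercise 4.3. A minimal Python example (not from the slides): the unnormalized target p̃(x) = exp(−x²/2), the step size, the chain length, and the burn-in are all illustrative choices:

```python
import math
import random

random.seed(0)

def p_tilde(x):
    # Unnormalized target: standard normal without its constant Z
    return math.exp(-x * x / 2.0)

def metropolis(n_samples, step=1.0):
    x = 0.0  # initial state x0
    chain = []
    for _ in range(n_samples):
        # Symmetric proposal q(x*|xt): Gaussian random walk around xt
        cand = x + random.gauss(0.0, step)
        # Accept with probability A(x*, xt) = min(1, p_tilde(x*) / p_tilde(xt))
        if random.random() < min(1.0, p_tilde(cand) / p_tilde(x)):
            x = cand       # accepted: x_{t+1} = x*
        chain.append(x)    # rejected: x_{t+1} = x_t
    return chain

chain = metropolis(20000)
burn = chain[2000:]  # discard the initial transient
mean = sum(burn) / len(burn)
var = sum((s - mean) ** 2 for s in burn) / len(burn)
# The chain states should follow the (normalized) standard normal
assert abs(mean) < 0.1
assert abs(var - 1.0) < 0.15
```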
Metropolis-Hastings Algorithm
• An extension of the Metropolis algorithm
• No symmetry requirement for the proposal distribution
• The acceptance probability of the candidate state is defined as follows. The other parts are the same as the Metropolis algorithm:

  A(x*, xt) = min( 1, ( p̃(x*) q(xt|x*) ) / ( p̃(xt) q(x*|xt) ) )
Gibbs Sampling
• Problem:
  • We want samples from a joint distribution p(x1, x2, …, xM)
• Algorithm:
  1. Prepare an initial state X0 = <x1, x2, …, xM>0
  2. Select one of the variables xi in order or at random
  3. Get a sample from p(xi | X\i) and update xi with that value
  4. Goto step 2. After enough iterations, the distribution of Xt follows p(X)

Compared to the Metropolis algorithm:
• Sampling from the conditional distribution of xi given all other variables needs to be feasible
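The algorithm can be sketched on a bivariate Gaussian, whose conditionals are easy to sample. A minimal Python example (not from the slides): the target is a zero-mean bivariate Gaussian with unit variances and correlation rho = 0.8 (an arbitrary choice), for which p(x1|x2) = N(rho·x2, 1 − rho²) and symmetrically for x2:

```python
import math
import random

random.seed(0)

# Target: bivariate Gaussian, zero mean, unit variances, correlation rho.
rho = 0.8
cond_std = math.sqrt(1.0 - rho * rho)  # std of each conditional

x1, x2 = 0.0, 0.0  # initial state
samples = []
for _ in range(20000):
    x1 = random.gauss(rho * x2, cond_std)  # update x1 ~ p(x1 | x2)
    x2 = random.gauss(rho * x1, cond_std)  # update x2 ~ p(x2 | x1)
    samples.append((x1, x2))

burn = samples[2000:]  # discard the initial transient
n = len(burn)
m1 = sum(a for a, _ in burn) / n
m2 = sum(b for _, b in burn) / n
corr = sum((a - m1) * (b - m2) for a, b in burn) / n
corr /= math.sqrt(sum((a - m1) ** 2 for a, _ in burn) / n)
corr /= math.sqrt(sum((b - m2) ** 2 for _, b in burn) / n)
# The sampled correlation should be close to rho
assert abs(corr - rho) < 0.05
```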
Variational Method (Outline)
• Review of derivatives
  • Function: a mapping from a value to a value

  f(x):  set of values → set of values

If f takes a maximum/minimum at x0, and x is close to x0:

  ( f(x) − f(x0) ) / ( x − x0 ) ≈ 0
Functional and Functional Derivative
• Functional: a mapping from a function to a value

  F[f]:  set of functions → set of values

  Ex. Entropy H[p] takes a function p (a probability distribution) and returns a value

If a functional F takes a maximum/minimum at f0, and f is “close” to f0:

  ( F[f] − F[f0] ) / ( f − f0 ) ≈ 0
Euler-Lagrange Equation
Suppose F takes a minimum/maximum at f0. Let η be an arbitrary function of x, and ε a scalar constant:

  f(x) = f0(x) + ε η(x)

  g(ε) = F[f]

F is a functional of f; g(ε) is a function of ε (takes and returns a scalar).

When ε is close to 0, f is close to f0. Therefore,

  ( F[f] − F[f0] ) / ( ε − 0 ) = 0,   i.e.   ∂g/∂ε |ε=0 = ∂F/∂ε |ε=0 = 0

This must hold for an arbitrary η.
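As a worked example of the variational method (not on the slides): maximizing the entropy functional from the earlier slide under a normalization constraint yields the uniform distribution, using exactly the Lagrange-multiplier construction applied above for F[q].

```latex
% Maximize H[p] = -\int_a^b p(x)\log p(x)\,dx subject to \int_a^b p(x)\,dx = 1,
% using a Lagrange multiplier \lambda:
F[p] = -\int_a^b p(x)\log p(x)\,dx \;-\; \lambda\!\left(\int_a^b p(x)\,dx - 1\right)
% Setting the functional derivative to zero:
\frac{\delta F}{\delta p(x)} = -\log p(x) - 1 - \lambda = 0
\;\Rightarrow\; p(x) = e^{-1-\lambda} \quad \text{(a constant)}
% The normalization constraint fixes the constant:
p(x) = \frac{1}{b-a} \quad \text{(the uniform distribution over } (a,b)\text{)}
```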