Introduction to Automatic Speech Recognition

Academic year: 2018
(1)

Speech and Language Processing

Lecture 4

Variational inference and sampling

Information and Communications Engineering Course

Takahiro Shinozaki

(2)

Lecture Plan (Shinozaki’s part)

1. 10/20 (remote)

Speech recognition based on GMM, HMM, and N-gram

2. 10/27 (remote)

Maximum likelihood estimation and EM algorithm

3. 11/3 (remote)

Bayesian network and Bayesian inference

4. 11/6 (@TAIST)

Variational inference and sampling

5. 11/7 (@TAIST)

Neural network based acoustic and language models

6. 11/7 (@TAIST)

(3)

Today’s Topic

Answers to the previous exercises

Variational inference

(4)
(5)

Exercise 3.1

Are the following directed graphs DAGs?

(Figure: Graph A and Graph B)

(6)

Exercise 3.2

Represent the following joint probability by a BN

P(A, B, C, D, E) = P(A) P(B|A) P(C|B) P(D|B) P(E|D)

(Figure: the corresponding BN over nodes A, B, C, D, E)

(7)

Exercise 3.3

Fill in the blanks so that the following HMM and the BN become equivalent.

(Figure: a two-state HMM with states a and b, and the equivalent BN with state nodes S1, S2, S3, …, ST and observation nodes X1, X2, X3, …, XT)

Emission distributions: N(X | μa, σa) for state a, N(X | μb, σb) for state b

Initial state probabilities: P(a) = 0.2, P(b) = 0.8

Transition probabilities P(St | St−1):
  St−1 = a:  St = a (0.3),  St = b (0.7)
  St−1 = b:  St = a (0.4),  St = b (0.6)

(8)

Exercise 3.4

• Assume a probabilistic model P(x|μ), a training sample x1, and a prior distribution P(μ) of the parameter are given as follows:

P(x|μ) = (1/√(2π)) exp(−(x − μ)²/2)   (Gaussian distribution with mean μ and variance 1)

P(μ) = (1/√(2π)) exp(−μ²/2)   (Gaussian distribution with mean 0 and variance 1)

1) Estimate the posterior distribution:

P(μ|x1) = P(x1|μ) P(μ) / P(x1)

2) Estimate the predictive distribution:

P(x|x1) = ∫ P(x|μ) P(μ|x1) dμ

Note: ∫_{−∞}^{∞} exp(−c x²) dx = √(π/c)

(9)
(10)
(11)

Variational Bayes

Evaluating the posterior distribution is often not feasible:

p(Λ|D) = p(D|Λ) p(Λ) / p(D),   p(D) = ∫ p(D|Λ) p(Λ) dΛ

Too complex. Let's approximate it with a simpler distribution q:

p(Λ|D) ≈ q(Λ),   where q(Λ) ≡ q(Λ|D)   (for simplicity of notation)

q* = argmin_q KL[ q(Λ) ‖ p(Λ|D) ]

The minimum of the KL divergence is found using the variational method.

(12)

KL and Lower Bound

Minimizing the KL divergence is equal to maximizing the lower bound:

KL[ q(Λ) ‖ p(Λ|D) ] = −∫ q(Λ) log( p(Λ|D) / q(Λ) ) dΛ
                    = −∫ q(Λ) log( p(Λ, D) / (q(Λ) p(D)) ) dΛ
                    = −∫ q(Λ) log( p(Λ, D) / q(Λ) ) dΛ + log p(D)

Therefore:

log p(D) = KL[ q(Λ) ‖ p(Λ|D) ] + L[q]

log p(D): model evidence (constant for q)

L[q] = ∫ q(Λ) log( p(Λ, D) / q(Λ) ) dΛ: lower bound

(13)

Approximation by Factorization

Assume (groups of) hidden variables are conditionally independent:

p(Λ1, Λ2, …, ΛM | D) ≈ q(Λ1, Λ2, …, ΛM) = ∏_{i=1}^{M} q_i(Λi)

• It is called a mean field approximation by analogy to physics
• No restriction on the functional forms

(14)

Posterior Inference with Mean Field Approximation

• Suppose a probability model consists of two (groups of) hidden variables H, Z and a (group of) observation variables D
• We approximate p(H, Z | D) as q(H, Z | D) = q(H) q(Z)

q*(Z) q*(H) = argmax_{q(Z)q(H)} L[ q(Z) q(H) ]   subject to ∫ q(Z) dZ = 1, ∫ q(H) dH = 1

Introducing Lagrange multipliers λZ and λH for the constraints:

F[ q(Z) q(H) ] = L[ q(Z) q(H) ] − λZ( ∫ q(Z) dZ − 1 ) − λH( ∫ q(H) dH − 1 )

q* = argmax_q F[q]

(15)

Maximization of F[q]

F[ q(Z) q(H) ] = ∫∫ q(Z) q(H) log( p(Z, H, D) / (q(Z) q(H)) ) dZ dH
                − λZ( ∫ q(Z) dZ − 1 ) − λH( ∫ q(H) dH − 1 )

Setting the functional derivatives to zero (variational method; see the appendix for the derivation):

∂F/∂q(Z) = 0,   ∂F/∂q(H) = 0

gives:

q(Z) = C_Z exp( ∫ q(H) log p(Z, H, D) dH )
q(H) = C_H exp( ∫ q(Z) log p(Z, H, D) dZ )

C_Z, C_H: normalization constants

The maximum is obtained by alternately updating q(Z) and q(H).

(16)

Variational GMM

p(π, θ, Z, D) = P(π) P(θ) ∏_{t=1}^{T} P(Mt | π) P(Xt | Mt, θ)

where Z = {M1, …, MT} are the hidden mixture components and D = {X1, …, XT} are the observations.

(Figure: BN with parameter nodes π and Θ; π → Mt and Mt → Xt, Θ → Xt for t = 1, …, T)

Mean field approximation:

p(π, θ, Z | D) ≈ q(π, θ, Z | D) = q(π, θ | D) q(Z | D)

The variational updates are:

q(π, θ) ∝ exp( Σ_Z q(Z) log p(π, θ, Z, D) )
q(Z) ∝ exp( ∫∫ q(π, θ) log p(π, θ, Z, D) dπ dθ )

(17)

Cont.

The updates factorize further:

q(π, θ) = q(π) q(θ),   q(Z) = ∏_t q(Mt)

q(π) ∝ P(π) exp( Σ_t Σ_{Mt} q(Mt) log P(Mt | π) )
q(θ) ∝ P(θ) exp( Σ_t Σ_{Mt} q(Mt) log P(Xt | Mt, θ) )
q(Mt) ∝ exp( ∫ q(π) log P(Mt | π) dπ ) · exp( ∫ q(θ) log P(Xt | Mt, θ) dθ )

Cf. (compare with the above results) the exact conditionals:

P(π | Z) ∝ P(π) ∏_t P(Mt | π) = P(π) exp( Σ_t log P(Mt | π) )
P(θ | Z, D) ∝ P(θ) ∏_t P(Xt | Mt, θ) = P(θ) exp( Σ_t log P(Xt | Mt, θ) )
P(Mt | Xt, π, θ) ∝ P(Mt | π) P(Xt | Mt, θ) = exp( log P(Mt | π) ) exp( log P(Xt | Mt, θ) )

The variational updates have the same form, with each log-term replaced by its expectation under the q-distributions of the other variables.

(Figure: BN with parameter nodes π and Θ; π → Mt and Mt → Xt, Θ → Xt for t = 1, …, T)

(18)
(19)

Pseudo Random Generator

On a digital computer, everything is deterministically calculated and there is no true randomness. However, we sometimes want random numbers. Most programming languages have a pseudo random generator function.

Python 2.6:

> import random
> random.random()
0.89388901900395423
> random.random()
0.98563591571989639
> random.random()
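In modern Python 3 the same `random` module also exposes seeding, which makes the deterministic nature of pseudo random generation explicit: the same seed always reproduces the same sequence. A minimal sketch:

```python
import random

# A pseudo random generator is deterministic: the same seed
# always reproduces the same sequence of "random" numbers.
random.seed(42)
first = [random.random() for _ in range(3)]

random.seed(42)  # reset to the same seed
second = [random.random() for _ in range(3)]

print(first == second)  # -> True: identical sequences
```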

(20)

Sampling From a Uniform Distribution

• Random numbers distributed uniformly over some region

Example: histogram of samples obtained from a uniform distribution over (0, 1). (To make the graph, Scilab was used.)

histplot(10, rand(1:1000))

(Figure: histograms for 10 samples and for 1000 samples; vertical axis: number of occurrences)

(21)

Sampling From a Gaussian Distribution

• The standard normal (Gaussian) distribution has mean 0.0 and variance 1.0

Example:

(Figure: histograms for 100 samples and for 10000 samples; vertical axis: number of occurrences)
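The slides do not say how the normal samples were generated; one standard way to obtain standard normal samples from uniform ones is the Box-Muller transform. A Python sketch (the function name is ours):

```python
import math
import random

def box_muller():
    """Turn two uniform(0,1) samples into one standard normal sample."""
    u1 = random.random()
    while u1 == 0.0:          # avoid log(0); random() can return exactly 0.0
        u1 = random.random()
    u2 = random.random()
    return math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)

random.seed(0)
samples = [box_muller() for _ in range(10000)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(mean, var)  # close to 0.0 and 1.0, as for a standard normal
```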

(22)

Transform of Random Variable

Let x be a random variable and f be a function, y = f(x). When x follows p(x), y follows the distribution q(y):

q(y) = p(x) |dx/dy|

(23)

Example

When p(x) and y = f(x) are given as follows, obtain the distribution q(y):

p(x) = 1 (0 ≤ x ≤ 1),   y = 3x + 2

x = (y − 2)/3,   dx/dy = 1/3

q(y) = p(x) |dx/dy| = 1/3   (2 ≤ y ≤ 5)

(Figure: p(x) = 1 on [0, 1]; q(y) = 1/3 on [2, 5])
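The example can be checked numerically: transform uniform samples by y = 3x + 2 and confirm they spread uniformly over [2, 5]:

```python
import random

random.seed(1)
xs = [random.random() for _ in range(100000)]  # p(x) = 1 on (0, 1)
ys = [3 * x + 2 for x in xs]                   # y = 3x + 2

# All transformed samples land in [2, 5]
print(min(ys) >= 2.0 and max(ys) <= 5.0)       # -> True

# q(y) = 1/3: each unit-length bin of [2, 5] should hold
# roughly one third of the samples
frac = sum(1 for y in ys if 3.0 <= y < 4.0) / len(ys)
print(frac)  # roughly 1/3
```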

(24)

Exercise 4.1

When p(x) and y = f(x) are given as follows, obtain the distribution q(y):

p(x) = 1 (0 < x ≤ 1),   y = log(1/x)

x = 1 → y = 0,   x = 0 → y = ∞
x = exp(−y),   dx/dy = −exp(−y)

q(y) = p(x) |dx/dy| = exp(−y)

(Figure: histogram of x and histogram of y)
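This exercise is a sampling recipe: y = log(1/x) with uniform x yields the Exponential(1) density exp(−y). A numerical check in Python, comparing the empirical CDF at t = 1 against 1 − exp(−1):

```python
import math
import random

random.seed(2)
xs = [random.random() for _ in range(100000)]
# y = log(1/x); skip the (measure-zero) case x == 0
ys = [math.log(1.0 / x) for x in xs if x > 0.0]

# For q(y) = exp(-y), P(y <= t) = 1 - exp(-t); check at t = 1
emp = sum(1 for y in ys if y <= 1.0) / len(ys)
print(emp, 1 - math.exp(-1))  # both close to 0.632
```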

(25)

Exercise 4.2

When p(x) and y = f(x) are given as follows, obtain the distribution q(y):

p(x) = (1/√(2π)) exp(−x²/2) = N(x | 0, 1),   x ∈ (−∞, ∞),   y = 3x + 4

x = (y − 4)/3,   dx/dy = 1/3

q(y) = p(x) |dx/dy| = (1/(3√(2π))) exp(−(y − 4)²/(2·3²)) = N(y | 4, 3²)

(Figure: histogram of x and histogram of y)

(26)

Sampling from Complex Distributions

Distributions that can be obtained by a transformation from a simple distribution (such as the uniform distribution) are limited.

We need sampling methods that do not require integration or function inversion, and that can be applied to more complex distributions:

• Ancestral sampling
• Rejection sampling

(27)

Ancestral Sampling

Assumptions:

• We want samples from a joint distribution p(x1, x2, …, xM)
• The joint distribution is given as a Bayesian network
• Sampling from the conditional distributions is easy

Algorithm:

• Sample from the parent nodes to the child nodes in order
• For the child nodes, use the already sampled parent values in the conditional part

(Figure: BN with nodes A, B, C, D, E; sample from P(A), then P(B|A), then P(C|B), …)

(28)

Rejection Sampling

Assumptions:

• We want samples from a distribution p(x). The normalization constant Z may be unknown:

p(x) = (1/Z) p̃(x)

• We have a distribution q(x) from which we can easily draw samples. We refer to q as a proposal distribution.
(29)

Procedure of Rejection Sampling

Algorithm:

1. Choose a constant k so that kq(x) ≥ p̃(x) holds for all x
2. Draw a sample s from q(x)
3. Draw a sample u from a uniform distribution over [0, kq(s)]
4. If u > p̃(s), reject the sample. Otherwise, accept it.

(Figure: kq(x) envelopes p̃(x); u is drawn on the vertical segment [0, kq(s)] at x = s)
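The four steps can be sketched in Python. The unnormalized target p̃(x) = x(1 − x) on [0, 1] and the uniform proposal are our own choices for illustration; k = 0.25 satisfies kq(x) ≥ p̃(x) since q(x) = 1 and p̃ peaks at 0.25:

```python
import random

def p_tilde(x):
    # Unnormalized target on [0, 1] (a Beta(2,2) shape);
    # its normalization constant (6) is never needed
    return x * (1.0 - x)

def rejection_sample(k=0.25):
    # Proposal q(x) is uniform on [0, 1], so q(s) = 1 everywhere
    while True:
        s = random.random()               # step 2: sample from q
        u = random.uniform(0.0, k * 1.0)  # step 3: uniform on [0, k*q(s)]
        if u <= p_tilde(s):               # step 4: accept, else retry
            return s

random.seed(4)
samples = [rejection_sample() for _ in range(20000)]
mean = sum(samples) / len(samples)
print(mean)  # close to 0.5, the mean of the Beta(2,2) target
```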

(30)

Markov Chain Monte Carlo

General and powerful framework for sampling

• Scales well with the dimensionality of the sample space

Maintains a state that forms a Markov chain; the sequence of states follows the desired distribution

• Metropolis algorithm

• Metropolis-Hastings algorithm

(31)

Metropolis Algorithm

Assumptions:

• We want samples from a distribution p(X)
• The normalization constant Z may be unknown:

p(X) = (1/Z) p̃(X)

Initialization:

1. Prepare a symmetric proposal distribution q(xA|xB) that satisfies q(xA|xB) = q(xB|xA)
2. Prepare an initial state x0
(32)

Exercise 4.3

Show that N(xA | xB, 1) = N(xB | xA, 1), where N(x | m, v) is the Gaussian distribution with mean m and variance v.

N(xA | xB, 1) = (1/√(2π)) exp( −(xA − xB)²/2 )
             = (1/√(2π)) exp( −(xB − xA)²/2 )
             = N(xB | xA, 1)

(33)

Procedure of Metropolis Algorithm

Algorithm:

1. Get a candidate sample x* from the proposal distribution q(x|xt) based on the current state xt
2. Accept the candidate with probability A(x*, xt) = min( 1, p̃(x*) / p̃(xt) ), otherwise discard it
3. If the candidate is accepted, save it as the next state xt+1. If it is discarded, set xt+1 equal to xt
4. Go to step 1

(Figure: a random walk through states x0, x1, x2, x3, x4, …, xt)
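A minimal Metropolis sampler, using an unnormalized standard normal as the target and a symmetric Gaussian random walk as the proposal (both illustrative choices of ours):

```python
import math
import random

def p_tilde(x):
    # Unnormalized target: a standard normal without its constant
    return math.exp(-0.5 * x * x)

def metropolis(n_samples, x0=0.0, step=1.0):
    x = x0
    chain = []
    for _ in range(n_samples):
        x_star = random.gauss(x, step)               # symmetric proposal
        a = min(1.0, p_tilde(x_star) / p_tilde(x))   # acceptance probability
        if random.random() < a:
            x = x_star                               # accept candidate
        chain.append(x)                              # else keep current state
    return chain

random.seed(5)
chain = metropolis(50000)
burned = chain[1000:]  # discard burn-in before estimating moments
mean = sum(burned) / len(burned)
var = sum((v - mean) ** 2 for v in burned) / len(burned)
print(mean, var)  # close to 0.0 and 1.0 for the standard normal target
```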

(34)

Metropolis-Hastings Algorithm

An extension of the Metropolis algorithm: there is no symmetry requirement for the proposal distribution.

The acceptance probability of the candidate state is defined as follows; the other parts are the same as in the Metropolis algorithm.

A(x*, xt) = min( 1, p̃(x*) q(xt | x*) / ( p̃(xt) q(x* | xt) ) )
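To see the Hastings correction in action, here is a sketch targeting an Exponential(1) density with an asymmetric log-normal random-walk proposal (our own illustrative choice); for this proposal the correction ratio q(xt|x*)/q(x*|xt) simplifies to x*/xt:

```python
import math
import random

def p_tilde(x):
    # Unnormalized target on x > 0: Exponential(1) without its constant
    return math.exp(-x)

def metropolis_hastings(n_samples, x0=1.0, sigma=0.8):
    x = x0
    chain = []
    for _ in range(n_samples):
        # Asymmetric log-normal random walk: x* = x * exp(eps), eps ~ N(0, sigma^2)
        x_star = x * math.exp(random.gauss(0.0, sigma))
        # Hastings correction q(x|x*)/q(x*|x) equals x*/x for this proposal
        a = min(1.0, p_tilde(x_star) * x_star / (p_tilde(x) * x))
        if random.random() < a:
            x = x_star
        chain.append(x)
    return chain

random.seed(6)
chain = metropolis_hastings(60000)
burned = chain[2000:]
mean = sum(burned) / len(burned)
print(mean)  # close to 1.0, the mean of Exponential(1)
```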

(35)

Gibbs Sampling

Problem:

• We want samples from a joint distribution p(x1, x2, …, xM)

Algorithm:

1. Prepare an initial state X0 = <x1, x2, …, xM>0
2. Select one of the variables xi, in order or at random
3. Get a sample from p(xi | X\i) and update xi with that value
4. Go to step 2. After enough iterations, the distribution of Xt follows p(X)

Compared to the Metropolis algorithm:

• Sampling from the conditional distribution of xi given all other variables needs to be feasible
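A Gibbs sampler sketch for a bivariate standard normal with correlation ρ (an illustrative target of ours), where both full conditionals are one-dimensional Gaussians and therefore easy to sample:

```python
import math
import random

def gibbs(n_samples, rho=0.8):
    # Bivariate standard normal with correlation rho: each full
    # conditional is Gaussian, so sampling xi given the rest is easy
    x1, x2 = 0.0, 0.0
    sd = math.sqrt(1.0 - rho * rho)
    chain = []
    for _ in range(n_samples):
        x1 = random.gauss(rho * x2, sd)  # x1 | x2 ~ N(rho*x2, 1-rho^2)
        x2 = random.gauss(rho * x1, sd)  # x2 | x1 ~ N(rho*x1, 1-rho^2)
        chain.append((x1, x2))
    return chain

random.seed(7)
chain = gibbs(50000)[1000:]  # drop burn-in
corr_est = sum(a * b for a, b in chain) / len(chain)
print(corr_est)  # sample E[x1*x2], close to rho = 0.8
```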

(36)
(37)

Variational Method (Outline)

Review of derivatives

• Function: a mapping from a value to a value

(Figure: f(x) maps a set of values to a set of values)

If f takes a minimum/maximum at x0, then for x close to x0:

f(x) − f(x0) ≈ f′(x0)(x − x0) = 0,   since f′(x0) = 0

(38)

Functional and Functional Derivative

Functional: a mapping from a function to a value

(Figure: F[f] maps a set of functions to a set of values)

Ex. Entropy H[p] takes a function p (a probability distribution) and returns a value.

If a functional F takes a maximum/minimum at f0, and f is "close" to f0, then F[f] − F[f0] ≈ 0.

(39)

Euler-Lagrange Equation

Suppose F takes a minimum/maximum at f0. Let η be an arbitrary function of x, and ε a scalar constant:

f(x) = f0(x) + ε η(x)

g(ε) = F[f]   (g takes and returns a scalar; F is a functional of f)

When ε is close to 0, f is close to f0. Therefore:

∂g/∂ε |_{ε=0} = ∂F[f]/∂ε |_{ε=0} = 0

This must hold for an arbitrary η.

Cont.

When F[f] = ∫ h(f) dx + C,

∂F[f]/∂ε = ∫ (∂h/∂f)(∂f/∂ε) dx = ∫ (∂h/∂f) η(x) dx

∫ (∂h/∂f) η(x) dx = 0 must hold for an arbitrary η. Therefore:

∂h/∂f = 0
