Speech and Language Processing
Lecture 5
Neural network based acoustic and language models
Information and Communications Engineering Course
Takahiro Shinozaki
Lecture Plan (Shinozaki’s part)
1. 10/20 (remote): Speech recognition based on GMM, HMM, and N-gram
2. 10/27 (remote): Maximum likelihood estimation and EM algorithm
3. 11/3 (remote): Bayesian network and Bayesian inference
4. 11/6 (@TAIST): Variational inference and sampling
5. 11/7 (@TAIST): Neural network based acoustic and language models
6. 11/7 (@TAIST): Weighted finite state transducer (WFST) and speech decoding
Today’s Topic
• Answers for the previous exercises
• Neural network based acoustic and language models
Exercise 4.1
• When p(x) and y = f(x) are given as follows, obtain the distribution q(y):
  p(x) = 1, x ∈ (0, 1),   y = −log(1 − x)

Answer:
  x = 1 − exp(−y),  dx/dy = exp(−y)
  x = 0 → y = 0,  x → 1 → y → ∞
  q(y) = p(x) |dx/dy| = exp(−y)

[Figures: histogram of x (number of occurrences) and histogram of y]
Exercise 4.2
• When p(x) and y = f(x) are given as follows, obtain the distribution q(y):
  p(x) = N(x | 0, 1) = (1/√(2π)) exp(−x²/2),  x ∈ (−∞, ∞),   y = 3x + 4

Answer:
  x = (y − 4)/3,  dx/dy = 1/3
  q(y) = p(x) |dx/dy| = (1/(3√(2π))) exp(−(y − 4)²/(2·3²)) = N(y | 4, 3²)

[Figures: histogram of x and histogram of y]
Exercise 4.3
• Show that N(x_A | x_B, 1) = N(x_B | x_A, 1), where N(x | m, v) is the Gaussian
  distribution with mean m and variance v

Answer:
  N(x_A | x_B, 1) = (1/√(2π)) exp(−(x_A − x_B)²/2) = (1/√(2π)) exp(−(x_B − x_A)²/2) = N(x_B | x_A, 1)
Multi Layer Perceptron (MLP)
• Unit of MLP:
  y = h(Σ_i w_i x_i + b)
  h: activation function, w_i: weights, b: bias
• MLP consists of multiple layers of such units
[Figure: a single unit with inputs x_1, …, x_n and output y; an MLP with an input layer, hidden layers, and an output layer producing y_1, …, y_m]
Activation Functions
• Unit step function: h(x) = 1 if 0 ≤ x, 0 otherwise
• Sigmoid function: h(x) = 1 / (1 + exp(−x))
• Linear function: h(x) = x
• ReLU: h(x) = max{0, x}
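As a quick reference, here is a minimal NumPy sketch of the four activation functions listed above (the Python function names are my own, not from the slides):

```python
import numpy as np

def step(x):
    # Unit step: 1 where x >= 0, 0 otherwise
    return (x >= 0).astype(float)

def sigmoid(x):
    # Logistic sigmoid: 1 / (1 + exp(-x))
    return 1.0 / (1.0 + np.exp(-x))

def linear(x):
    # Identity / linear activation
    return x

def relu(x):
    # Rectified linear unit: max(0, x), applied elementwise
    return np.maximum(0.0, x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(step(x), sigmoid(x), linear(x), relu(x), sep="\n")
```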
Softmax Function
• For N variables z_i, the softmax function is:
  h(z_i) = exp(z_i) / Σ_j exp(z_j)
• Properties of softmax
  • Positive: 0 < h(z_i)
  • Sum is one: Σ_{i=1}^N h(z_i) = 1
  • Expresses a probability distribution
• Example
  Z = (z_1, z_2, z_3) = (−1, 2, 1):
  (h(z_1), h(z_2), h(z_3)) = (0.0351, 0.7054, 0.2595)
  Another example input: (z_1, z_2, z_3) = (16, 8, 12)
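A small NumPy sketch of the softmax function; it reproduces the example above. (Subtracting the maximum is a standard numerical-stability trick, not something stated on the slide; it does not change the result.)

```python
import numpy as np

def softmax(z):
    # h(z_i) = exp(z_i) / sum_j exp(z_j); subtracting max(z) avoids overflow
    e = np.exp(z - np.max(z))
    return e / e.sum()

Z = np.array([-1.0, 2.0, 1.0])
h = softmax(Z)
print(np.round(h, 4))   # [0.0351 0.7054 0.2595]
print(h.sum())          # 1.0: all entries are positive and sum to one
```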
Exercise 5.1
• Let h be a softmax function having inputs z_1, z_2, …, z_N:
  h(z_i) = exp(z_i) / Σ_j exp(z_j)
• Prove that 0 < h(z_i) < 1 and Σ_{i=1}^N h(z_i) = 1
Forward Propagation
• Compute the output of the MLP step by step from the input layer to the output layer
[Figure: layers evaluated in order, e.g. sigmoid layer → sigmoid layer → softmax layer]
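A minimal sketch of forward propagation through a small MLP with two sigmoid hidden layers and a softmax output layer, matching the example layer types above (the layer sizes and random weights are arbitrary assumptions for illustration):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

rng = np.random.default_rng(0)
# Weight matrices and biases for a 4 -> 5 -> 5 -> 3 network (arbitrary sizes)
W1, b1 = rng.standard_normal((5, 4)), np.zeros(5)
W2, b2 = rng.standard_normal((5, 5)), np.zeros(5)
W3, b3 = rng.standard_normal((3, 5)), np.zeros(3)

x = rng.standard_normal(4)            # input vector
h1 = sigmoid(W1 @ x + b1)             # first sigmoid layer
h2 = sigmoid(W2 @ h1 + b2)            # second sigmoid layer
y = softmax(W3 @ h2 + b3)             # softmax output layer
print(y, y.sum())                     # a probability distribution over 3 classes
```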
Parameters of Neural Network
• The weights and bias of each unit need training before the network is used
  y = h(Σ_i w_i x_i + b) = h(w·x)
  h: activation function
  w: weight vector, w = (w_1, w_2, …, w_N, b)
  x: input vector (augmented with a constant 1), x = (x_1, x_2, …, x_N, 1)
Principle of NN Training
• Training set: pairs of an input vector and a reference output vector
• Adjust the parameters of the MLP so as to minimize the error
Definitions of Errors
• Sum of square error
  • Used when the output layer uses linear functions
  E(W) = (1/2) Σ_n ||y(X_n, W) − t_n||²
• Cross-entropy
  • Used when the output layer is a softmax
  E(W) = −Σ_n Σ_k t_nk ln y_k(X_n, W)
W: set of weights in the MLP
X_n: vector of a training sample (input)
t_n: vector of a training sample (output)
n: index of training samples
t_nk: reference output (takes 1 if unit k corresponds to the correct output, 0 otherwise)
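A sketch of the two error measures in NumPy, assuming the outputs and reference outputs are stacked as rows of arrays (the array layout and toy values are my assumptions):

```python
import numpy as np

def sum_of_square_error(Y, T):
    # E(W) = 1/2 * sum_n || y(X_n, W) - t_n ||^2
    return 0.5 * np.sum((Y - T) ** 2)

def cross_entropy_error(Y, T):
    # E(W) = - sum_n sum_k t_nk * ln y_nk, with t_nk in {0, 1}
    return -np.sum(T * np.log(Y))

# Toy example: 2 samples, 3 output units
Y = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1]])       # network outputs (softmax)
T = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])       # 1-of-K reference outputs
print(sum_of_square_error(Y, T))
print(cross_entropy_error(Y, T))      # = -(ln 0.7 + ln 0.8)
```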
Gradient Descent
• An iterative optimization method
  x_{t+1} = x_t − ε ∂f(x)/∂x |_{x = x_t}
  ε: learning rate (small positive value)
[Figure: f(x) with iterates x_0 (initial value), x_1, x_2, …, x_N moving toward the minimum]
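A minimal sketch of the update rule x_{t+1} = x_t − ε ∂f/∂x on a simple one-dimensional function (the function f and the settings are arbitrary illustrations, not from the slides):

```python
# Gradient descent on f(x) = (x - 3)^2, whose minimum is at x = 3
def f(x):
    return (x - 3.0) ** 2

def df_dx(x):
    return 2.0 * (x - 3.0)

x = 0.0          # initial value x0
eps = 0.1        # learning rate (small positive value)
for t in range(50):
    x = x - eps * df_dx(x)   # x_{t+1} = x_t - eps * df/dx at x_t
print(x, f(x))   # x approaches 3, f(x) approaches 0
```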
MLP Training by Gradient Descent
• Define an error measure E(W) for the training samples
  e.g. E(W) = (1/2) Σ_n ||y(X_n, W) − t_n||²
• Initialize the parameters W = {w_1, w_2, …, w_M}
• Repeatedly update the parameter set using gradient descent:
  w_i(t+1) = w_i(t) − ε ∂E(W)/∂w_i |_{w_i = w_i(t)}
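A compact sketch of this procedure for the simplest case, a single linear output unit trained with the sum of square error (the data, true weights, learning rate, and iteration count are made-up assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 20 samples, 3 inputs plus a constant 1 for the bias
# (see the "Parameters of Neural Network" slide); w_true is made up
X = np.hstack([rng.standard_normal((20, 3)), np.ones((20, 1))])
w_true = np.array([1.0, -2.0, 0.5, 0.3])
T = X @ w_true                           # reference outputs

w = np.zeros(4)                          # initialize parameters
eps = 0.01                               # learning rate
for step in range(1000):
    Y = X @ w                            # linear output unit
    E = 0.5 * np.sum((Y - T) ** 2)       # sum of square error
    grad = X.T @ (Y - T)                 # dE/dw
    w = w - eps * grad                   # gradient descent update
print(w)                                 # approaches w_true
```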
Chain Rule of Differentiation
• When x, y, z are scalars, with z = f(y), y = g(x):
  ∂z/∂x = (∂z/∂y)(∂y/∂x)
• When x, y, z are vectors, e.g. x = (x_1, x_2, x_3), y = (y_1, y_2), z = (z_1, z_2),
  the same rule holds with Jacobian matrices:
  ∂z/∂x = (∂z/∂y)(∂y/∂x)
  where, e.g., the Jacobian matrix ∂y/∂x is
  [ ∂y_1/∂x_1  ∂y_1/∂x_2  ∂y_1/∂x_3 ]
  [ ∂y_2/∂x_1  ∂y_2/∂x_2  ∂y_2/∂x_3 ]
When There Are Branches
• z = f(y_1, y_2), y_1 = g_1(x), y_2 = g_2(x)
  ∂z/∂x = (∂z/∂y_1)(∂y_1/∂x) + (∂z/∂y_2)(∂y_2/∂x)
• Variations:
  • g_1(x) = x (identity): ∂z/∂x = ∂z/∂y_1 + (∂z/∂y_2)(∂y_2/∂x)
  • g_2(x) = C (independent of x): the second term vanishes, so ∂z/∂x = (∂z/∂y_1)(∂y_1/∂x)
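A numerical check of the branched chain rule above, using an arbitrary choice of f, g_1, g_2 (the finite-difference comparison is my own addition, used only to verify the analytic result):

```python
import numpy as np

# z = f(y1, y2) with y1 = g1(x), y2 = g2(x); arbitrary example functions
g1 = lambda x: x ** 2
g2 = lambda x: np.sin(x)
f  = lambda y1, y2: y1 * y2

def z(x):
    return f(g1(x), g2(x))

x0 = 0.7
# Chain rule: dz/dx = (dz/dy1)(dy1/dx) + (dz/dy2)(dy2/dx)
y1, y2 = g1(x0), g2(x0)
dz_dy1, dz_dy2 = y2, y1                # partial derivatives of f
dy1_dx, dy2_dx = 2 * x0, np.cos(x0)    # derivatives of g1, g2
analytic = dz_dy1 * dy1_dx + dz_dy2 * dy2_dx

# Finite-difference check
h = 1e-6
numeric = (z(x0 + h) - z(x0 - h)) / (2 * h)
print(analytic, numeric)               # the two values agree
```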
Back Propagation (BP)
• Layered computation:
  y_1 = f_1(x, w_1), y_2 = f_2(y_1, w_2), y_3 = f_3(y_2, w_3), y_4 = f_4(y_3, w_4), Err = E(y_4, r)
  (r: reference output)
  Ex.: y_4 = softmax(w_4·y_3), y_3 = sigmoid(w_3·y_2)
• Gradients are obtained by the chain rule, propagating from the output back to the input:
  ∂Err/∂w_4 = (∂Err/∂f_4)(∂f_4/∂w_4)
  ∂Err/∂f_3 = (∂Err/∂f_4)(∂f_4/∂f_3),  ∂Err/∂w_3 = (∂Err/∂f_3)(∂f_3/∂w_3)
  ∂Err/∂f_2 = (∂Err/∂f_3)(∂f_3/∂f_2),  ∂Err/∂w_2 = (∂Err/∂f_2)(∂f_2/∂w_2)
  ∂Err/∂f_1 = (∂Err/∂f_2)(∂f_2/∂f_1),  ∂Err/∂w_1 = (∂Err/∂f_1)(∂f_1/∂w_1)
① Obtain the value of each node by forward propagation
② Obtain the derivatives by backward propagation
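A compact sketch of steps ① and ② for a two-layer network with a sigmoid hidden layer, a softmax output layer, and the cross-entropy error (the network size and data are arbitrary; the gradient formulas follow from the chain rule as above):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

rng = np.random.default_rng(1)
W1 = rng.standard_normal((4, 3)) * 0.1   # hidden layer weights (4 units, 3 inputs)
W2 = rng.standard_normal((3, 4)) * 0.1   # output layer weights (3 classes)
x = np.array([0.5, -1.0, 2.0])
t = np.array([0.0, 1.0, 0.0])            # 1-of-K reference output

# (1) Forward propagation: obtain the value of each node
a1 = W1 @ x
y1 = sigmoid(a1)
a2 = W2 @ y1
y2 = softmax(a2)
err = -np.sum(t * np.log(y2))            # cross-entropy error

# (2) Backward propagation: obtain the derivatives
d_a2 = y2 - t                            # dErr/da2 for softmax + cross-entropy
dW2 = np.outer(d_a2, y1)                 # dErr/dW2
d_y1 = W2.T @ d_a2                       # propagate back to the hidden layer
d_a1 = d_y1 * y1 * (1.0 - y1)            # through the sigmoid
dW1 = np.outer(d_a1, x)                  # dErr/dW1
print(err, dW1.shape, dW2.shape)
```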
Feed-Forward Neural Network
• When the network structure is a DAG (directed acyclic graph), it is called a feed-forward network
• The nodes can be ordered in a line so that all connections point in the same direction
• Forward/backward propagation can then be applied efficiently
Exercise 5.2
• When h(y) and y(x) are given as follows, obtain ∂h/∂x:
  h(y) = 1 / (1 + exp(−y)),   y = ax + b

Answer:
  ∂h/∂x = (∂h/∂y)(∂y/∂x) = a h(ax + b) (1 − h(ax + b))
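A quick numerical check of this result, with arbitrary values of a, b, and x:

```python
import numpy as np

def h(y):
    return 1.0 / (1.0 + np.exp(-y))

a, b, x = 2.0, -0.5, 0.3      # arbitrary values
analytic = a * h(a * x + b) * (1.0 - h(a * x + b))

eps = 1e-6                    # finite-difference check
numeric = (h(a * (x + eps) + b) - h(a * (x - eps) + b)) / (2 * eps)
print(analytic, numeric)      # the two values agree
```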
Recurrent Neural Network (RNN)
• A neural network having a feedback (delay) connection
• Expected to have more powerful modeling performance than a feed-forward MLP, but its training is more difficult
[Figure: input layer and output layer with a delay connection feeding the output back to the input side]
Unfolding of RNN to Time Axis
[Figure: the delay (D) connection is unfolded through time along the input feature sequence]
Training of RNN by BP Through Time (BPTT)
• Regard the whole input sequence as one input and the whole output sequence as one output
• Apply back-propagation to the RNN unfolded along the time axis
[Figure: unfolded network with inputs x_1, …, x_4, hidden states h_1, …, h_4, and outputs y_1, …, y_4; back-propagation runs backward through time]
Long Short-Term Memory (LSTM)
• A type of RNN addressing the gradient vanishing problem
[Figure: LSTM cell with input x_t, output y_t, and cell state c_t (previous values c_{t−1}, y_{t−1} fed back through delays); input, forget, and output gates (sigmoid layers with affine transform), tanh layers with affine transform, and pointwise multiplication ⊗ / addition ⊕]
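A minimal sketch of one LSTM time step following the figure, with per-gate weight matrices acting on the concatenated input and previous output (all names and sizes are my own assumptions; peephole connections are omitted):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, y_prev, c_prev, W, b):
    """One LSTM step: returns the new output y_t and cell state c_t."""
    z = np.concatenate([x_t, y_prev])            # input and previous output
    i = sigmoid(W["i"] @ z + b["i"])             # input gate
    f = sigmoid(W["f"] @ z + b["f"])             # forget gate
    o = sigmoid(W["o"] @ z + b["o"])             # output gate
    g = np.tanh(W["g"] @ z + b["g"])             # candidate (tanh layer)
    c_t = f * c_prev + i * g                     # update the cell state
    y_t = o * np.tanh(c_t)                       # gated output
    return y_t, c_t

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4                               # arbitrary sizes
W = {k: rng.standard_normal((n_hid, n_in + n_hid)) * 0.1 for k in "ifog"}
b = {k: np.zeros(n_hid) for k in "ifog"}
y, c = np.zeros(n_hid), np.zeros(n_hid)
for x_t in rng.standard_normal((5, n_in)):       # a short input sequence
    y, c = lstm_step(x_t, y, c, W, b)
print(y, c)
```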
Convolutional Neural Network (CNN)
• A type of feed-forward neural network with parameter sharing and connection constraints
[Figure: the input is convolved with filters (1)–(N) to produce activation maps (1)–(N), followed by pooling, then the next convolution layer, etc.]
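A small sketch of the two operations in the figure, a 1-D convolution (with shared filter parameters) followed by max pooling (the filter values, sizes, and window choices are arbitrary assumptions):

```python
import numpy as np

def conv1d(x, w):
    # Slide the filter w over x; the same parameters are shared at every position
    n = len(x) - len(w) + 1
    return np.array([np.dot(x[i:i + len(w)], w) for i in range(n)])

def max_pool1d(a, size):
    # Take the maximum in each non-overlapping window of the activation map
    n = len(a) // size
    return np.array([a[i * size:(i + 1) * size].max() for i in range(n)])

x = np.array([1, 3, 3, 4, 2, 1, 3, 5], dtype=float)  # example input
w = np.array([1.0, 0.0, -1.0])                       # one filter (arbitrary)
act = conv1d(x, w)                                   # activation map
print(act)
print(max_pool1d(act, 2))                            # pooled activation map
```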
Deep Neural Network
• (Just a) neural network with many hidden layers (more than roughly 3〜5 layers)
• Training was difficult until recently
• Improvements in training algorithms: pre-training, dropout
• Improvements in computer hardware: GPGPU
• Year 2011: large performance gains were reported for large vocabulary speech recognition
Deep Learning Fever!
Frame Level Vowel Recognition Using MLP
• Outputs: p(あ), p(い), p(う), p(え), p(お) (posteriors of the five Japanese vowels)
• Softmax function (output layer)
• Sigmoid functions (hidden layers)
• Input: speech feature vector (e.g. MFCC)
Combination of HMM and MLP
• GMM-HMM: each HMM state s emits features with a GMM output distribution:
  p(X|s) = GMM(X|s)
• MLP-HMM: an MLP with a softmax output layer gives the state posterior, which is converted to a scaled likelihood:
  p(X|s) ∝ p(s|X) / p(s) = MLP(s|X) / p(s)
[Figure: HMM with states s_0, s_1, s_2, s_3, s_4, comparing a GMM-HMM and an MLP-HMM (softmax layer)]
MLP-HMM based Phone Recognizer
• Outputs: phone classes /a/, /i/, …, /N/
• Softmax output layer, sigmoid hidden layers
• Input: speech feature
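A minimal sketch of the posterior-to-scaled-likelihood conversion p(X|s) ∝ MLP(s|X)/p(s) used by the MLP-HMM above (the posterior and prior values here are made up for illustration):

```python
import numpy as np

# MLP output: posterior p(s|X) over HMM states for one input frame (made up)
posterior = np.array([0.70, 0.20, 0.10])
# State prior p(s), e.g. estimated from state frequencies in the training data
prior = np.array([0.50, 0.30, 0.20])

# Scaled likelihood used in place of the GMM output probability
scaled_likelihood = posterior / prior
log_scaled = np.log(posterior) - np.log(prior)   # usually handled in the log domain
print(scaled_likelihood, log_scaled)
```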
Word Vector
• One-of-K (one-hot) representation of a word for a fixed vocabulary
[Table: word, word ID, 1-of-K vector]
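A small sketch of the one-of-K (one-hot) representation for a fixed vocabulary (the example vocabulary is my own):

```python
import numpy as np

vocab = ["<s>", "</s>", "apple", "red", "big", "delicious"]   # fixed vocabulary
word_to_id = {w: i for i, w in enumerate(vocab)}              # word -> ID

def one_hot(word):
    # 1-of-K vector: 1 at the word's ID, 0 elsewhere
    v = np.zeros(len(vocab))
    v[word_to_id[word]] = 1.0
    return v

print(word_to_id["red"], one_hot("red"))   # e.g. ID 3 -> [0 0 0 1 0 0]
```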
Word Prediction Using RNN
• Input: one-of-K vector of the previous word Word_{t-1}, e.g. <0, 0, 0, 1, 0, 0, 0>
• Output: predicted distribution over the next word, e.g. <0.02, 0.65, 0.14, 0.11, 0.05, 0.01, 0.02>
• The delay (D) connection carries the history of the preceding words
RNN Language Model (Unfolded)
[Figure: the RNN unfolded over an example sentence with the words "Delicious", "Big", "Red", "Apple" between <s> and </s>; at each step the current word is input and the next word is predicted]
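A minimal sketch of one step of an RNN language model: the one-hot vector of the previous word and the delayed hidden state produce a distribution over the next word (the weights, sizes, and vocabulary indexing are arbitrary assumptions):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

rng = np.random.default_rng(0)
V, H = 6, 8                                   # vocabulary size, hidden size (arbitrary)
W_xh = rng.standard_normal((H, V)) * 0.1      # input (one-hot word) -> hidden
W_hh = rng.standard_normal((H, H)) * 0.1      # delayed hidden -> hidden (feedback)
W_hy = rng.standard_normal((V, H)) * 0.1      # hidden -> output distribution

def rnn_lm_step(x_onehot, h_prev):
    h = np.tanh(W_xh @ x_onehot + W_hh @ h_prev)   # new hidden state
    y = softmax(W_hy @ h)                          # p(next word | history)
    return y, h

h = np.zeros(H)
x = np.zeros(V); x[3] = 1.0                   # one-hot vector of the previous word
y, h = rnn_lm_step(x, h)
print(np.round(y, 3), y.sum())                # a distribution over the vocabulary
```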
Dialogue System Using Seq2Seq Network
[Figure: the encoder network reads the input "What is your name"; the decoder network then generates the reply "My name is TS-800 </s>" word by word, sampling each word from the posterior]