Introduction to Automatic Speech Recognition


(1)

Speech and Language Processing

Lecture 5

Neural network based acoustic and language models

Information and Communications Engineering Course

Takahiro Shinozaki

(2)

Lecture Plan (Shinozaki’s part)

1. 10/20 (remote): Speech recognition based on GMM, HMM, and N-gram
2. 10/27 (remote): Maximum likelihood estimation and EM algorithm
3. 11/3 (remote): Bayesian network and Bayesian inference
4. 11/6 (@TAIST): Variational inference and sampling
5. 11/7 (@TAIST): Neural network based acoustic and language models
6. 11/7 (@TAIST): Weighted finite state transducer (WFST) and speech decoding

(3)

Today’s Topic

Answers for the previous exercises

Neural network based acoustic and language models

(4)
(5)

Exercise 4.1

When p(x) and y = f(x) are given as follows, obtain the distribution q(y):

p(x) = 1  (0 ≤ x ≤ 1),   y = f(x) = -log(x)

Answer:
x = exp(-y),   dx/dy = -exp(-y),   x = 0 → y = ∞,   x = 1 → y = 0

q(y) = p(x) |dx/dy| = exp(-y)

[Figure: histogram of x (# of occurrences) and histogram of y]
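This change of variables can be checked numerically by sampling (a minimal sketch using numpy; the variable names and bin settings are illustrative):

import numpy as np

# Sample x ~ Uniform(0, 1), i.e. p(x) = 1 on [0, 1]
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=1_000_000)

# Transform y = f(x) = -log(x)
y = -np.log(x)

# The derived density q(y) = p(x)|dx/dy| = exp(-y) should match the histogram of y
bins = np.linspace(0.0, 10.0, 101)
hist, edges = np.histogram(y, bins=bins, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
print(np.max(np.abs(hist - np.exp(-centers))))  # small: close to exp(-y) up to Monte Carlo error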

(6)

Exercise 4.2

When p(x) and y = f(x) are given as follows, obtain the distribution q(y):

p(x) = N(x | 0, 1) = (1/√(2π)) exp(-x²/2),   x ∈ (-∞, ∞),   y = 3x + 4

Answer:
x = (y - 4)/3,   dx/dy = 1/3

q(y) = p(x) |dx/dy| = (1/(3√(2π))) exp(-(y - 4)²/(2·3²)) = N(y | 4, 3²)

[Figure: histogram of x (# of occurrences) and histogram of y]

(7)

Exercise 4.3

Show that N(x_A | x_B, 1) = N(x_B | x_A, 1), where N(x | m, v) is the Gaussian distribution with mean m and variance v.

Answer: since (x_A - x_B)² = (x_B - x_A)²,

N(x_A | x_B, 1) = (1/√(2π)) exp(-(x_A - x_B)²/2) = (1/√(2π)) exp(-(x_B - x_A)²/2) = N(x_B | x_A, 1)

(8)
(9)

Multi Layer Perceptron (MLP)

Unit of MLP

An MLP consists of multiple layers of such units.

y = h( Σ_i w_i x_i + b )

h: activation function
w: weight
b: bias

[Figure: a single unit with inputs x_1, x_2, …, x_i and output y; a network with an input layer (x_1 … x_n), hidden layers, and an output layer (y_1 … y_m)]

(10)

Activation Functions

Unit step function:  h(x) = 1 if 0 ≤ x, 0 otherwise

Sigmoid function:    h(x) = 1 / (1 + exp(-x))

Linear function:     h(x) = x

ReLU:                h(x) = max{0, x}
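These four activation functions can be written directly in numpy (a minimal sketch; function names are illustrative):

import numpy as np

def step(x):      # unit step: 1 if 0 <= x, 0 otherwise
    return np.where(x >= 0, 1.0, 0.0)

def sigmoid(x):   # h(x) = 1 / (1 + exp(-x))
    return 1.0 / (1.0 + np.exp(-x))

def linear(x):    # h(x) = x
    return x

def relu(x):      # h(x) = max{0, x}
    return np.maximum(0.0, x)

x = np.array([-2.0, 0.0, 2.0])
print(step(x), sigmoid(x), linear(x), relu(x))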

(11)

Softmax Function

For N variables z_i, the softmax function is:

h(z_i) = exp(z_i) / Σ_j exp(z_j)

Properties of softmax:
• Positive: 0 < h(z_i)
• Sum is one: Σ_{i=1}^{N} h(z_i) = 1.0
→ Expresses a probability distribution

Example:
Z = (z_1, z_2, z_3) = (-1, 2, 1)
h(Z) = ( h(z_1), h(z_2), h(z_3) ) = ( 0.0351, 0.7054, 0.2595 )

Another example input: Z = (z_1, z_2, z_3) = (16, 8, 12)
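A direct implementation reproducing the first example (a sketch; subtracting the maximum is only for numerical stability and does not change the result):

import numpy as np

def softmax(z):
    # h(z_i) = exp(z_i) / sum_j exp(z_j); subtracting max(z) avoids overflow
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

Z = np.array([-1.0, 2.0, 1.0])
h = softmax(Z)
print(np.round(h, 4))   # [0.0351 0.7054 0.2595]
print(h.sum())          # sums to one (a probability distribution)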

(12)

Exercise 5.1

Let h be a softmax function having inputs z_1, z_2, …, z_N:

h(z_i) = exp(z_i) / Σ_j exp(z_j)

Prove that

Σ_{i=1}^{N} h(z_i) = 1.0

(13)

Forward Propagation

Compute the output of the MLP step by step from the input layer to the output layer.

E.g. sigmoid layer → sigmoid layer → softmax layer
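A minimal forward pass through two sigmoid layers and a softmax output layer, as in the example above (a sketch; the weights are random placeholders, not trained values, and the layer sizes are arbitrary):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

rng = np.random.default_rng(0)
x = rng.normal(size=10)                      # input vector

# layer parameters (random placeholders)
W1, b1 = rng.normal(size=(8, 10)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 8)),  np.zeros(8)
W3, b3 = rng.normal(size=(5, 8)),  np.zeros(5)

h1 = sigmoid(W1 @ x + b1)                    # sigmoid layer
h2 = sigmoid(W2 @ h1 + b2)                   # sigmoid layer
y  = softmax(W3 @ h2 + b3)                   # softmax output layer
print(y, y.sum())                            # output is a probability distribution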

(14)

Parameters of Neural Network

The weights and a bias of each unit need training

before the network is used

y = h( Σ_i w_i x_i + b ) = h( w · x )

h: activation function
w: weight vector, w = (w_1, w_2, …, w_N, b)
x: input vector, x = (x_1, x_2, …, x_N, 1)

[Figure: a unit with inputs x_1 … x_N and a constant input 1, weights w_1 … w_N and bias b, and output y]

(15)

Principle of NN Training

Training set

Reference output vector

Input vector

Adjust parameters of MLP so as to minimize the error

(16)

Definitions of Errors

Sum of square error

• Used when output layer uses linear functions

Cross-entropy

• Used when the output layer is a softmax

Sum of squares:   E(W) = (1/2) Σ_n ( y(X_n, W) - t_n )²

Cross-entropy:    E(W) = - Σ_n Σ_k t_nk ln{ y_k(X_n, W) }

W:    set of weights in the MLP
X_n:  vector of a training sample (input)
t_n:  vector of a training sample (output)
n:    index of training samples
t_nk: reference output (takes 1 if unit k corresponds to the correct output, 0 otherwise)
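Both error measures written out in numpy (a sketch; the arrays y and t are illustrative, with t one-hot in the cross-entropy case):

import numpy as np

def sum_of_squares(y, t):
    # E(W) = 1/2 * sum_n || y(X_n, W) - t_n ||^2
    return 0.5 * np.sum((y - t) ** 2)

def cross_entropy(y, t):
    # E(W) = - sum_n sum_k t_nk * ln y_k(X_n, W)
    return -np.sum(t * np.log(y))

y_lin = np.array([[0.9, 1.8], [0.2, 0.4]])           # linear outputs, 2 samples
t_lin = np.array([[1.0, 2.0], [0.0, 0.5]])           # reference outputs
print(sum_of_squares(y_lin, t_lin))

y_sm   = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])  # softmax outputs
t_1hot = np.array([[1, 0, 0], [0, 1, 0]])              # reference (1-of-K)
print(cross_entropy(y_sm, t_1hot))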

(17)

Gradient Descent

An iterative optimization method

x_{t+1} = x_t - ε ∂f(x)/∂x |_{x = x_t}

ε: learning rate (small positive value)

[Figure: the curve f(x) with iterates x_0 (initial value), x_1, x_2, …, x_N moving toward the minimum; the arrow at x_0 shows the gradient ∂f(x)/∂x |_{x = x_0}]
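Gradient descent on a simple one-dimensional example, f(x) = (x - 3)² (an illustrative sketch; the learning rate and starting point are arbitrary):

# minimize f(x) = (x - 3)**2 with x_{t+1} = x_t - eps * f'(x_t)
def f_prime(x):
    return 2.0 * (x - 3.0)

eps = 0.1        # learning rate (small positive value)
x = 0.0          # initial value x_0
for t in range(50):
    x = x - eps * f_prime(x)
print(x)         # close to the minimizer x = 3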

(18)

MLP Training by Gradient Descent

Define an error measure E(W) for the training samples, e.g.

E(W) = (1/2) Σ_n ( y(X_n, W) - t_n )²

Initialize the parameters W = { w_1, w_2, …, w_M }

Repeatedly update the parameter set using gradient descent:

w_i(t+1) = w_i(t) - ε ∂E(W)/∂w_i |_{w_i = w_i(t)}

(19)

Chain Rule of Differentiation

z = f(y),  y = g(x)   (x → g → y → f → z)

When x, y, z are scalars:

∂z/∂x = (∂z/∂y)(∂y/∂x)

When x, y, z are vectors, e.g. x = (x_1, x_2, x_3), y = (y_1, y_2), z = (z_1, z_2), the same relation holds with Jacobian matrices:

∂z/∂x = (∂z/∂y)(∂y/∂x)

where ∂z/∂x = [∂z_i/∂x_j] is a 2×3 Jacobian matrix, ∂z/∂y = [∂z_i/∂y_k] is 2×2, and ∂y/∂x = [∂y_k/∂x_j] is 2×3.

(20)

When There Are Branches

z = f(y_1, y_2),   y_1 = g_1(x),   y_2 = g_2(x)

∂z/∂x = (∂z/∂y_1)(∂y_1/∂x) + (∂z/∂y_2)(∂y_2/∂x)

Variations:
• g_1(x) = x (identity):          ∂z/∂x = ∂z/∂y_1 + (∂z/∂y_2)(∂y_2/∂x)
• g_2(x) = C (independent of x):  the second term vanishes, ∂z/∂x = (∂z/∂y_1)(∂y_1/∂x)

(21)

Back Propagation (BP)

y_1 = f_1(x, w_1),  y_2 = f_2(y_1, w_2),  y_3 = f_3(y_2, w_3),  y_4 = f_4(y_3, w_4),  Err = E(y_4, r)

x: input,  r: reference,  w_1 … w_4: weights
Ex.: y_4 = softmax(w_4 · y_3),  y_3 = sigmoid(w_3 · y_2)

∂Err/∂w_4 = (∂Err/∂f_4)(∂f_4/∂w_4)
∂Err/∂w_3 = (∂Err/∂f_3)(∂f_3/∂w_3),   ∂Err/∂f_3 = (∂Err/∂f_4)(∂f_4/∂f_3)
∂Err/∂w_2 = (∂Err/∂f_2)(∂f_2/∂w_2),   ∂Err/∂f_2 = (∂Err/∂f_3)(∂f_3/∂f_2)
∂Err/∂w_1 = (∂Err/∂f_1)(∂f_1/∂w_1),   ∂Err/∂f_1 = (∂Err/∂f_2)(∂f_2/∂f_1)

① Obtain the value of each node by forward propagation
② Obtain the derivatives by backward propagation
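A compact numerical sketch of steps ① and ② for a smaller, two-layer instance (sigmoid hidden layer, softmax output, cross-entropy error); the shapes, data, and reference vector are illustrative, not the network on the slide:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)                      # input
r = np.array([0.0, 1.0, 0.0])               # reference (1-of-K)

W1, b1 = rng.normal(size=(5, 4)) * 0.1, np.zeros(5)
W2, b2 = rng.normal(size=(3, 5)) * 0.1, np.zeros(3)

# (1) forward propagation: keep the value of each node
a1 = W1 @ x + b1
y1 = 1.0 / (1.0 + np.exp(-a1))              # sigmoid layer
a2 = W2 @ y1 + b2
y2 = np.exp(a2 - a2.max()); y2 /= y2.sum()  # softmax layer
err = -np.sum(r * np.log(y2))               # cross-entropy error

# (2) backward propagation: chain rule from the error back to each weight
d_a2 = y2 - r                               # dErr/da2 for softmax + cross-entropy
d_W2 = np.outer(d_a2, y1); d_b2 = d_a2
d_y1 = W2.T @ d_a2
d_a1 = d_y1 * y1 * (1.0 - y1)               # through the sigmoid
d_W1 = np.outer(d_a1, x);  d_b1 = d_a1

# check one derivative against a finite difference
i, j, h = 0, 0, 1e-6
W1p = W1.copy(); W1p[i, j] += h
y1p = 1.0 / (1.0 + np.exp(-(W1p @ x + b1)))
a2p = W2 @ y1p + b2
y2p = np.exp(a2p - a2p.max()); y2p /= y2p.sum()
errp = -np.sum(r * np.log(y2p))
print(d_W1[i, j], (errp - err) / h)         # should be nearly equal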

(22)

Feed-Forward Neural Network

• When the network structure is a DAG (directed acyclic graph), it is called a feed-forward network

• The nodes can be ordered in a line so that all connections point in the same direction

• Forward/backward propagation can be applied efficiently

(23)

Exercise 5.2

When h(y) and y(x) are given as follows, obtain ∂h/∂x:

h(y) = 1 / (1 + exp(-y)),   y = ax + b

Answer:
∂h/∂x = (∂h/∂y)(∂y/∂x) = h(y)( 1 - h(y) ) · a = a · h(ax + b)( 1 - h(ax + b) )
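The result can be checked against a finite difference (a sketch with arbitrary values of a, b, and x):

import numpy as np

def h(y):
    return 1.0 / (1.0 + np.exp(-y))

a, b, x, eps = 2.0, -0.5, 0.3, 1e-6
analytic = a * h(a * x + b) * (1.0 - h(a * x + b))      # dh/dx from the chain rule
numeric  = (h(a * (x + eps) + b) - h(a * x + b)) / eps  # finite difference
print(analytic, numeric)                                # nearly equal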

(24)

Recurrent Neural Network (RNN)

Neural network having a feedback

• Expected to have more modeling power than a feed-forward MLP, but training is more difficult

Delay

Input

Output

Input layer

Output layer

(25)

Unfolding of RNN to Time Axis

[Figure: the RNN with its delay unit D is unfolded through time along the input feature sequence]

(26)

Training of RNN by BP Through Time (BPTT)

Regard the input sequence (x_1, x_2, x_3, x_4) as a single input, and the output sequence (y_1, y_2, y_3, y_4) as a single output.

[Figure: the unfolded network with hidden states h_1, h_2, h_3, h_4; back-propagation is applied through the unfolded structure]
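A minimal unrolled forward pass of a simple (Elman-type) RNN over a short sequence, mirroring the unfolded figure; all weights and sizes are random placeholders:

import numpy as np

rng = np.random.default_rng(0)
T, D_in, D_h, D_out = 4, 3, 5, 2
xs = rng.normal(size=(T, D_in))             # input sequence x_1 .. x_4

Wxh = rng.normal(size=(D_h, D_in)) * 0.1    # input -> hidden
Whh = rng.normal(size=(D_h, D_h)) * 0.1     # hidden -> hidden (the delayed feedback)
Why = rng.normal(size=(D_out, D_h)) * 0.1   # hidden -> output

h = np.zeros(D_h)                           # initial hidden state
for t in range(T):
    h = np.tanh(Wxh @ xs[t] + Whh @ h)      # recurrence: h_t depends on x_t and h_{t-1}
    y = Why @ h                             # output y_t
    print(t, y)
# For BPTT, the stored h_1 .. h_T are reused when back-propagating through the unfolded graph.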

(27)

Long Short-Term Memory (LSTM)

A type of RNN addressing the gradient vanishing problem.

[Figure: LSTM cell with input x_t, output y_t, and cell state c_t; the previous output y_{t-1} and cell state c_{t-1} are fed back through delay units. The cell uses an input gate, a forget gate, and an output gate (σ: sigmoid layers with affine transform), a tanh layer with affine transform, and pointwise multiplications.]
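One step of a standard LSTM cell written out with the gates named in the figure (a sketch; the weight shapes and inputs are placeholders, biases are omitted, and the exact parameterization may differ slightly from the one on the slide):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
D_x, D_h = 3, 4
x_t = rng.normal(size=D_x)
y_prev, c_prev = np.zeros(D_h), np.zeros(D_h)   # y_{t-1}, c_{t-1} from the delay units

# one affine transform per gate/layer, applied to [x_t, y_{t-1}]
z = np.concatenate([x_t, y_prev])
W_i, W_f, W_o, W_g = (rng.normal(size=(D_h, D_x + D_h)) * 0.1 for _ in range(4))

i_t = sigmoid(W_i @ z)          # input gate  (sigmoid layer with affine transform)
f_t = sigmoid(W_f @ z)          # forget gate
o_t = sigmoid(W_o @ z)          # output gate
g_t = np.tanh(W_g @ z)          # tanh layer with affine transform

c_t = f_t * c_prev + i_t * g_t  # pointwise multiplications update the cell state
y_t = o_t * np.tanh(c_t)        # output
print(c_t, y_t)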

(28)

Convolutional Neural Network (CNN)

A type of feed-forward neural network with parameter sharing and connection constraints.

[Figure: the input is convolved with filters (1), (2), …, (N) to produce activation maps (1), (2), …, (N); pooling is applied to each activation map, followed by the next convolution layer, etc.]
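A 1-D convolution followed by max pooling, showing the parameter sharing (the same filter weights slide over the whole input); the input values and filter are arbitrary:

import numpy as np

x = np.array([1.0, 3.0, 3.0, 4.0, 2.0, 1.0, 3.0, 5.0])   # input
w = np.array([1.0, 0.0, -1.0])                            # one shared filter

# convolution (valid mode): every output position reuses the same weights w
act = np.array([np.dot(w, x[i:i + len(w)]) for i in range(len(x) - len(w) + 1)])
print(act)                                                 # activation map

# max pooling with window 2, stride 2
pooled = np.array([act[i:i + 2].max() for i in range(0, len(act) - 1, 2)])
print(pooled)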

(29)

Deep Neural Network

(Just a) Neural network with many hidden layers

• Number of layers greater than about 3–5

Training was difficult until recently

• Improvements in training algorithms: Pre-training, Dropout

• Improvements in computer hardware: GPGPU

Year 2011:

• Large performance gains have been reported for large vocabulary speech recognition

Deep Learning Fever!

(30)
(31)

Frame Level Vowel Recognition Using MLP

Output: p(あ), p(い), p(う), p(え), p(お)

Softmax function

Sigmoid function

Sigmoid function

Input: Speech feature vector (e.g. MFCC)

(32)

Combination of HMM and MLP

GMM-HMM: each HMM state s (s0, s1, …, s4) emits the feature X with likelihood p(X|s) given by a GMM.

MLP-HMM: the softmax layer of an MLP gives the state posterior p(s|X); the emission likelihood is obtained as a scaled likelihood:

p(X|s) = p(s|X) p(X) / p(s) ∝ p(s|X) / p(s)

[Figure: the same HMM topology (states s0–s4) with either GMM emission densities or MLP softmax outputs per state]
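The scaled-likelihood computation, sketched with made-up posterior and prior values (in a real recognizer the posteriors would come from the MLP and the priors from state frequencies in the training data):

import numpy as np

post  = np.array([0.7, 0.2, 0.1])    # MLP output: p(s | X) for states s1..s3 (illustrative)
prior = np.array([0.5, 0.3, 0.2])    # state priors p(s) (illustrative)

scaled_lik = post / prior                    # proportional to p(X | s); p(X) cancels in decoding
log_scaled = np.log(post) - np.log(prior)    # usually used in the log domain
print(scaled_lik, log_scaled)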

(33)

MLP-HMM based Phone Recognizer

/a/ /i/ /N/

Softmax

Sigmoid

Sigmoid

Input speech feature

(34)
(35)

Word Vector

One-of-K representation of a word for a fixed vocabulary

[Table: word, word ID, 1-of-K vector]
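Building the 1-of-K vector for a small vocabulary (a sketch; the vocabulary and word are illustrative):

import numpy as np

vocab = ["<s>", "</s>", "apple", "big", "delicious", "red"]   # fixed vocabulary
word_id = {w: i for i, w in enumerate(vocab)}                 # word -> ID

def one_of_k(word):
    v = np.zeros(len(vocab))
    v[word_id[word]] = 1.0      # 1 at the word's ID, 0 elsewhere
    return v

print(word_id["red"], one_of_k("red"))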

(36)

Word Prediction Using RNN

[Figure: the previous word Word_{t-1} is fed in as a 1-of-K vector, e.g. <0, 0, 0, 1, 0, 0, 0>; the recurrent layer with delay D produces a distribution over the next word, e.g. <0.02, 0.65, 0.14, 0.11, 0.05, 0.01, 0.02>]

(37)

RNN Language Model (Unfolded)

[Figure: the RNN language model unfolded over the example word sequence <s> Big Delicious Red Apple </s>]

(38)

Dialogue System Using Seq2Seq Network

[Figure: the encoder network reads the input "What is your name"; the decoder network, starting from <s> and sampling from the posterior at each step, generates the reply "My name is TS-800 </s>"]
