Speech and Language Processing
Lecture 5
Neural network based acoustic and language models
Information and Communications Engineering Course
Takahiro Shinozaki
Lecture Plan (Shinozaki’s part)
1. 10/20 (remote): Speech recognition based on GMM, HMM, and N-gram
2. 10/27 (remote): Maximum likelihood estimation and EM algorithm
3. 11/3 (remote): Bayesian network and Bayesian inference
4. 11/6 (@TAIST): Variational inference and sampling
5. 11/7 (@TAIST): Neural network based acoustic and language models
6. 11/7 (@TAIST): Weighted finite state transducer (WFST) and speech decoding
Today’s Topic
• Answers for the previous exercises
• Neural network based acoustic and language models
Exercise 4.1
• When p(x) and y = f(x) are given as follows, obtain the distribution q(y):
  p(x) = 1, x ∈ (0, 1),   y = −log(1 − x)

Answer:
  x = 1 − exp(−y),  dx/dy = exp(−y)
  x = 0 → y = 0,  x → 1 → y → ∞
  q(y) = p(x) |dx/dy| = exp(−y)

[Figures: histogram of x (number of occurrences) and histogram of y]
Exercise 4.2
• When p(x) and y = f(x) are given as follows, obtain the distribution q(y):
  p(x) = N(x | 0, 1) = (1/√(2π)) exp(−x²/2),  x ∈ (−∞, ∞),   y = 3x + 4

Answer:
  x = (y − 4)/3,  dx/dy = 1/3
  q(y) = p(x) |dx/dy| = (1/(3√(2π))) exp(−(y − 4)²/(2·3²)) = N(y | 4, 3²)

[Figures: histogram of x and histogram of y]
Exercise 4.3
• Show that N(x_A | x_B, 1) = N(x_B | x_A, 1), where N(x | m, v) is the Gaussian
  distribution with mean m and variance v

Answer:
  N(x_A | x_B, 1) = (1/√(2π)) exp(−(x_A − x_B)²/2) = (1/√(2π)) exp(−(x_B − x_A)²/2) = N(x_B | x_A, 1)
Multi Layer Perceptron (MLP)
• Unit of MLP:
  y = h(Σ_i w_i x_i + b)
  h: activation function, w_i: weights, b: bias
• MLP consists of multiple layers of such units
[Figure: a single unit with inputs x_1, …, x_n and output y; an MLP with an input layer, hidden layers, and an output layer producing y_1, …, y_m]
Activation Functions
• Unit step function: h(x) = 1 if 0 ≤ x, 0 otherwise
• Sigmoid function: h(x) = 1 / (1 + exp(−x))
• Linear function: h(x) = x
• ReLU: h(x) = max{0, x}
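As a quick reference, here is a minimal NumPy sketch of the four activation functions listed above (the Python function names are my own, not from the slides):

```python
import numpy as np

def step(x):
    # Unit step: 1 where x >= 0, 0 otherwise
    return (x >= 0).astype(float)

def sigmoid(x):
    # Logistic sigmoid: 1 / (1 + exp(-x))
    return 1.0 / (1.0 + np.exp(-x))

def linear(x):
    # Identity / linear activation
    return x

def relu(x):
    # Rectified linear unit: max(0, x), applied elementwise
    return np.maximum(0.0, x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(step(x), sigmoid(x), linear(x), relu(x), sep="\n")
```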
Softmax Function
• For N variables z_i, the softmax function is:
  h(z_i) = exp(z_i) / Σ_j exp(z_j)
• Properties of softmax
  • Positive: 0 < h(z_i)
  • Sum is one: Σ_{i=1}^N h(z_i) = 1
  • Expresses a probability distribution
• Example
  Z = (z_1, z_2, z_3) = (−1, 2, 1):
  (h(z_1), h(z_2), h(z_3)) = (0.0351, 0.7054, 0.2595)
  Another example input: (z_1, z_2, z_3) = (16, 8, 12)
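A small NumPy sketch of the softmax function; it reproduces the example above. (Subtracting the maximum is a standard numerical-stability trick, not something stated on the slide; it does not change the result.)

```python
import numpy as np

def softmax(z):
    # h(z_i) = exp(z_i) / sum_j exp(z_j); subtracting max(z) avoids overflow
    e = np.exp(z - np.max(z))
    return e / e.sum()

Z = np.array([-1.0, 2.0, 1.0])
h = softmax(Z)
print(np.round(h, 4))   # [0.0351 0.7054 0.2595]
print(h.sum())          # 1.0: all entries are positive and sum to one
```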
Exercise 5.1
• Let h be a softmax function having inputs z_1, z_2, …, z_N:
  h(z_i) = exp(z_i) / Σ_j exp(z_j)
• Prove that 0 < h(z_i) < 1 and Σ_{i=1}^N h(z_i) = 1
Forward Propagation
• Compute the output of the MLP step by step from the input layer to the output layer
[Figure: layers evaluated in order, e.g. sigmoid layer → sigmoid layer → softmax layer]
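A minimal sketch of forward propagation through a small MLP with two sigmoid hidden layers and a softmax output layer, matching the example layer types above (the layer sizes and random weights are arbitrary assumptions for illustration):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

rng = np.random.default_rng(0)
# Weight matrices and biases for a 4 -> 5 -> 5 -> 3 network (arbitrary sizes)
W1, b1 = rng.standard_normal((5, 4)), np.zeros(5)
W2, b2 = rng.standard_normal((5, 5)), np.zeros(5)
W3, b3 = rng.standard_normal((3, 5)), np.zeros(3)

x = rng.standard_normal(4)            # input vector
h1 = sigmoid(W1 @ x + b1)             # first sigmoid layer
h2 = sigmoid(W2 @ h1 + b2)            # second sigmoid layer
y = softmax(W3 @ h2 + b3)             # softmax output layer
print(y, y.sum())                     # a probability distribution over 3 classes
```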
Parameters of Neural Network
• The weights and bias of each unit need training before the network is used
  y = h(Σ_i w_i x_i + b) = h(w·x)
  h: activation function
  w: weight vector, w = (w_1, w_2, …, w_N, b)
  x: input vector (augmented with a constant 1), x = (x_1, x_2, …, x_N, 1)
Principle of NN Training
• Training set: pairs of an input vector and a reference output vector
• Adjust the parameters of the MLP so as to minimize the error
Definitions of Errors
• Sum of square error
  • Used when the output layer uses linear functions
  E(W) = (1/2) Σ_n ||y(X_n, W) − t_n||²
• Cross-entropy
  • Used when the output layer is a softmax
  E(W) = −Σ_n Σ_k t_nk ln y_k(X_n, W)
W: set of weights in the MLP
X_n: vector of a training sample (input)
t_n: vector of a training sample (output)
n: index of training samples
t_nk: reference output (takes 1 if unit k corresponds to the correct output, 0 otherwise)
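A sketch of the two error measures in NumPy, assuming the outputs and reference outputs are stacked as rows of arrays (the array layout and toy values are my assumptions):

```python
import numpy as np

def sum_of_square_error(Y, T):
    # E(W) = 1/2 * sum_n || y(X_n, W) - t_n ||^2
    return 0.5 * np.sum((Y - T) ** 2)

def cross_entropy_error(Y, T):
    # E(W) = - sum_n sum_k t_nk * ln y_nk, with t_nk in {0, 1}
    return -np.sum(T * np.log(Y))

# Toy example: 2 samples, 3 output units
Y = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1]])       # network outputs (softmax)
T = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])       # 1-of-K reference outputs
print(sum_of_square_error(Y, T))
print(cross_entropy_error(Y, T))      # = -(ln 0.7 + ln 0.8)
```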
Gradient Descent
• An iterative optimization method
  x_{t+1} = x_t − ε ∂f(x)/∂x |_{x = x_t}
  ε: learning rate (small positive value)
[Figure: f(x) with iterates x_0 (initial value), x_1, x_2, …, x_N moving toward the minimum]
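A minimal sketch of the update rule x_{t+1} = x_t − ε ∂f/∂x on a simple one-dimensional function (the function f and the settings are arbitrary illustrations, not from the slides):

```python
# Gradient descent on f(x) = (x - 3)^2, whose minimum is at x = 3
def f(x):
    return (x - 3.0) ** 2

def df_dx(x):
    return 2.0 * (x - 3.0)

x = 0.0          # initial value x0
eps = 0.1        # learning rate (small positive value)
for t in range(50):
    x = x - eps * df_dx(x)   # x_{t+1} = x_t - eps * df/dx at x_t
print(x, f(x))   # x approaches 3, f(x) approaches 0
```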
MLP Training by Gradient Descent
• Define an error measure E(W) for the training samples
  e.g. E(W) = (1/2) Σ_n ||y(X_n, W) − t_n||²
• Initialize the parameters W = {w_1, w_2, …, w_M}
• Repeatedly update the parameter set using gradient descent:
  w_i(t+1) = w_i(t) − ε ∂E(W)/∂w_i |_{w_i = w_i(t)}
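A compact sketch of this procedure for the simplest case, a single linear output unit trained with the sum of square error (the data, true weights, learning rate, and iteration count are made-up assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 20 samples, 3 inputs plus a constant 1 for the bias
# (see the "Parameters of Neural Network" slide); w_true is made up
X = np.hstack([rng.standard_normal((20, 3)), np.ones((20, 1))])
w_true = np.array([1.0, -2.0, 0.5, 0.3])
T = X @ w_true                           # reference outputs

w = np.zeros(4)                          # initialize parameters
eps = 0.01                               # learning rate
for step in range(1000):
    Y = X @ w                            # linear output unit
    E = 0.5 * np.sum((Y - T) ** 2)       # sum of square error
    grad = X.T @ (Y - T)                 # dE/dw
    w = w - eps * grad                   # gradient descent update
print(w)                                 # approaches w_true
```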
Chain Rule of Differentiation
• When x, y, z are scalars, with z = f(y), y = g(x):
  ∂z/∂x = (∂z/∂y)(∂y/∂x)
• When x, y, z are vectors, e.g. x = (x_1, x_2, x_3), y = (y_1, y_2), z = (z_1, z_2),
  the same rule holds with Jacobian matrices:
  ∂z/∂x = (∂z/∂y)(∂y/∂x)
  where, e.g., the Jacobian matrix ∂y/∂x is
  [ ∂y_1/∂x_1  ∂y_1/∂x_2  ∂y_1/∂x_3 ]
  [ ∂y_2/∂x_1  ∂y_2/∂x_2  ∂y_2/∂x_3 ]
When There Are Branches
• z = f(y_1, y_2), y_1 = g_1(x), y_2 = g_2(x)
  ∂z/∂x = (∂z/∂y_1)(∂y_1/∂x) + (∂z/∂y_2)(∂y_2/∂x)
• Variations:
  • g_1(x) = x (identity): ∂z/∂x = ∂z/∂y_1 + (∂z/∂y_2)(∂y_2/∂x)
  • g_2(x) = C (independent of x): the second term vanishes, so ∂z/∂x = (∂z/∂y_1)(∂y_1/∂x)
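A numerical check of the branched chain rule above, using an arbitrary choice of f, g_1, g_2 (the finite-difference comparison is my own addition, used only to verify the analytic result):

```python
import numpy as np

# z = f(y1, y2) with y1 = g1(x), y2 = g2(x); arbitrary example functions
g1 = lambda x: x ** 2
g2 = lambda x: np.sin(x)
f  = lambda y1, y2: y1 * y2

def z(x):
    return f(g1(x), g2(x))

x0 = 0.7
# Chain rule: dz/dx = (dz/dy1)(dy1/dx) + (dz/dy2)(dy2/dx)
y1, y2 = g1(x0), g2(x0)
dz_dy1, dz_dy2 = y2, y1                # partial derivatives of f
dy1_dx, dy2_dx = 2 * x0, np.cos(x0)    # derivatives of g1, g2
analytic = dz_dy1 * dy1_dx + dz_dy2 * dy2_dx

# Finite-difference check
h = 1e-6
numeric = (z(x0 + h) - z(x0 - h)) / (2 * h)
print(analytic, numeric)               # the two values agree
```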
Back Propagation (BP)
• Layered computation:
  y_1 = f_1(x, w_1), y_2 = f_2(y_1, w_2), y_3 = f_3(y_2, w_3), y_4 = f_4(y_3, w_4), Err = E(y_4, r)
  (r: reference output)
  Ex.: y_4 = softmax(w_4·y_3), y_3 = sigmoid(w_3·y_2)
• Gradients are obtained by the chain rule, propagating from the output back to the input:
  ∂Err/∂w_4 = (∂Err/∂f_4)(∂f_4/∂w_4)
  ∂Err/∂f_3 = (∂Err/∂f_4)(∂f_4/∂f_3),  ∂Err/∂w_3 = (∂Err/∂f_3)(∂f_3/∂w_3)
  ∂Err/∂f_2 = (∂Err/∂f_3)(∂f_3/∂f_2),  ∂Err/∂w_2 = (∂Err/∂f_2)(∂f_2/∂w_2)
  ∂Err/∂f_1 = (∂Err/∂f_2)(∂f_2/∂f_1),  ∂Err/∂w_1 = (∂Err/∂f_1)(∂f_1/∂w_1)
① Obtain the value of each node by forward propagation
② Obtain the derivatives by backward propagation
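A compact sketch of steps ① and ② for a two-layer network with a sigmoid hidden layer, a softmax output layer, and the cross-entropy error (the network size and data are arbitrary; the gradient formulas follow from the chain rule as above):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

rng = np.random.default_rng(1)
W1 = rng.standard_normal((4, 3)) * 0.1   # hidden layer weights (4 units, 3 inputs)
W2 = rng.standard_normal((3, 4)) * 0.1   # output layer weights (3 classes)
x = np.array([0.5, -1.0, 2.0])
t = np.array([0.0, 1.0, 0.0])            # 1-of-K reference output

# (1) Forward propagation: obtain the value of each node
a1 = W1 @ x
y1 = sigmoid(a1)
a2 = W2 @ y1
y2 = softmax(a2)
err = -np.sum(t * np.log(y2))            # cross-entropy error

# (2) Backward propagation: obtain the derivatives
d_a2 = y2 - t                            # dErr/da2 for softmax + cross-entropy
dW2 = np.outer(d_a2, y1)                 # dErr/dW2
d_y1 = W2.T @ d_a2                       # propagate back to the hidden layer
d_a1 = d_y1 * y1 * (1.0 - y1)            # through the sigmoid
dW1 = np.outer(d_a1, x)                  # dErr/dW1
print(err, dW1.shape, dW2.shape)
```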
Feed-Forward Neural Network
• When the network structure is a DAG (directed acyclic graph), it is called a feed-forward network
• The nodes can be ordered in a line so that all connections point in the same direction
• Forward/backward propagation can then be applied efficiently
Exercise 5.2
• When h(y) and y(x) are given as follows, obtain ∂h/∂x:
  h(y) = 1 / (1 + exp(−y)),   y = ax + b

Answer:
  ∂h/∂x = (∂h/∂y)(∂y/∂x) = a h(ax + b) (1 − h(ax + b))
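A quick numerical check of this result, with arbitrary values of a, b, and x:

```python
import numpy as np

def h(y):
    return 1.0 / (1.0 + np.exp(-y))

a, b, x = 2.0, -0.5, 0.3      # arbitrary values
analytic = a * h(a * x + b) * (1.0 - h(a * x + b))

eps = 1e-6                    # finite-difference check
numeric = (h(a * (x + eps) + b) - h(a * (x - eps) + b)) / (2 * eps)
print(analytic, numeric)      # the two values agree
```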
Recurrent Neural Network (RNN)
• A neural network having a feedback (delay) connection
• Expected to have more powerful modeling performance than a feed-forward MLP, but its training is more difficult
[Figure: input layer and output layer with a delay connection feeding the output back to the input side]
Unfolding of RNN to Time Axis
[Figure: the delay (D) connection is unfolded through time along the input feature sequence]
Training of RNN by BP Through Time (BPTT)
• Regard the whole input sequence as one input and the whole output sequence as one output
• Apply back-propagation to the RNN unfolded along the time axis
[Figure: unfolded network with inputs x_1, …, x_4, hidden states h_1, …, h_4, and outputs y_1, …, y_4; back-propagation runs backward through time]
Long Short-Term Memory (LSTM)
• A type of RNN addressing the gradient vanishing problem
[Figure: LSTM cell with input x_t, output y_t, and cell state c_t (previous values c_{t−1}, y_{t−1} fed back through delays); input, forget, and output gates (sigmoid layers with affine transform), tanh layers with affine transform, and pointwise multiplication ⊗ / addition ⊕]
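A minimal sketch of one LSTM time step following the figure, with per-gate weight matrices acting on the concatenated input and previous output (all names and sizes are my own assumptions; peephole connections are omitted):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, y_prev, c_prev, W, b):
    """One LSTM step: returns the new output y_t and cell state c_t."""
    z = np.concatenate([x_t, y_prev])            # input and previous output
    i = sigmoid(W["i"] @ z + b["i"])             # input gate
    f = sigmoid(W["f"] @ z + b["f"])             # forget gate
    o = sigmoid(W["o"] @ z + b["o"])             # output gate
    g = np.tanh(W["g"] @ z + b["g"])             # candidate (tanh layer)
    c_t = f * c_prev + i * g                     # update the cell state
    y_t = o * np.tanh(c_t)                       # gated output
    return y_t, c_t

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4                               # arbitrary sizes
W = {k: rng.standard_normal((n_hid, n_in + n_hid)) * 0.1 for k in "ifog"}
b = {k: np.zeros(n_hid) for k in "ifog"}
y, c = np.zeros(n_hid), np.zeros(n_hid)
for x_t in rng.standard_normal((5, n_in)):       # a short input sequence
    y, c = lstm_step(x_t, y, c, W, b)
print(y, c)
```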
Convolutional Neural Network (CNN)
• A type of feed-forward neural network with parameter sharing and connection constraints
[Figure: the input is convolved with filters (1)–(N) to produce activation maps (1)–(N), followed by pooling, then the next convolution layer, etc.]
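A small sketch of the two operations in the figure, a 1-D convolution (with shared filter parameters) followed by max pooling (the filter values, sizes, and window choices are arbitrary assumptions):

```python
import numpy as np

def conv1d(x, w):
    # Slide the filter w over x; the same parameters are shared at every position
    n = len(x) - len(w) + 1
    return np.array([np.dot(x[i:i + len(w)], w) for i in range(n)])

def max_pool1d(a, size):
    # Take the maximum in each non-overlapping window of the activation map
    n = len(a) // size
    return np.array([a[i * size:(i + 1) * size].max() for i in range(n)])

x = np.array([1, 3, 3, 4, 2, 1, 3, 5], dtype=float)  # example input
w = np.array([1.0, 0.0, -1.0])                       # one filter (arbitrary)
act = conv1d(x, w)                                   # activation map
print(act)
print(max_pool1d(act, 2))                            # pooled activation map
```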
Deep Neural Network
• (Just a) neural network with many hidden layers (more than roughly 3〜5 layers)
• Training was difficult until recently
• Improvements in training algorithms: pre-training, dropout
• Improvements in computer hardware: GPGPU
• Year 2011: large performance gains were reported for large vocabulary speech recognition
Deep Learning Fever!
Frame Level Vowel Recognition Using MLP
• Outputs: p(あ), p(い), p(う), p(え), p(お) (posteriors of the five Japanese vowels)
• Softmax function (output layer)
• Sigmoid functions (hidden layers)
• Input: speech feature vector (e.g. MFCC)
Combination of HMM and MLP
• GMM-HMM: each HMM state s emits features with a GMM output distribution:
  p(X|s) = GMM(X|s)
• MLP-HMM: an MLP with a softmax output layer gives the state posterior, which is converted to a scaled likelihood:
  p(X|s) ∝ p(s|X) / p(s) = MLP(s|X) / p(s)
[Figure: HMM with states s_0, s_1, s_2, s_3, s_4, comparing a GMM-HMM and an MLP-HMM (softmax layer)]
MLP-HMM based Phone Recognizer
• Outputs: phone classes /a/, /i/, …, /N/
• Softmax output layer, sigmoid hidden layers
• Input: speech feature
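A minimal sketch of the posterior-to-scaled-likelihood conversion p(X|s) ∝ MLP(s|X)/p(s) used by the MLP-HMM above (the posterior and prior values here are made up for illustration):

```python
import numpy as np

# MLP output: posterior p(s|X) over HMM states for one input frame (made up)
posterior = np.array([0.70, 0.20, 0.10])
# State prior p(s), e.g. estimated from state frequencies in the training data
prior = np.array([0.50, 0.30, 0.20])

# Scaled likelihood used in place of the GMM output probability
scaled_likelihood = posterior / prior
log_scaled = np.log(posterior) - np.log(prior)   # usually handled in the log domain
print(scaled_likelihood, log_scaled)
```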
Word Vector
• One-of-K (one-hot) representation of a word for a fixed vocabulary
[Table: word, word ID, 1-of-K vector]
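A small sketch of the one-of-K (one-hot) representation for a fixed vocabulary (the example vocabulary is my own):

```python
import numpy as np

vocab = ["<s>", "</s>", "apple", "red", "big", "delicious"]   # fixed vocabulary
word_to_id = {w: i for i, w in enumerate(vocab)}              # word -> ID

def one_hot(word):
    # 1-of-K vector: 1 at the word's ID, 0 elsewhere
    v = np.zeros(len(vocab))
    v[word_to_id[word]] = 1.0
    return v

print(word_to_id["red"], one_hot("red"))   # e.g. ID 3 -> [0 0 0 1 0 0]
```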
Word Prediction Using RNN
• Input: one-of-K vector of the previous word Word_{t-1}, e.g. <0, 0, 0, 1, 0, 0, 0>
• Output: predicted distribution over the next word, e.g. <0.02, 0.65, 0.14, 0.11, 0.05, 0.01, 0.02>
• The delay (D) connection carries the history of the preceding words
RNN Language Model (Unfolded)
[Figure: the RNN unfolded over an example sentence with the words "Delicious", "Big", "Red", "Apple" between <s> and </s>; at each step the current word is input and the next word is predicted]
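A minimal sketch of one step of an RNN language model: the one-hot vector of the previous word and the delayed hidden state produce a distribution over the next word (the weights, sizes, and vocabulary indexing are arbitrary assumptions):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

rng = np.random.default_rng(0)
V, H = 6, 8                                   # vocabulary size, hidden size (arbitrary)
W_xh = rng.standard_normal((H, V)) * 0.1      # input (one-hot word) -> hidden
W_hh = rng.standard_normal((H, H)) * 0.1      # delayed hidden -> hidden (feedback)
W_hy = rng.standard_normal((V, H)) * 0.1      # hidden -> output distribution

def rnn_lm_step(x_onehot, h_prev):
    h = np.tanh(W_xh @ x_onehot + W_hh @ h_prev)   # new hidden state
    y = softmax(W_hy @ h)                          # p(next word | history)
    return y, h

h = np.zeros(H)
x = np.zeros(V); x[3] = 1.0                   # one-hot vector of the previous word
y, h = rnn_lm_step(x, h)
print(np.round(y, 3), y.sum())                # a distribution over the vocabulary
```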
Dialogue System Using Seq2Seq Network
[Figure: the encoder network reads the input "What is your name"; the decoder network then generates the reply "My name is TS-800 </s>" word by word, sampling each word from the posterior]