Speech and Language Processing
Lecture 1
Speech recognition based on GMM, HMM, and N-gram
Information and Communications Engineering Course
Lecture Plan (Shinozaki’s part)
1. Speech recognition based on GMM, HMM, and N-gram
2. Maximum likelihood estimation and EM algorithm
3. Bayesian network and Bayesian inference
4. Variational inference and sampling
5. Neural network based acoustic and language models
6. Weighted finite state transducer (WFST) and speech decoding
Handouts
• All the materials are available at my home page:
Perception of Speech Sound
• Speech sound is very special for humans. You may feel it is completely different from non-speech sounds
• However, it is still a sound made by a physical process
• Frequency and time patterns are important
[Figure: waveform of a speech sound pronouncing "Onsei"; the region near "n" is cut out]
Utterance Generation
[Figure: the vocal tract, consisting of the lungs, trachea, vocal cords, and mouth]
• The air flows in the pipe; the vocal cords vibrate and generate a beep-like source sound G(ω)
• The transfer characteristic H(ω) is modulated by moving the mouth
Spectral Analysis
Fourier transform
[Figure: log power spectrum of a speech frame; x-axis: frequency (Hz), 0–4000; y-axis: log power]
Spectral Envelope
The spectral envelope corresponds to H(ω) and contains phonological information
Vowels and Spectral Envelopes
[Figure: spectral envelopes of the vowels /a/ and /i/]
Experiment: Replacing the Sound Source
[Figure: decomposition of a voice into the sound source G(ω) and the transfer characteristic H(ω)]
1. Record a voice
2. Analyze the voice, and decompose it into the sound source G and the transfer characteristic H: X(ω) (original voice) = G(ω) × H(ω)
4. Replace the sound source G with another one, G′ (e.g., a sawtooth wave)
5. Compute G′(ω) × H(ω) and regenerate the waveform
Synthesized Voice Changing
Each demo is synthesized as G′(ω)H(ω) with a different source G′:
• Sawtooth G′ (100 Hz)
• Sawtooth G′ (300 Hz)
• Sawtooth G′ (500 Hz)
• Music G′ (Acoustic11*)
*The music is from: http://maoudamashii.jokersounds.com/
History of Speech Technology
Timeline:
• 1791 Kempelen's talking machine
• 1857 Phonautograph
• 1876 Telephone
• 1920s Radio broadcast; Radio Rex (voice-activated toy)
• 1939 Vocoder
• 1946 ENIAC
• 1952 Speech recognizer (digit recognition)
• 1969 Internet
• 1990 WWW
• 2011 Siri on iOS
Applications of Speech Recognition
• Smartphone
  • Voice assistants
  • Speech-to-speech translation
• Judiciary
  • Speech retrieval system to support citizen judges
• Television
  • Automatic captioning system
• Car navigation
  • Voice commands
• Toy robots
Feature Extraction
Extract useful information from the input signal in a form convenient for pattern recognition
• Helps improve pattern recognition performance
• Reduces unnecessary memory and processing costs
[Figure: waveform over time converted by feature extraction into a feature sequence]
Mel-Frequency Cepstrum (MFC)
• Widely used features for speech recognition
• Emulates the perceptual scale of pitches (mel-scale filter bank)
Processing pipeline:
Speech sound (sequence of waveform samples) → Windowing → |DFT| → Mel-filter bank → Log → IDFT → Liftering
Typical Feature Extraction Process
• 16 kHz sampling, 16-bit quantization
• Window width: 32 ms (= 512 samples at 16 kHz), shift: 10 ms
• Output: a feature sequence, i.e., a sequence of real-valued vectors at a rate of 100 Hz
• Each vector is called a "frame"
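To make the pipeline concrete, here is a minimal NumPy-only sketch of the windowing / |DFT| / mel filter bank / log / IDFT / liftering chain, using this slide's 16 kHz, 32 ms window, 10 ms shift settings. The Hamming window, 24 filters, 12 cepstral coefficients, mel formula, and lifter constant are common defaults assumed for illustration, not values specified in the lecture.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters spaced evenly on the mel scale.
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return fb

def extract_features(signal, sr=16000, win=512, shift=160,
                     n_filters=24, n_ceps=12):
    """Windowing -> |DFT| -> mel filter bank -> log -> IDFT -> liftering."""
    window = np.hamming(win)
    fb = mel_filterbank(n_filters, win, sr)
    lifter = 1.0 + 11.0 * np.sin(np.pi * np.arange(n_ceps) / 22.0)  # common L=22 lifter
    frames = []
    for start in range(0, len(signal) - win + 1, shift):  # 10 ms shift -> 100 Hz rate
        spec = np.abs(np.fft.rfft(signal[start:start + win] * window))      # |DFT|
        logmel = np.log(fb @ spec + 1e-10)                # mel filter bank + log
        ceps = np.fft.irfft(logmel, n=2 * (n_filters - 1))[:n_ceps]  # IDFT (cepstrum)
        frames.append(ceps * lifter)                      # liftering
    return np.stack(frames)  # one 12-dimensional vector ("frame") per 10 ms

feats = extract_features(np.random.randn(16000))  # 1 s of noise
print(feats.shape)                                # -> (97, 12), i.e., ~100 frames/s
```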
Speech Decoding
• O: input acoustic features (or a feature sequence)
• W: a symbol (or a symbol sequence) to recognize, e.g., a phone, a word, a word sequence, etc.
[Figure: a pattern recognizer maps the input O to an output symbol W]
Statistical Speech Recognition
• Use a probability distribution to model speech sounds:

$$\hat{W} = \arg\max_{W} P(W \mid O)$$
Acoustic Model and Language Model
• By using Bayes' theorem, the probability is decomposed into two parts:

$$P(W \mid O) = \frac{P(O \mid W)\,P(W)}{P(O)}$$

• P(O) is independent of the maximization over W, and can be ignored
Problem Settings
• Frame-wise vowel recognition
  • O: feature vector of a single frame
  • W: one of the vowels (for Japanese: a, i, u, e, o)
• Isolated phone recognition
  • O: a sequence of feature vectors of a segment of a phone utterance
  • W: one of the phones
• Isolated word recognition
  • O: a sequence of feature vectors of a segment of a word utterance
  • W: one of the words in a vocabulary
• Continuous word recognition
Frame-wise Vowel Recognition
$$\hat{W} = \arg\max_{W} P(W \mid O) = \arg\max_{W} \frac{P(O \mid W)\,P(W)}{P(O)} = \arg\max_{W} P(O \mid W)\,P(W)$$

• W ∈ {a, i, u, e, o}
• O: a feature vector
• The conditional probability can be modeled by preparing P_W(O) separately for each vowel
[Figure: feature sequence over time labeled with the vowels a, i, u, i]
Acoustic and Language Modeling
• Acoustic model (AM): P(O|W), the probability distribution of O given W
  Ex. Gaussian distribution, Gaussian mixture model, etc.
• Language model (LM): P(W), the probability distribution of W
  Ex. Categorical distribution, etc.
Gaussian Distribution
• Defined by two parameters: mean μ and standard deviation σ (σ² is the variance)

$$N(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

It satisfies:

$$0 < N(x \mid \mu, \sigma^2), \qquad \int_{-\infty}^{\infty} N(x \mid \mu, \sigma^2)\,dx = 1$$

[Figure: plots of Gaussian density functions]
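The two properties can be checked numerically; a minimal sketch assuming only NumPy:

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    # N(x | mu, sigma^2) as defined above
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

x = np.linspace(-10.0, 10.0, 200001)
p = gaussian_pdf(x, mu=0.0, var=1.0)
print(p.min() > 0.0)              # positivity: N(x | mu, sigma^2) > 0
print(np.sum(p) * (x[1] - x[0]))  # Riemann sum of the integral ~ 1.0
```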
Multivariate Gaussian Distribution
• For a D-dimensional vector x, it is defined using a mean vector μ and a covariance matrix S:

$$N(\mathbf{x} \mid \boldsymbol{\mu}, \mathbf{S}) = \frac{1}{(2\pi)^{D/2}\,|\mathbf{S}|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{T} \mathbf{S}^{-1} (\mathbf{x}-\boldsymbol{\mu})\right)$$

|S| denotes the determinant of S
[Figure: contour plot of an example of a two-dimensional Gaussian distribution]
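A direct transcription of the formula (a sketch assuming only NumPy):

```python
import numpy as np

def mvn_pdf(x, mu, S):
    # N(x | mu, S) for a D-dimensional vector x
    D = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(S, diff)          # (x-mu)^T S^{-1} (x-mu)
    norm = np.sqrt((2 * np.pi) ** D * np.linalg.det(S))
    return np.exp(-0.5 * quad) / norm

# At the mean of a 2-D standard Gaussian the density is 1/(2*pi) ~ 0.159
print(mvn_pdf(np.zeros(2), np.zeros(2), np.eye(2)))
```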
Gaussian Distribution based AM
• Fit a Gaussian distribution for each vowel W:

$$P_W(O) = N(O \mid \mu_W, \sigma_W^2) = \frac{1}{\sqrt{2\pi\sigma_W^2}} \exp\left(-\frac{(O-\mu_W)^2}{2\sigma_W^2}\right)$$

[Figure: five Gaussians, one fitted to each of the vowels a, e, i, o, u]
We will consider how to fit the distributions later, in the lecture on maximum likelihood estimation.
Gaussian Mixture Model (GMM)
• By mixing multiple Gaussian distributions, a complex distribution can be expressed
• Useful for improving recognition performance

$$GMM(X) = \sum_i w_i\, N(X \mid \mu_i, S_i)$$

• $w_i$: mixture weight, satisfying $\sum_{m=1}^{M} w_m = 1.0$
• $N$: component Gaussian distribution with mean μ_i and covariance S_i
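A sketch of the weighted-sum definition; the two-component weights, means, and variances below are made-up illustration values, not fitted parameters:

```python
import numpy as np

def gmm_pdf(x, weights, mus, variances):
    # Weighted sum of univariate Gaussian components
    comps = [w * np.exp(-(x - m) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)
             for w, m, v in zip(weights, mus, variances)]
    return sum(comps)

# Two-component mixture; the weights sum to 1.0 as required.
print(gmm_pdf(0.5, weights=[0.3, 0.7], mus=[0.0, 1.0], variances=[1.0, 0.25]))
```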
Categorical Distribution
• The distribution is represented by a table
• The probability distribution of a skewed die is an example of a categorical distribution

Vowel:       a    i    u    e    o
Probability: 0.3  0.1  0.2  0.1  0.3
1-of-K Representation
• The same probability as the table description can be represented as an equation by using the 1-of-K representation
Value W | 1-of-K representation W=(w1, w2, w3, w4, w5) | Probability ρ=(ρ1, ρ2, ρ3, ρ4, ρ5)
1 (a)   | 1,0,0,0,0 | Pr(W=1) = ρ1 = 0.3
2 (i)   | 0,1,0,0,0 | Pr(W=2) = ρ2 = 0.1
3 (u)   | 0,0,1,0,0 | Pr(W=3) = ρ3 = 0.2
4 (e)   | 0,0,0,1,0 | Pr(W=4) = ρ4 = 0.1
5 (o)   | 0,0,0,0,1 | Pr(W=5) = ρ5 = 0.3
$$p(W) = \prod_{k=1}^{K} \rho_k^{w_k}$$
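Evaluating the product formula for the table's values (the chosen W here is arbitrary):

```python
import numpy as np

rho = np.array([0.3, 0.1, 0.2, 0.1, 0.3])  # (rho_1 ... rho_5) from the table
w = np.array([0, 0, 1, 0, 0])              # 1-of-K code for W=3 (u)
print(np.prod(rho ** w))                   # = rho_3 = 0.2
```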
Example
• A vowel recognition system based on:
  • Gaussian distribution based acoustic model
  • Categorical distribution based language model

$$\hat{W} = \arg\max_{W \in \{a,i,u,e,o\}} P(O \mid W)\,P(W) = \arg\max_{W \in \{a,i,u,e,o\}} N(O \mid \mu_W, \Sigma_W) \prod_k \rho_k^{w_k}$$
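A minimal sketch of this recognizer, using one-dimensional features for brevity; all means, variances, and priors below are made-up placeholders, not trained values:

```python
import numpy as np

vowels = ["a", "i", "u", "e", "o"]
mu    = {"a": 0.0, "i": 1.0, "u": 2.0, "e": 3.0, "o": 4.0}  # per-vowel mean (assumed)
var   = {v: 1.0 for v in vowels}                            # per-vowel variance (assumed)
prior = {"a": 0.3, "i": 0.1, "u": 0.2, "e": 0.1, "o": 0.3}  # categorical LM

def log_gaussian(o, m, v):
    return -0.5 * (np.log(2 * np.pi * v) + (o - m) ** 2 / v)

def recognize(o):
    # argmax_W P(O|W) P(W), computed in the log domain for numerical stability
    return max(vowels, key=lambda w: log_gaussian(o, mu[w], var[w]) + np.log(prior[w]))

print(recognize(1.2))  # most likely vowel for this single frame
```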
Isolated Phone Recognition
• O: a sequence of feature vectors of a segment of a phone utterance
• W: one of the phones in a language
Example: [Figure: input feature vector sequence → output phone symbol "p"]
Differences from the Frame Level Vowel Recognition
• Acoustic model
  • Input is a sequence of feature vectors
  • Probability must be assigned to variable-length sequences
  • For most phones, the frequency pattern has a structure along the time axis
• Language model
  • The number of phones is larger than the number of vowels (but actually it's not a big difference)
    • # vowels = 5 (for Japanese)
    • # phones ~ 40 (for Japanese)
  • A categorical distribution can be used similarly to the vowel case
Hidden Markov Model (HMM)
• A probabilistic model for sequential data
• Defined by a set of states, state transition probabilities, and state emission probabilities
[Figure: example of an HMM having 2 states; each state i has an emission probability p_{S=i}(o), and the transition probabilities are 0.8, 0.2, 0.4, 0.6]
HMM based Acoustic Model
• Popular HMM design for acoustic modeling:
  • Left-to-right transition structure
  • Non-emitting start and end states
[Figure: example of a 3-state (N=3) left-to-right HMM with non-emitting start state 0 and end state 4; transition probabilities 1.0 (0→1), 0.8 (1→1), 0.2 (1→2), 0.7 (2→2), 0.3 (2→3), 0.9 (3→3), 0.1 (3→4)]
$$P(O) = \sum_{s_1,\dots,s_T} \left[ \prod_{t=1}^{T} P(s_t \mid s_{t-1})\, P(o_t \mid s_t) \right] P(s_{Fin} \mid s_T), \qquad s_0 = 0,\; s_{Fin} = N+1,\; s_t \in \{1,2,\dots,N\}$$
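Summing over all state sequences directly is exponential in T, but the same quantity can be computed efficiently with the forward algorithm. Below is a minimal sketch using the transition probabilities of the 3-state left-to-right example above; the Gaussian emission parameters are made-up placeholders:

```python
import numpy as np

# States 1..3 are emitting; 0 is the non-emitting start state, 4 the end state.
# A[i, j] = P(s_t = j | s_{t-1} = i), following the figure above.
A = np.array([
    [0.0, 1.0, 0.0, 0.0, 0.0],  # start -> state 1
    [0.0, 0.8, 0.2, 0.0, 0.0],
    [0.0, 0.0, 0.7, 0.3, 0.0],
    [0.0, 0.0, 0.0, 0.9, 0.1],  # state 3 -> end with prob 0.1
    [0.0, 0.0, 0.0, 0.0, 0.0],
])
mus, variances = [0.0, 1.0, 2.0], [1.0, 1.0, 1.0]  # emission Gaussians (assumed)

def emit(j, o):
    # P(o | s = j) for emitting states j = 1..3
    v = variances[j - 1]
    return np.exp(-(o - mus[j - 1]) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)

def forward(O):
    alpha = np.zeros(5)
    alpha[0] = 1.0                      # start in the non-emitting state 0
    for o in O:
        new = np.zeros(5)
        for j in range(1, 4):           # emitting states only
            new[j] = alpha @ A[:, j] * emit(j, o)
        alpha = new
    return alpha @ A[:, 4]              # final transition into the end state

print(forward([0.1, 0.3, 1.2, 1.9, 2.1]))  # P(O) for a 5-frame observation
```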
Example
•
An isolated phone recognition system based on
• HMM based acoustic model
• Categorical distribution based language model
{
}
(
)
{
}
(
)
∏
∈ ∈=
=
k w k W phones W phones W kO
HMM
O
W
P
W
ρ
λ
|
max
arg
|
max
arg
ˆ
Modeled by a set of HMMs prepared for each W
Modeled by a
categorical distribution All parameters of an HMM: Transition probabilities and
Isolated Word Recognition
• O: a sequence of feature vectors of a segment of a word utterance
• W: one of the words in a vocabulary
Example: [Figure: input feature vector sequence → output word "hello"]
Difference from the Isolated Phone Recognition
• Basically no difference
  • But the number of words can be much larger than the number of phones, e.g., # phones ~ 40, # words ~ 10000
• Acoustic modeling (there are two popular strategies)
  • Word HMM: prepare an HMM for each word
  • Phone based HMM: prepare a phone HMM set, and compose a word HMM from the phone HMMs
• Language modeling
Phone based Word Modeling
• When the number of words is large, preparing an HMM for each word is difficult since the number of parameters increases
• Phone based modeling composes arbitrary word models by concatenating phone models, as sketched below
[Figure: word models for "hello" and "hero" composed from phone HMMs such as /HH/, /AH/, /OW/]
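As a minimal illustration of the composition idea, the sketch below builds a word model's state sequence by chaining phone HMMs according to a pronunciation lexicon. The pronunciations (CMUdict-style) and the 3-states-per-phone layout are illustrative assumptions:

```python
# Pronunciation lexicon (assumed, CMUdict-style)
lexicon = {
    "hello": ["HH", "AH", "L", "OW"],
    "hero":  ["HH", "IH", "R", "OW"],
}

def compose_word_hmm(word, phone_hmms, lexicon):
    """Chain the phone HMMs in pronunciation order: the non-emitting end
    state of each phone model is merged with the start state of the next."""
    states = []
    for phone in lexicon[word]:
        states.extend(phone_hmms[phone])  # e.g., 3 emitting states per phone
    return states

# With 3-state phone HMMs, "hello" becomes a 12-state left-to-right model:
phones = ["HH", "AH", "L", "OW", "IH", "R"]
phone_hmms = {p: [f"{p}_{i}" for i in range(1, 4)] for p in phones}
print(compose_word_hmm("hello", phone_hmms, lexicon))
```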
Example
• An isolated word recognition system based on:
  • HMM based acoustic model
  • Categorical distribution based language model

$$\hat{W} = \arg\max_{W \in \{\text{words}\}} P(O \mid W)\,P(W) = \arg\max_{W \in \{\text{words}\}} HMM_{\lambda_W}(O) \prod_k \rho_k^{w_k}$$

• P(O|W) is modeled by a set of HMMs prepared for each W
• P(W) is modeled by a categorical distribution
Continuous Word Recognition
• O: a sequence of feature vectors of an utterance
• W: an arbitrary word sequence of an utterance
Example: [Figure: input feature vector sequence → output word sequence "I want to walk"]
Difference from the Isolated Word Recognition
• Drastic increase in complexity
  • Utterance length (# of words in an utterance) L is unknown
  • # categories to recognize is exponential in L, i.e., V^L, where V is the vocabulary size
• Acoustic modeling
  • Phone based HMM: prepare a phone HMM set, and compose an utterance HMM from the phone HMMs
• Language modeling
  • Need to assign probability to word sequences of arbitrary length
N-gram Model
• Assumes that the appearance of a word in an utterance depends on at most the N−1 preceding words as context
• Represented by a set of categorical distributions prepared for each context

$$P(w_1 w_2 \cdots w_T) = \prod_{t=1}^{T} P(w_t \mid w_1 \cdots w_{t-1}) \approx \prod_{t=1}^{T} P(w_t \mid w_{t-N+1} \cdots w_{t-1})$$

N-gram approximation:
Ignore history (or context) older than N-1 words
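As a minimal sketch, the truncated-history product can be written directly. Here `prob(word, context)` stands for a hypothetical lookup into trained N-gram tables (one categorical distribution per context); it is not part of any real library:

```python
def ngram_sentence_prob(words, N, prob):
    # Score a word sequence as a product of conditionals over at most
    # the N-1 preceding words (the N-gram approximation above).
    p = 1.0
    for t, w in enumerate(words):
        context = tuple(words[max(0, t - (N - 1)):t])  # truncated history
        p *= prob(w, context)
    return p

# With N=3 (trigram), the context passed to `prob` is at most two words:
# ngram_sentence_prob(["today", "is", "a", "sunny", "day"], 3, prob)
```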
Unigram
• N = 1
  • Do not consider the history at all
  • Same as the product of individual word probabilities

$$P(w_1 w_2 \cdots w_T) \approx \prod_{t=1}^{T} P(w_t)$$

Example: P("Today is a sunny day") = P(today)P(is)P(a)P(sunny)P(day)
Bi-gram
• N = 2
  • Consider only the previous word as the history

Example: P("Today is a sunny day") = P(today)P(is|today)P(a|is)P(sunny|a)P(day|sunny)
Tri-gram
• N = 3
  • Consider the two previous words as the history
  • Popular in speech recognition

Example: P("Today is a sunny day") = P(today)P(is|today)P(a|today, is)P(sunny|is, a)P(day|a, sunny)
Example
• A continuous word recognition system based on:
  • HMM based acoustic model
  • N-gram based language model

$$\hat{W} = \arg\max_{W \in \{\text{utterances}\}} P(O \mid W)\,P(W) = \arg\max_{W \in \{\text{utterances}\}} HMM_{\lambda_W}(O)\, Ngram(W)$$
Problem: How to Perform argmax?
• For continuous word recognition, the number of utterances is huge, even infinite if the utterance length L is not bounded
  • E.g., if the vocabulary size V is 10000 and the utterance length L is 10, the number of utterances is 10000^10 = 10^40
• Enumerating all the utterances is impossible!!!

$$\hat{W} = \arg\max_{W \in \{\text{utterances}\}} HMM_{\lambda_W}(O)\, Ngram(W)$$

How to do this?
Exercise 1.1
Suppose W is a vowel and O is an MFCC feature vector. Suppose that P_AM(O|W) is an acoustic model and P_LM(W) is a language model. Obtain the vowel Ŵ that maximizes P(W|O) when the acoustic and language model log likelihoods are given as in the following table.

$$\hat{W} = \arg\max_{W \in \{a,i,u,e,o\}} \left\{ P(W \mid O) \right\}$$

Vowel V:     a      i      u      e      o
log P(O|V): -13.4  -10.5  -30.1  -15.2  -17.0
log P(V):   -1.61  -2.30  -1.61  -1.39  -1.39
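A minimal sketch of the required computation: add the acoustic and language model log likelihoods per vowel and take the argmax (running it yields the answer):

```python
log_am = {"a": -13.4, "i": -10.5, "u": -30.1, "e": -15.2, "o": -17.0}  # log P(O|V)
log_lm = {"a": -1.61, "i": -2.30, "u": -1.61, "e": -1.39, "o": -1.39}  # log P(V)

# log P(V|O) = log P(O|V) + log P(V) + const, so the constant can be ignored.
scores = {v: log_am[v] + log_lm[v] for v in log_am}
print(max(scores, key=scores.get), scores)
```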
Exercise 1.2
• The following table defines a bigram P(Word|Context):
C\W today is sunny End
Start 0.6 0.1 0.2 0.1
today 0.1 0.5 0.3 0.1
is 0.1 0.1 0.7 0.1
sunny 0.1 0.1 0.2 0.6
Example: P("Start today is sunny End") = P(Start) × P(today|Start) × P(is|today) × P(sunny|is) × P(End|sunny) = 1.0 × 0.6 × 0.5 × 0.7 × 0.6 = 0.126
*P(Start) = 1.0
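A minimal sketch that encodes the bigram table above and reproduces this worked example:

```python
# Rows are contexts, columns are words, exactly as in the table.
bigram = {
    "Start": {"today": 0.6, "is": 0.1, "sunny": 0.2, "End": 0.1},
    "today": {"today": 0.1, "is": 0.5, "sunny": 0.3, "End": 0.1},
    "is":    {"today": 0.1, "is": 0.1, "sunny": 0.7, "End": 0.1},
    "sunny": {"today": 0.1, "is": 0.1, "sunny": 0.2, "End": 0.6},
}

def sentence_prob(words):
    # P(Start) = 1.0; multiply P(w_t | w_{t-1}) along the sentence.
    p = 1.0
    for context, word in zip(words, words[1:]):
        p *= bigram[context][word]
    return p

print(sentence_prob(["Start", "today", "is", "sunny", "End"]))  # -> 0.126
```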
Exercise 1.2 (Cont.)
• Based on the bigram definition of the previous slide, compute the probability of the following sentences:
1) P(“Start today sunny today sunny End”)
= 0.6*0.3*0.1*0.3*0.6 = 0.00324
2) P(“Start today today sunny sunny End”)