Speech and Language Processing
Lecture 1
Speech recognition based on GMM, HMM, and N-gram
Information and Communications Engineering Course
Lecture Plan (Shinozaki’s part)
1. Speech recognition based on GMM, HMM, and N-gram
2. Maximum likelihood estimation and EM algorithm
3. Bayesian network and Bayesian inference
4. Variational inference and sampling
5. Neural network based acoustic and language models
6. Weighted finite state transducer (WFST) and speech decoding
Handouts
• All the materials are available at my home page:
Perception of Speech Sound
• Speech sound is very special for humans. You may feel it is completely different from non-speech sounds
• However, it is still a sound made by a physical process
• Frequency and time patterns are important
[Figure: waveform of a speech sound pronouncing "Onsei"; the region near "n" is cut out]
Utterance Generation
[Figure: the vocal tract, consisting of the lungs, trachea, vocal cords, and mouth]
• The air flows in the pipe; the vocal cords vibrate and generate a beep-like source sound G(ω)
• The transfer characteristic H(ω) is modulated by moving the mouth
Spectral Analysis
Fourier transform
[Figure: log power spectrum of a speech frame; x-axis: frequency (Hz), 0–4000; y-axis: log power]
Spectral Envelope
The spectral envelope corresponds to H(ω) and contains phonological information
Vowels and Spectral Envelopes
[Figure: spectral envelopes of the vowels /a/ and /i/]
Experiment: Replacing the Sound Source
[Figure: decomposition of a voice into the sound source G(ω) and the transfer characteristic H(ω)]
1. Record a voice
2. Analyze the voice, and decompose it into the sound source G and the transfer characteristic H: X(ω) (original voice) = G(ω) × H(ω)
4. Replace the sound source G with another one, G′ (e.g., a sawtooth wave)
5. Compute G′(ω) × H(ω) and regenerate the waveform
Synthesized Voice Changing
Each demo is synthesized as G′(ω)H(ω) with a different source G′:
• Sawtooth G′ (100 Hz)
• Sawtooth G′ (300 Hz)
• Sawtooth G′ (500 Hz)
• Music G′ (Acoustic11*)
*The music is from: http://maoudamashii.jokersounds.com/
History of Speech Technology
Timeline:
• 1791 Kempelen's talking machine
• 1857 Phonautograph
• 1876 Telephone
• 1920s Radio broadcast; Radio Rex (voice-activated toy)
• 1939 Vocoder
• 1946 ENIAC
• 1952 Speech recognizer (digit recognition)
• 1969 Internet
• 1990 WWW
• 2011 Siri on iOS
Applications of Speech Recognition
• Smartphone
  • Voice assistants
  • Speech-to-speech translation
• Judiciary
  • Speech retrieval system to support citizen judges
• Television
  • Automatic captioning system
• Car navigation
  • Voice commands
• Toy robots
Feature Extraction
Extract useful information from the input signal in a form convenient for pattern recognition
• Helps improve pattern recognition performance
• Reduces unnecessary memory and processing costs
[Figure: waveform over time converted by feature extraction into a feature sequence]
Mel-Frequency Cepstrum (MFC)
• Widely used features for speech recognition
• Emulates the perceptual scale of pitches (mel-scale filter bank)
Processing pipeline:
Speech sound (sequence of waveform samples) → Windowing → |DFT| → Mel-filter bank → Log → IDFT → Liftering
Typical Feature Extraction Process
• 16 kHz sampling, 16-bit quantization
• Window width: 32 ms (= 512 samples at 16 kHz), shift: 10 ms
• Output: a feature sequence, i.e., a sequence of real-valued vectors at a rate of 100 Hz
• Each vector is called a "frame"
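To make the pipeline concrete, here is a minimal NumPy-only sketch of the windowing / |DFT| / mel filter bank / log / IDFT / liftering chain, using this slide's 16 kHz, 32 ms window, 10 ms shift settings. The Hamming window, 24 filters, 12 cepstral coefficients, mel formula, and lifter constant are common defaults assumed for illustration, not values specified in the lecture.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters spaced evenly on the mel scale.
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return fb

def extract_features(signal, sr=16000, win=512, shift=160,
                     n_filters=24, n_ceps=12):
    """Windowing -> |DFT| -> mel filter bank -> log -> IDFT -> liftering."""
    window = np.hamming(win)
    fb = mel_filterbank(n_filters, win, sr)
    lifter = 1.0 + 11.0 * np.sin(np.pi * np.arange(n_ceps) / 22.0)  # common L=22 lifter
    frames = []
    for start in range(0, len(signal) - win + 1, shift):  # 10 ms shift -> 100 Hz rate
        spec = np.abs(np.fft.rfft(signal[start:start + win] * window))      # |DFT|
        logmel = np.log(fb @ spec + 1e-10)                # mel filter bank + log
        ceps = np.fft.irfft(logmel, n=2 * (n_filters - 1))[:n_ceps]  # IDFT (cepstrum)
        frames.append(ceps * lifter)                      # liftering
    return np.stack(frames)  # one 12-dimensional vector ("frame") per 10 ms

feats = extract_features(np.random.randn(16000))  # 1 s of noise
print(feats.shape)                                # -> (97, 12), i.e., ~100 frames/s
```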
Speech Decoding
• O: input acoustic features (or a feature sequence)
• W: a symbol (or a symbol sequence) to recognize, e.g., a phone, a word, a word sequence, etc.
[Figure: a pattern recognizer maps the input O to an output symbol W]
Statistical Speech Recognition
• Use a probability distribution to model speech sounds:

$$\hat{W} = \arg\max_{W} P(W \mid O)$$
Acoustic Model and Language Model
• By using Bayes' theorem, the probability is decomposed into two parts:

$$P(W \mid O) = \frac{P(O \mid W)\,P(W)}{P(O)}$$

• P(O) is independent of the maximization over W, and can be ignored
Problem Settings
• Frame-wise vowel recognition
  • O: feature vector of a single frame
  • W: one of the vowels (for Japanese: a, i, u, e, o)
• Isolated phone recognition
  • O: a sequence of feature vectors of a segment of a phone utterance
  • W: one of the phones
• Isolated word recognition
  • O: a sequence of feature vectors of a segment of a word utterance
  • W: one of the words in a vocabulary
• Continuous word recognition
Frame-wise Vowel Recognition
$$\hat{W} = \arg\max_{W} P(W \mid O) = \arg\max_{W} \frac{P(O \mid W)\,P(W)}{P(O)} = \arg\max_{W} P(O \mid W)\,P(W)$$

• W ∈ {a, i, u, e, o}
• O: a feature vector
• The conditional probability can be modeled by preparing P_W(O) separately for each vowel
[Figure: feature sequence over time labeled with the vowels a, i, u, i]
Acoustic and Language Modeling
• Acoustic model (AM): P(O|W), the probability distribution of O given W
  Ex. Gaussian distribution, Gaussian mixture model, etc.
• Language model (LM): P(W), the probability distribution of W
  Ex. Categorical distribution, etc.
Gaussian Distribution
• Defined by two parameters: mean μ and standard deviation σ (σ² is the variance)

$$N(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

It satisfies:

$$0 < N(x \mid \mu, \sigma^2), \qquad \int_{-\infty}^{\infty} N(x \mid \mu, \sigma^2)\,dx = 1$$

[Figure: plots of Gaussian density functions]
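The two properties can be checked numerically; a minimal sketch assuming only NumPy:

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    # N(x | mu, sigma^2) as defined above
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

x = np.linspace(-10.0, 10.0, 200001)
p = gaussian_pdf(x, mu=0.0, var=1.0)
print(p.min() > 0.0)              # positivity: N(x | mu, sigma^2) > 0
print(np.sum(p) * (x[1] - x[0]))  # Riemann sum of the integral ~ 1.0
```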
Multivariate Gaussian Distribution
• For a D-dimensional vector x, it is defined using a mean vector μ and a covariance matrix S:

$$N(\mathbf{x} \mid \boldsymbol{\mu}, \mathbf{S}) = \frac{1}{(2\pi)^{D/2}\,|\mathbf{S}|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{T} \mathbf{S}^{-1} (\mathbf{x}-\boldsymbol{\mu})\right)$$

|S| denotes the determinant of S
[Figure: contour plot of an example of a two-dimensional Gaussian distribution]
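A direct transcription of the formula (a sketch assuming only NumPy):

```python
import numpy as np

def mvn_pdf(x, mu, S):
    # N(x | mu, S) for a D-dimensional vector x
    D = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(S, diff)          # (x-mu)^T S^{-1} (x-mu)
    norm = np.sqrt((2 * np.pi) ** D * np.linalg.det(S))
    return np.exp(-0.5 * quad) / norm

# At the mean of a 2-D standard Gaussian the density is 1/(2*pi) ~ 0.159
print(mvn_pdf(np.zeros(2), np.zeros(2), np.eye(2)))
```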
Gaussian Distribution based AM
• Fit a Gaussian distribution for each vowel W:

$$P_W(O) = N(O \mid \mu_W, \sigma_W^2) = \frac{1}{\sqrt{2\pi\sigma_W^2}} \exp\left(-\frac{(O-\mu_W)^2}{2\sigma_W^2}\right)$$

[Figure: five Gaussians, one fitted to each of the vowels a, e, i, o, u]
We will consider how to fit the distributions later, in the lecture on maximum likelihood estimation.
Gaussian Mixture Model (GMM)
• By mixing multiple Gaussian distributions, a complex distribution can be expressed
• Useful for improving recognition performance

$$GMM(X) = \sum_i w_i\, N(X \mid \mu_i, S_i)$$

• $w_i$: mixture weight, satisfying $\sum_{m=1}^{M} w_m = 1.0$
• $N$: component Gaussian distribution with mean μ_i and covariance S_i
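A sketch of the weighted-sum definition; the two-component weights, means, and variances below are made-up illustration values, not fitted parameters:

```python
import numpy as np

def gmm_pdf(x, weights, mus, variances):
    # Weighted sum of univariate Gaussian components
    comps = [w * np.exp(-(x - m) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)
             for w, m, v in zip(weights, mus, variances)]
    return sum(comps)

# Two-component mixture; the weights sum to 1.0 as required.
print(gmm_pdf(0.5, weights=[0.3, 0.7], mus=[0.0, 1.0], variances=[1.0, 0.25]))
```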
Categorical Distribution
• The distribution is represented by a table
• The probability distribution of a skewed die is an example of a categorical distribution

Vowel:       a    i    u    e    o
Probability: 0.3  0.1  0.2  0.1  0.3
1-of-K Representation
• The same probability as the table description can be represented as an equation by using the 1-of-K representation
Value W | 1-of-K representation W=(w1, w2, w3, w4, w5) | Probability ρ=(ρ1, ρ2, ρ3, ρ4, ρ5)
1 (a)   | 1,0,0,0,0 | Pr(W=1) = ρ1 = 0.3
2 (i)   | 0,1,0,0,0 | Pr(W=2) = ρ2 = 0.1
3 (u)   | 0,0,1,0,0 | Pr(W=3) = ρ3 = 0.2
4 (e)   | 0,0,0,1,0 | Pr(W=4) = ρ4 = 0.1
5 (o)   | 0,0,0,0,1 | Pr(W=5) = ρ5 = 0.3
$$p(W) = \prod_{k=1}^{K} \rho_k^{w_k}$$
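Evaluating the product formula for the table's values (the chosen W here is arbitrary):

```python
import numpy as np

rho = np.array([0.3, 0.1, 0.2, 0.1, 0.3])  # (rho_1 ... rho_5) from the table
w = np.array([0, 0, 1, 0, 0])              # 1-of-K code for W=3 (u)
print(np.prod(rho ** w))                   # = rho_3 = 0.2
```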
Example
• A vowel recognition system based on:
  • Gaussian distribution based acoustic model
  • Categorical distribution based language model

$$\hat{W} = \arg\max_{W \in \{a,i,u,e,o\}} P(O \mid W)\,P(W) = \arg\max_{W \in \{a,i,u,e,o\}} N(O \mid \mu_W, \Sigma_W) \prod_k \rho_k^{w_k}$$
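A minimal sketch of this recognizer, using one-dimensional features for brevity; all means, variances, and priors below are made-up placeholders, not trained values:

```python
import numpy as np

vowels = ["a", "i", "u", "e", "o"]
mu    = {"a": 0.0, "i": 1.0, "u": 2.0, "e": 3.0, "o": 4.0}  # per-vowel mean (assumed)
var   = {v: 1.0 for v in vowels}                            # per-vowel variance (assumed)
prior = {"a": 0.3, "i": 0.1, "u": 0.2, "e": 0.1, "o": 0.3}  # categorical LM

def log_gaussian(o, m, v):
    return -0.5 * (np.log(2 * np.pi * v) + (o - m) ** 2 / v)

def recognize(o):
    # argmax_W P(O|W) P(W), computed in the log domain for numerical stability
    return max(vowels, key=lambda w: log_gaussian(o, mu[w], var[w]) + np.log(prior[w]))

print(recognize(1.2))  # most likely vowel for this single frame
```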
Isolated Phone Recognition
• O: a sequence of feature vectors of a segment of a phone utterance
• W: one of the phones in a language
Example: [Figure: input feature vector sequence → output phone symbol "p"]
Differences from the Frame Level Vowel Recognition
• Acoustic model
  • Input is a sequence of feature vectors
  • Probability must be assigned to variable-length sequences
  • For most phones, the frequency pattern has a structure along the time axis
• Language model
  • The number of phones is larger than the number of vowels (but actually it's not a big difference)
    • # vowels = 5 (for Japanese)
    • # phones ~ 40 (for Japanese)
  • A categorical distribution can be used similarly to the vowel case
Hidden Markov Model (HMM)
• A probabilistic model for sequential data
• Defined by a set of states, state transition probabilities, and state emission probabilities
[Figure: example of an HMM having 2 states; each state i has an emission probability p_{S=i}(o), and the transition probabilities are 0.8, 0.2, 0.4, 0.6]
HMM based Acoustic Model
• Popular HMM design for acoustic modeling:
  • Left-to-right transition structure
  • Non-emitting start and end states
[Figure: example of a 3-state (N=3) left-to-right HMM with non-emitting start state 0 and end state 4; transition probabilities 1.0 (0→1), 0.8 (1→1), 0.2 (1→2), 0.7 (2→2), 0.3 (2→3), 0.9 (3→3), 0.1 (3→4)]
$$P(O) = \sum_{s_1,\dots,s_T} \left[ \prod_{t=1}^{T} P(s_t \mid s_{t-1})\, P(o_t \mid s_t) \right] P(s_{Fin} \mid s_T), \qquad s_0 = 0,\; s_{Fin} = N+1,\; s_t \in \{1,2,\dots,N\}$$
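Summing over all state sequences directly is exponential in T, but the same quantity can be computed efficiently with the forward algorithm. Below is a minimal sketch using the transition probabilities of the 3-state left-to-right example above; the Gaussian emission parameters are made-up placeholders:

```python
import numpy as np

# States 1..3 are emitting; 0 is the non-emitting start state, 4 the end state.
# A[i, j] = P(s_t = j | s_{t-1} = i), following the figure above.
A = np.array([
    [0.0, 1.0, 0.0, 0.0, 0.0],  # start -> state 1
    [0.0, 0.8, 0.2, 0.0, 0.0],
    [0.0, 0.0, 0.7, 0.3, 0.0],
    [0.0, 0.0, 0.0, 0.9, 0.1],  # state 3 -> end with prob 0.1
    [0.0, 0.0, 0.0, 0.0, 0.0],
])
mus, variances = [0.0, 1.0, 2.0], [1.0, 1.0, 1.0]  # emission Gaussians (assumed)

def emit(j, o):
    # P(o | s = j) for emitting states j = 1..3
    v = variances[j - 1]
    return np.exp(-(o - mus[j - 1]) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)

def forward(O):
    alpha = np.zeros(5)
    alpha[0] = 1.0                      # start in the non-emitting state 0
    for o in O:
        new = np.zeros(5)
        for j in range(1, 4):           # emitting states only
            new[j] = alpha @ A[:, j] * emit(j, o)
        alpha = new
    return alpha @ A[:, 4]              # final transition into the end state

print(forward([0.1, 0.3, 1.2, 1.9, 2.1]))  # P(O) for a 5-frame observation
```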
Example
•
An isolated phone recognition system based on
• HMM based acoustic model
• Categorical distribution based language model
{
}
(
)
{
}
(
)
∏
∈ ∈=
=
k w k W phones W phones W kO
HMM
O
W
P
W
ρ
λ
|
max
arg
|
max
arg
ˆ
Modeled by a set of HMMs prepared for each W
Modeled by a
categorical distribution All parameters of an HMM: Transition probabilities and
Isolated Word Recognition
• O: a sequence of feature vectors of a segment of a word utterance
• W: one of the words in a vocabulary
Example: [Figure: input feature vector sequence → output word "hello"]
Difference from the Isolated Phone Recognition
• Basically no difference
  • But the number of words can be much larger than the number of phones, e.g., # phones ~ 40, # words ~ 10000
• Acoustic modeling (there are two popular strategies)
  • Word HMM: prepare an HMM for each word
  • Phone based HMM: prepare a phone HMM set, and compose a word HMM from the phone HMMs
• Language modeling
Phone based Word Modeling
• When the number of words is large, preparing an HMM for each word is difficult since the number of parameters increases
• Phone based modeling composes arbitrary word models by concatenating phone models, as sketched below
[Figure: word models for "hello" and "hero" composed from phone HMMs such as /HH/, /AH/, /OW/]
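As a minimal illustration of the composition idea, the sketch below builds a word model's state sequence by chaining phone HMMs according to a pronunciation lexicon. The pronunciations (CMUdict-style) and the 3-states-per-phone layout are illustrative assumptions:

```python
# Pronunciation lexicon (assumed, CMUdict-style)
lexicon = {
    "hello": ["HH", "AH", "L", "OW"],
    "hero":  ["HH", "IH", "R", "OW"],
}

def compose_word_hmm(word, phone_hmms, lexicon):
    """Chain the phone HMMs in pronunciation order: the non-emitting end
    state of each phone model is merged with the start state of the next."""
    states = []
    for phone in lexicon[word]:
        states.extend(phone_hmms[phone])  # e.g., 3 emitting states per phone
    return states

# With 3-state phone HMMs, "hello" becomes a 12-state left-to-right model:
phones = ["HH", "AH", "L", "OW", "IH", "R"]
phone_hmms = {p: [f"{p}_{i}" for i in range(1, 4)] for p in phones}
print(compose_word_hmm("hello", phone_hmms, lexicon))
```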
Example
• An isolated word recognition system based on:
  • HMM based acoustic model
  • Categorical distribution based language model

$$\hat{W} = \arg\max_{W \in \{\text{words}\}} P(O \mid W)\,P(W) = \arg\max_{W \in \{\text{words}\}} HMM_{\lambda_W}(O) \prod_k \rho_k^{w_k}$$

• P(O|W) is modeled by a set of HMMs prepared for each W
• P(W) is modeled by a categorical distribution
Continuous Word Recognition
• O: a sequence of feature vectors of an utterance
• W: an arbitrary word sequence of an utterance
Example: [Figure: input feature vector sequence → output word sequence "I want to walk"]
Difference from the Isolated Word Recognition
• Drastic increase in complexity
  • Utterance length (# of words in an utterance) L is unknown
  • # categories to recognize is exponential in L, i.e., V^L, where V is the vocabulary size
• Acoustic modeling
  • Phone based HMM: prepare a phone HMM set, and compose an utterance HMM from the phone HMMs
• Language modeling
  • Need to assign probability to word sequences of arbitrary length
N-gram Model
• Assumes that the appearance of a word in an utterance depends on at most the N−1 preceding words as context
• Represented by a set of categorical distributions prepared for each context

$$P(w_1 w_2 \cdots w_T) = \prod_{t=1}^{T} P(w_t \mid w_1 \cdots w_{t-1}) \approx \prod_{t=1}^{T} P(w_t \mid w_{t-N+1} \cdots w_{t-1})$$

N-gram approximation:
Ignore history (or context) older than N-1 words
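As a minimal sketch, the truncated-history product can be written directly. Here `prob(word, context)` stands for a hypothetical lookup into trained N-gram tables (one categorical distribution per context); it is not part of any real library:

```python
def ngram_sentence_prob(words, N, prob):
    # Score a word sequence as a product of conditionals over at most
    # the N-1 preceding words (the N-gram approximation above).
    p = 1.0
    for t, w in enumerate(words):
        context = tuple(words[max(0, t - (N - 1)):t])  # truncated history
        p *= prob(w, context)
    return p

# With N=3 (trigram), the context passed to `prob` is at most two words:
# ngram_sentence_prob(["today", "is", "a", "sunny", "day"], 3, prob)
```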
Unigram
• N = 1
  • Do not consider the history at all
  • Same as the product of individual word probabilities

$$P(w_1 w_2 \cdots w_T) \approx \prod_{t=1}^{T} P(w_t)$$

Example: P("Today is a sunny day") = P(today)P(is)P(a)P(sunny)P(day)
Bi-gram
• N = 2
  • Consider only the previous word as the history

Example: P("Today is a sunny day") = P(today)P(is|today)P(a|is)P(sunny|a)P(day|sunny)
Tri-gram
• N = 3
  • Consider the two previous words as the history
  • Popular in speech recognition

Example: P("Today is a sunny day") = P(today)P(is|today)P(a|today, is)P(sunny|is, a)P(day|a, sunny)
Example
• A continuous word recognition system based on:
  • HMM based acoustic model
  • N-gram based language model

$$\hat{W} = \arg\max_{W \in \{\text{utterances}\}} P(O \mid W)\,P(W) = \arg\max_{W \in \{\text{utterances}\}} HMM_{\lambda_W}(O)\, Ngram(W)$$
Problem: How to Perform argmax?
• For continuous word recognition, the number of utterances is huge, even infinite if the utterance length L is not bounded
  • E.g., if the vocabulary size V is 10000 and the utterance length L is 10, the number of utterances is 10000^10 = 10^40
• Enumerating all the utterances is impossible!!!

$$\hat{W} = \arg\max_{W \in \{\text{utterances}\}} HMM_{\lambda_W}(O)\, Ngram(W)$$

How to do this?
Exercise 1.1
Suppose W is a vowel and O is an MFCC feature vector. Suppose that P_AM(O|W) is an acoustic model and P_LM(W) is a language model. Obtain the vowel Ŵ that maximizes P(W|O) when the acoustic and language model log likelihoods are given as in the following table.

$$\hat{W} = \arg\max_{W \in \{a,i,u,e,o\}} \left\{ P(W \mid O) \right\}$$

Vowel V:     a      i      u      e      o
log P(O|V): -13.4  -10.5  -30.1  -15.2  -17.0
log P(V):   -1.61  -2.30  -1.61  -1.39  -1.39
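A minimal sketch of the required computation: add the acoustic and language model log likelihoods per vowel and take the argmax (running it yields the answer):

```python
log_am = {"a": -13.4, "i": -10.5, "u": -30.1, "e": -15.2, "o": -17.0}  # log P(O|V)
log_lm = {"a": -1.61, "i": -2.30, "u": -1.61, "e": -1.39, "o": -1.39}  # log P(V)

# log P(V|O) = log P(O|V) + log P(V) + const, so the constant can be ignored.
scores = {v: log_am[v] + log_lm[v] for v in log_am}
print(max(scores, key=scores.get), scores)
```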
Exercise 1.2
• The following table defines a bigram P(Word|Context):
C\W today is sunny End
Start 0.6 0.1 0.2 0.1
today 0.1 0.5 0.3 0.1
is 0.1 0.1 0.7 0.1
sunny 0.1 0.1 0.2 0.6
Example: P("Start today is sunny End") = P(Start) × P(today|Start) × P(is|today) × P(sunny|is) × P(End|sunny) = 1.0 × 0.6 × 0.5 × 0.7 × 0.6 = 0.126
*P(Start) = 1.0
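A minimal sketch that encodes the bigram table above and reproduces this worked example:

```python
# Rows are contexts, columns are words, exactly as in the table.
bigram = {
    "Start": {"today": 0.6, "is": 0.1, "sunny": 0.2, "End": 0.1},
    "today": {"today": 0.1, "is": 0.5, "sunny": 0.3, "End": 0.1},
    "is":    {"today": 0.1, "is": 0.1, "sunny": 0.7, "End": 0.1},
    "sunny": {"today": 0.1, "is": 0.1, "sunny": 0.2, "End": 0.6},
}

def sentence_prob(words):
    # P(Start) = 1.0; multiply P(w_t | w_{t-1}) along the sentence.
    p = 1.0
    for context, word in zip(words, words[1:]):
        p *= bigram[context][word]
    return p

print(sentence_prob(["Start", "today", "is", "sunny", "End"]))  # -> 0.126
```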
Exercise 1.2 (Cont.)
• Based on the bigram definition of the previous slide, compute the probability of the following sentences:
1) P(“Start today sunny today sunny End”)
= 0.6*0.3*0.1*0.3*0.6 = 0.00324
2) P(“Start today today sunny sunny End”)