Introduction to Automatic Speech Recognition


(1)

Speech and Language Processing

Lecture 1

Speech recognition based on GMM, HMM, and N-gram

Information and Communications Engineering Course

(2)

Lecture Plan (Shinozaki’s part)

1. Speech recognition based on GMM, HMM, and N-gram

2. Maximum likelihood estimation and EM algorithm

3. Bayesian network and Bayesian inference

4. Variational inference and sampling

5. Neural network based acoustic and language models

6. Weighted finite state transducer (WFST) and speech decoding

(3)

Handouts

All the materials are available at my home page:

(4)

Perception of Speech Sound

Speech sound is very special for humans. You may feel it is completely different from non-speech sound.

However, it is still a sound made by a physical process.

Frequency and time patterns are important.

Speech sound pronouncing “Onsei”

Cut the region near “n” and

(5)

Utterance Generation

Figure: the speech production organs (lungs, trachea, vocal cords, mouth).

The air flows through the pipe; the vocal-cord membranes vibrate and generate a beep-like source sound G(ω).

The transfer characteristic H(ω) is modulated by moving the mouth.

(6)

Spectral Analysis

Fourier transform

Figure: log power spectrum of a speech frame; frequency axis 0–4000 Hz.
(7)

Spectral Envelope

Spectral envelope corresponds to H(ω) and contains

phonological information

(8)

Vowels and Spectral Envelopes

Figure: spectral envelopes of the vowels /a/ and /i/.

(9)

Experiment: Replacing the Sound Source

1. Record a voice

2. Analyze the voice, and decompose it into the sound source G(ω) and the transfer characteristic H(ω): X(ω) (original voice) = G(ω) × H(ω)

4. Replace the sound source G with another one G'

5. Compute G'(ω) × H(ω) and re-generate the waveform (e.g., G' is a sawtooth wave)

(10)

Synthesized Voice Changing

Sawtooth G' (100 Hz): G'(ω)H(ω)

Sawtooth G' (300 Hz): G'(ω)H(ω)

Sawtooth G' (500 Hz): G'(ω)H(ω)

Music G' (Acoustic11*): G'(ω)H(ω)

*The music is from: http://maoudamashii.jokersounds.com/

(11)

History of Speech Technology

Timeline:

1791 Kempelen's talking machine
1857 Phonautograph
1876 Telephone
Radio broadcast
1920s Radio Rex
1939 Vocoder
1946 ENIAC
1952 Speech recognizer (digit recognition)
1969 Internet
1990 WWW
2011 Siri on iOS

(12)
(13)

Applications of Speech Recognition

Smartphone

• Voice assistance

• Speech-to-speech translation

Judge

• Speech retrieval system to support citizen judges

Television

• Automatic captioning system

Car navigation

• Voice commands

Toy robots

(14)

Feature Extraction

Extract useful information from the input signal in a

convenient form for pattern recognition

Helps improve pattern recognition performance

Reduce unnecessary memory and processing costs

Figure: waveform over time → feature extraction → feature sequence.

(15)

Mel-Frequency Cepstrum (MFC)

Figure: mel-scale filter bank.

Pipeline: speech sound (sequence of waveform sample values) → Windowing → |DFT| → Mel filter bank → Log → IDFT → Liftering → MFC features

• Widely used features for speech recognition
• Emulates the perceptual scale of pitches
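A minimal sketch of this pipeline for a single frame is shown below. It is not the lecture's implementation: the triangular mel filter bank construction is a common simplified version, and the parameter choices (24 filters, 13 retained coefficients) are assumptions for illustration.

```python
# Minimal sketch (assumed parameters): MFC-style features for one frame,
# following Windowing -> |DFT| -> mel filter bank -> Log -> IDFT -> Liftering.
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    # Triangular filters spaced evenly on the mel scale (simplified construction).
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return fbank

fs, n_fft = 16000, 512
frame = np.random.randn(n_fft)                        # placeholder speech frame
windowed = frame * np.hamming(n_fft)                  # windowing
power = np.abs(np.fft.rfft(windowed)) ** 2            # |DFT|^2
mel_energies = mel_filterbank(24, n_fft, fs) @ power  # mel filter bank
log_mel = np.log(mel_energies + 1e-10)                # log
cepstrum = np.fft.ifft(log_mel).real                  # IDFT to the cepstral domain
mfc = cepstrum[:13]                                   # liftering: keep low-quefrency terms
print(mfc)
```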

(16)

Typical Feature Extraction Process

16 kHz sampling, 16-bit quantization

Window width: 32 ms (= 512 samples at 16 kHz)
Shift: 10 ms

Feature sequence: a sequence of real-valued vectors at a rate of 100 Hz.
Each vector is called a “frame”.
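As a rough illustration (assumed, not from the handout), framing with these parameters can be written as:

```python
# Minimal sketch: slice a 16 kHz waveform into 32 ms frames with a 10 ms shift,
# giving roughly 100 frames per second.
import numpy as np

fs = 16000
window = int(0.032 * fs)   # 512 samples
shift = int(0.010 * fs)    # 160 samples

signal = np.random.randn(fs)   # placeholder: 1 second of speech
n_frames = 1 + (len(signal) - window) // shift
frames = np.stack([signal[i * shift : i * shift + window] for i in range(n_frames)])

print(frames.shape)   # (97, 512) for 1 s of audio, i.e. about 100 frames per second
```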

(17)

Speech Decoding

O: Input acoustic features (or a feature sequence)
W: A symbol (or a symbol sequence) to recognize, e.g. phone, word, word sequence, etc.

Figure: input O → pattern recognition → recognized symbol W.

(18)

Statistical Speech Recognition

Use probability distribution to model speech sounds

\hat{W} = \arg\max_{W} P(W \mid O)

(19)

Acoustic Model and Language Model

By using Bayes' theorem, the probability is decomposed into two parts:

P(W \mid O) = \frac{P(O \mid W)\, P(W)}{P(O)}

P(O) is independent of the maximization over W, and can be ignored:

\hat{W} = \arg\max_{W} P(O \mid W)\, P(W)

(20)

Problem Settings

Frame-wise vowel recognition

O: feature vector of a single frame

W: One of the vowels (For Japanese: a,i,u,e,o)

Isolated phone recognition

O: A sequence of feature vectors of a segment of

phone utterance

W: One of the phones

Isolated word recognition

O: A sequence of feature vectors of a segment of

word utterance

W: One of the words in a vocabulary

Continuous word recognition

(21)

Frame-wise Vowel Recognition

\hat{W} = \arg\max_{W} P(W \mid O) = \arg\max_{W} \frac{P(O \mid W)\, P(W)}{P(O)} = \arg\max_{W} P(O \mid W)\, P(W)

W ∈ {a, i, u, e, o}
O: a feature vector

The conditional probability can be modeled by preparing a distribution P_W(O) separately for each vowel.

Figure: frame sequence over time labeled a, i, u, i.

(22)

Acoustic and Language Modeling

Acoustic model (AM): P_W(O), the probability distribution of O given W. Ex.: Gaussian distribution, Gaussian mixture model, etc.

Language model (LM): P(W), the probability distribution of W. Ex.: categorical distribution, N-gram model, etc.

(23)

Gaussian Distribution

Defined by two parameters: the mean μ and the standard deviation σ (σ² is the variance)

N(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right)

It satisfies:

0 < N(x \mid \mu, \sigma^2), \qquad \int_{-\infty}^{\infty} N(x \mid \mu, \sigma^2)\, dx = 1

Figure: plots of example Gaussian densities.
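A small numerical check of these two properties, with an assumed mean and variance:

```python
# Minimal sketch (assumed parameters): the Gaussian density is positive and
# integrates to 1.
import numpy as np

def gaussian(x, mu, sigma2):
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

x = np.linspace(-10, 10, 20001)
p = gaussian(x, mu=0.5, sigma2=1.5)

print(p.min() > 0)      # strictly positive everywhere
print(np.trapz(p, x))   # numerical integral, close to 1
```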

(24)

Multivariate Gaussian Distribution

For a D-dimensional vector x, it is defined using a mean vector μ and a covariance matrix Σ:

N(\mathbf{x} \mid \boldsymbol{\mu}, \Sigma) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (\mathbf{x}-\boldsymbol{\mu})^{\mathsf T} \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu}) \right)

|Σ| denotes the determinant of Σ

Figure: contour plot of an example two-dimensional Gaussian distribution.
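A direct translation of this formula into code, with made-up values for μ, Σ, and x:

```python
# Minimal sketch (assumed values): evaluate the multivariate Gaussian density.
import numpy as np

def mvn_pdf(x, mu, cov):
    d = len(mu)
    diff = x - mu
    norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff)

mu = np.array([0.0, 1.0])
cov = np.array([[1.0, 0.3],
                [0.3, 2.0]])
x = np.array([0.5, 0.5])
print(mvn_pdf(x, mu, cov))
```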

(25)

Gaussian Distribution based AM

Fit a Gaussian distribution for each vowel

P_w(O) = N(O \mid \mu_w, \sigma_w^2) = \frac{1}{\sqrt{2\pi\sigma_w^2}} \exp\left( -\frac{(O-\mu_w)^2}{2\sigma_w^2} \right)

Figure: one Gaussian fitted for each vowel a, i, u, e, o.

We will consider this problem later, in the lecture on maximum likelihood estimation and the EM algorithm.

(26)

Gaussian Mixture Model (GMM)

• By mixing multiple Gaussian distributions, a complex distribution can be expressed
• Useful to improve recognition performance

\mathrm{GMM}(X) = \sum_i w_i\, N(X \mid \mu_i, S_i), \qquad \sum_i w_i = 1.0

w_i: mixture weight
N(X \mid \mu_i, S_i): component Gaussian distribution with mean μ_i and covariance S_i
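The following sketch evaluates a one-dimensional GMM with made-up weights, means, and variances; it only illustrates the formula above, not any model from the lecture.

```python
# Minimal sketch (toy parameters): GMM(x) = sum_i w_i * N(x | mu_i, var_i).
import numpy as np

def gaussian(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

weights = np.array([0.5, 0.3, 0.2])     # mixture weights, sum to 1
means = np.array([-2.0, 0.0, 3.0])
variances = np.array([1.0, 0.5, 2.0])

def gmm_pdf(x):
    return sum(w * gaussian(x, m, v) for w, m, v in zip(weights, means, variances))

x = np.linspace(-10, 12, 20001)
print(np.trapz(gmm_pdf(x), x))   # the mixture still integrates to about 1
```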

(27)

Categorical Distribution

The distribution is represented by a table

The probability distribution of a skewed die is an example of a categorical distribution

Vowel:       a    i    u    e    o
Probability: 0.3  0.1  0.2  0.1  0.3

(28)

1-of-K Representation

The same probability as the table description can

be represented as an equation by using 1-of-K

representation

Value W    1-of-K representation W = (w1, w2, w3, w4, w5)    Probability ρ = (ρ1, ρ2, ρ3, ρ4, ρ5)

1 (a) 1,0,0,0,0 Pr(W=1)=ρ1=0.3

2 (i) 0,1,0,0,0 Pr(W=2)=ρ2=0.1

3 (u) 0,0,1,0,0 Pr(W=3)=ρ3=0.2

4 (e) 0,0,0,1,0 Pr(W=4)=ρ4=0.1

5 (o) 0,0,0,0,1 Pr(W=5)=ρ5=0.3

p(W) = \prod_{k=1}^{K} \rho_k^{\,w_k}
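For instance (using the table's probabilities), the product formula picks out exactly one entry of ρ:

```python
# Minimal sketch: categorical probability via the 1-of-K product p(W) = prod_k rho_k ** w_k.
import numpy as np

rho = np.array([0.3, 0.1, 0.2, 0.1, 0.3])   # (a, i, u, e, o)
w = np.array([0, 0, 1, 0, 0])               # 1-of-K encoding of W = 3 (u)

print(np.prod(rho ** w))                    # 0.2, the probability of "u"
```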

(29)

Example

A vowel recognition system based on

• Gaussian distribution based acoustic model
• Categorical distribution based language model

\hat{W} = \arg\max_{W \in \{a,i,u,e,o\}} P(W \mid O) = \arg\max_{W \in \{a,i,u,e,o\}} \prod_k \left\{ \rho_k\, N(O \mid \mu_k, \Sigma_k) \right\}^{w_k}

(w_k is the 1-of-K encoding of W, so the product selects ρ_W N(O | μ_W, Σ_W))
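A minimal sketch of this recognizer follows; all parameters (priors ρ, per-vowel means and covariances, and the input frame) are made-up placeholders, and the argmax is done in the log domain for numerical stability.

```python
# Minimal sketch (toy parameters): frame-wise vowel recognition with a Gaussian
# acoustic model per vowel and a categorical language model,
# i.e. argmax over vowels of rho_W * N(O | mu_W, Sigma_W).
import numpy as np

def log_mvn(x, mu, cov):
    d = len(mu)
    diff = x - mu
    return -0.5 * (d * np.log(2 * np.pi) + np.log(np.linalg.det(cov))
                   + diff @ np.linalg.inv(cov) @ diff)

vowels = ["a", "i", "u", "e", "o"]
log_rho = np.log([0.3, 0.1, 0.2, 0.1, 0.3])           # categorical LM (assumed)

rng = np.random.default_rng(0)
dim = 13                                               # e.g. 13 MFC coefficients
means = {v: rng.normal(size=dim) for v in vowels}      # assumed AM parameters
covs = {v: np.eye(dim) for v in vowels}

O = rng.normal(size=dim)                               # one input feature frame
scores = [log_rho[i] + log_mvn(O, means[v], covs[v]) for i, v in enumerate(vowels)]
print(vowels[int(np.argmax(scores))])
```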

(30)

Isolated Phone Recognition

O: A sequence of feature vectors of a segment of a phone utterance
W: One of the phones in a language

Example: input feature vector sequence → output phone symbol (e.g., /p/)

(31)

Differences from the Frame Level Vowel Recognition

Acoustic model
• Input is a sequence of feature vectors
• Probability must be assigned to variable-length sequences
• For most phones, the frequency pattern has a structure along the time axis

Language model
• The number of phones is larger than the number of vowels (but actually it is not a big difference)
  • # Vowels = 5 (for Japanese)
  • # Phones ≈ 40 (for Japanese)
• A categorical distribution can be used similarly to the frame-wise vowel recognition case

(32)

Hidden Markov Model (HMM)

A probabilistic model for sequential data

Defined by a set of states, state transition

probabilities, and state emission probabilities

Figure: example of an HMM with 2 states; each state has an emission probability p_{S=1}(o), p_{S=2}(o), and the transition probabilities are 0.8, 0.2, 0.4, 0.6.

(33)

HMM based Acoustic Model

Popular HMM design for acoustic modeling

• Left-to-right transition structure
• Non-emitting start and end states

Figure: example of a 3-state (N=3) left-to-right HMM with non-emitting start state 0 and end state 4; transition probabilities 1.0, 0.8, 0.2, 0.7, 0.3, 0.9, 0.1.

P(O) = \sum_{s_1,\dots,s_T} \left[ \prod_{t=1}^{T} P(o_t \mid s_t)\, P(s_t \mid s_{t-1}) \right] P(s_{T+1} = \mathrm{Fin} \mid s_T),
\qquad s_0 = 0,\ \ \mathrm{Fin} = N+1,\ \ s_t \in \{1, 2, \dots, N\}
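This sum over all state sequences can be computed efficiently with the forward algorithm. The sketch below uses the transition probabilities from the 3-state figure above, with assumed one-dimensional Gaussian emission densities; it is an illustration, not the lecture's code.

```python
# Minimal sketch (assumed emission parameters): P(O) for a left-to-right HMM with
# a non-emitting start state (0) and end state (4), via the forward algorithm.
import numpy as np

# trans[i, j] = P(next state j | current state i), following the figure's values.
trans = np.array([
    [0.0, 1.0, 0.0, 0.0, 0.0],   # start -> state 1
    [0.0, 0.8, 0.2, 0.0, 0.0],   # state 1: self-loop 0.8, advance 0.2
    [0.0, 0.0, 0.7, 0.3, 0.0],   # state 2: self-loop 0.7, advance 0.3
    [0.0, 0.0, 0.0, 0.9, 0.1],   # state 3: self-loop 0.9, exit 0.1
    [0.0, 0.0, 0.0, 0.0, 0.0],   # end state (non-emitting)
])

def emit(state, o):
    # Assumed 1-D Gaussian emission density for each emitting state.
    mus, variances = [0.0, 1.0, 2.0], [1.0, 1.0, 1.0]
    mu, var = mus[state - 1], variances[state - 1]
    return np.exp(-(o - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def forward(O):
    alpha = trans[0, 1:4] * np.array([emit(s, O[0]) for s in (1, 2, 3)])
    for o in O[1:]:
        alpha = (alpha @ trans[1:4, 1:4]) * np.array([emit(s, o) for s in (1, 2, 3)])
    return float(alpha @ trans[1:4, 4])   # finish by entering the end state

O = [0.2, 0.9, 1.8, 2.1]
print(forward(O))
```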

(34)

Example

An isolated phone recognition system based on

• HMM based acoustic model

• Categorical distribution based language model

\hat{W} = \arg\max_{W \in \mathrm{phones}} P(W \mid O) = \arg\max_{W \in \mathrm{phones}} \prod_k \left\{ \rho_k\, \mathrm{HMM}_{\lambda_k}(O) \right\}^{w_k}

P(O \mid W): modeled by a set of HMMs, one prepared for each W
P(W): modeled by a categorical distribution
λ: all the parameters of an HMM (transition probabilities and emission probabilities)

(35)

Isolated Word Recognition

O: A sequence of feature vectors of a segment of a word utterance
W: One of the words in a vocabulary

Example: input feature vector sequence → output word (e.g., “hello”)

(36)

Difference from the Isolated Phone Recognition

Basically no difference

• But the number of words can be much larger than the

number of phones

E.g. # phones ~ 40, # words ~10000

Acoustic modeling (There are two popular strategies)

• Word HMM:

prepare an HMM for each word

• Phone based HMM:

prepare a phone HMM set, and compose a word HMM from the phone HMMs

Language modeling

(37)

Phone based Word Modeling

When # of words is large, preparing an HMM for each

word is difficult since # of parameters increases

Phone based modeling composes arbitrary word models by concatenating phone models

Figure: word models for “hello” and “hero” composed by concatenating phone HMMs such as /HH/, /AH/, /OW/.

(38)

Example

An isolated word recognition system based on

• HMM based acoustic model

• Categorical distribution based language model

\hat{W} = \arg\max_{W \in \mathrm{words}} P(W \mid O) = \arg\max_{W \in \mathrm{words}} \prod_k \left\{ \rho_k\, \mathrm{HMM}_{\lambda_k}(O) \right\}^{w_k}

P(O \mid W): modeled by a set of HMMs, one prepared for each W
P(W): modeled by a categorical distribution

(39)

Continuous Word Recognition

O: A sequence of feature vectors of an utterance
W: An arbitrary word sequence for the utterance

Example: input feature vector sequence → output word sequence (e.g., “I want to walk”)

(40)

Difference from the Isolated Word Recognition

Drastic increase in complexity

• Utterance length (# of words in an utterance) L is unknown
• # categories to recognize is exponential in L, i.e., V^L, where V is the vocabulary size

Acoustic modeling

• Phone based HMM:

prepare a phone HMM set, and compose an utterance HMM from the phone HMMs

Language modeling

• Need to assign probability to word sequences of arbitrary length

(41)

N-gram Model

• Assumes that the appearance of a word in an utterance depends on at most the N-1 preceding words as context
• Represented by a set of categorical distributions prepared for each context

P(w_1 w_2 w_3 \cdots w_T) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1 w_2)\, P(w_4 \mid w_1 w_2 w_3) \cdots P(w_T \mid w_1 \cdots w_{T-1})

N-gram approximation: ignore history (or context) older than N-1 words:

P(w_t \mid w_1 \cdots w_{t-1}) \approx P(w_t \mid w_{t-N+1} \cdots w_{t-1})

(42)

Unigram

N=1

• Do not consider the history at all

• Same as the product of individual word probabilities

P(w_1 w_2 w_3 \cdots w_T) \approx P(w_1)\, P(w_2)\, P(w_3) \cdots P(w_T) = \prod_{t=1}^{T} P(w_t)

P(“Today is a sunny day”) =

P(“today”)P(“is”)P(“a”)P(“sunny”)P(“day”)

(43)

Bi-gram

N=2

• Consider only the previous word as the history

P(“Today is a sunny day”) =

P(today)P(is|today)P(a|is)P(sunny|a)P(day|sunny)

Example

(44)

Tri-gram

N=3

• Consider the two previous words as the history
• Popular in speech recognition

P(“Today is a sunny day”) =

P(today)P(is|today)P(a|today, is)P(sunny|is, a)P(day|a, sunny)

Example

(45)

Example

A continuous word recognition system based on

• HMM based acoustic model
• N-gram based language model

\hat{W} = \arg\max_{W \in \mathrm{utterances}} P(W \mid O) = \arg\max_{W \in \mathrm{utterances}} \mathrm{HMM}_{\lambda_W}(O)\, \mathrm{Ngram}(W)

(46)

Problem: How to Perform argmax?

For continuous word recognition, the number of utterances is huge, even infinite if the utterance length L is not bounded

• E.g., if the vocabulary size V is 10000 and the utterance length L is 10, the number of utterances is 10000^10 = 10^40

Enumerating all the utterances is impossible!!!

\hat{W} = \arg\max_{W \in \mathrm{utterances}} \mathrm{HMM}_{\lambda_W}(O)\, \mathrm{Ngram}(W)

How to do this?

(47)

Exercise 1.1

Suppose W is a vowel and O is an MFCC feature vector.

Suppose that P_AM(O|W) is an acoustic model and P_LM(W) is a language model. Obtain the vowel Ŵ that maximizes P(W|O) when the acoustic and language model log likelihoods are given as in the following table.

\hat{W} = \arg\max_{W \in \{a,i,u,e,o\}} P(W \mid O)

Vowel V        a       i       u       e       o
log(P(O|V))   -13.4   -10.5   -30.1   -15.2   -17.0
log(P(V))     -1.61   -2.30   -1.61   -1.39   -1.39

(48)

Exercise 1.2

The following table defines a Bi-gram P(Word|Context)

Context \ Word   today   is     sunny   End
Start            0.6     0.1    0.2     0.1
today            0.1     0.5    0.3     0.1
is               0.1     0.1    0.7     0.1
sunny            0.1     0.1    0.2     0.6

Example: P(“Start today is sunny End”) = 1.0 × 0.6 × 0.5 × 0.7 × 0.6 = 0.126   (*P(Start) = 1.0)
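The table-lookup computation in the example can be scripted as follows; this is just an illustrative sketch, not part of the exercise handout.

```python
# Minimal sketch: sentence probability under the bigram table above.
bigram = {
    "Start": {"today": 0.6, "is": 0.1, "sunny": 0.2, "End": 0.1},
    "today": {"today": 0.1, "is": 0.5, "sunny": 0.3, "End": 0.1},
    "is":    {"today": 0.1, "is": 0.1, "sunny": 0.7, "End": 0.1},
    "sunny": {"today": 0.1, "is": 0.1, "sunny": 0.2, "End": 0.6},
}

def sentence_prob(words):
    # words includes "Start" and "End"; P(Start) = 1.0 by definition.
    p = 1.0
    for context, word in zip(words[:-1], words[1:]):
        p *= bigram[context][word]
    return p

print(sentence_prob(["Start", "today", "is", "sunny", "End"]))   # 0.126
```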

(49)

Exercise 1.2 (Cont.)

Based on the bigram definition of the previous slide,

compute the probability of the following sentences

1) P(“Start today sunny today sunny End”)

= 0.6*0.3*0.1*0.3*0.6 = 0.00324

2) P(“Start today today sunny sunny End”)
