
Introduction to Automatic Speech Recognition


(1)

Speech and Language Processing

Lecture 6

Weighted finite state transducer (WFST) and speech decoding

Information and Communications Engineering Course

(2)

Lecture Plan (Shinozaki’s part)

1. 10/20 (remote): Speech recognition based on GMM, HMM, and N-gram
2. 10/27 (remote): Maximum likelihood estimation and EM algorithm
3. 11/3 (remote): Bayesian network and Bayesian inference
4. 11/6 (@TAIST): Variational inference and sampling
5. 11/7 (@TAIST): Neural network based acoustic and language models
6. 11/7 (@TAIST): Weighted finite state transducer (WFST) and speech decoding

(3)

Today’s Topic

• Answers for the previous exercises

• Weighted finite state transducer (WFST)

(4)
(5)

Exercise 5.1

• Let h be a softmax function having inputs z1, z2, …, zN:

  h(z_i) = \frac{\exp(z_i)}{\sum_j \exp(z_j)}

• Prove that

  \sum_{i=1}^{N} h(z_i) = 1.0
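A short proof sketch using the definition above:

  \sum_{i=1}^{N} h(z_i) = \sum_{i=1}^{N} \frac{\exp(z_i)}{\sum_{j=1}^{N} \exp(z_j)} = \frac{\sum_{i=1}^{N} \exp(z_i)}{\sum_{j=1}^{N} \exp(z_j)} = 1.0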

(6)

Exercise 5.2

• When h(y) and y(x) are given as follows, obtain ∂h/∂x:

  h(y) = \frac{1}{1 + \exp(-y)}, \qquad y = ax + b
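Answer sketch via the chain rule:

  \frac{\partial h}{\partial y} = \frac{\exp(-y)}{(1 + \exp(-y))^{2}} = h(y)\,(1 - h(y)), \qquad \frac{\partial y}{\partial x} = a

  \frac{\partial h}{\partial x} = \frac{\partial h}{\partial y}\,\frac{\partial y}{\partial x} = h(ax + b)\,\bigl(1 - h(ax + b)\bigr)\,a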

(7)
(8)

Weighted Finite State Acceptor (WFSA)

• Defined by a set of nodes, arcs, arc labels, and arc weights

• An extension of finite state automaton

• Accepts a sequence of symbols (string) assigning a weight to the sequence

(Figure: an example WFSA, with its nodes, arc labels, and arc weights indicated.)
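As an illustration (not from the slides), here is a minimal Python sketch of a WFSA stored as a list of weighted, labeled arcs; the states, labels, and weights are made up, and arc weights are simply added along the accepting path:

# A minimal WFSA sketch (illustrative only; states, labels, and weights are made up).
# Each arc is (source state, label, weight, destination state).
arcs = [
    (0, "a", 0.5, 1),
    (1, "b", 0.1, 2),
]
initial_state, final_states = 0, {2}

def accept_weight(symbols):
    """Follow arcs for the given symbols (assuming a deterministic WFSA),
    accumulating arc weights by addition along the path."""
    state, weight = initial_state, 0.0
    for sym in symbols:
        matches = [(w, dst) for (src, lab, w, dst) in arcs
                   if src == state and lab == sym]
        if not matches:
            return None          # string is not accepted
        w, state = matches[0]
        weight += w
    return weight if state in final_states else None

print(accept_weight(["a", "b"]))  # 0.6
print(accept_weight(["b"]))       # None (not accepted)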

(9)

Weight of Path (Tropical Weight)

• Sequential arcs a/w1 followed by b/w2: the weight of accepting "a,b" is
  w1 ⊗ w2 = w1 + w2
• Two parallel arcs a/w1 and a/w2: the weight of accepting "a" is
  w1 ⊕ w2 = min(w1, w2)
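A small sketch of how these two operations behave under the tropical weight (⊗ = +, ⊕ = min); the weights 0.5 and 0.1 are just example values:

# Tropical semiring operations (sketch).
def t_times(w1, w2):
    # Combines weights along one path (sequential arcs).
    return w1 + w2

def t_plus(w1, w2):
    # Combines weights of alternative paths for the same string.
    return min(w1, w2)

# Arcs a/w1 then b/w2 in sequence: weight of accepting "a,b"
print(t_times(0.5, 0.1))   # 0.6

# Two parallel arcs a/w1 and a/w2: weight of accepting "a"
print(t_plus(0.5, 0.1))    # 0.1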

(10)

Example

(Figure: a WFSA with an initial state and a final state; its arcs include a/0.5 and b/0.1.)

(11)

Variations of Weight Types

Name     | Set      | Plus (⊕)            | Times (⊗) | Zero | One
---------+----------+---------------------+-----------+------+----
Boolean  | {0, 1}   | ∨                   | ∧         | 0    | 1
Real     | [0, ∞)   | +                   | ×         | 0    | 1
Log      | [-∞, ∞]  | -log(e^-x + e^-y)   | +         | ∞    | 0
Tropical | [-∞, ∞]  | min                 | +         | ∞    | 0
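As a sketch (not part of the slides), the table's weight types can be written as (plus, times, zero, one) tuples in Python; the Log "plus" uses the -log(e^-x + e^-y) formula shown in the table:

import math

# (plus, times, zero, one) for each weight type in the table (a sketch).
boolean  = (lambda x, y: x or y,        lambda x, y: x and y, 0, 1)
real     = (lambda x, y: x + y,         lambda x, y: x * y,   0.0, 1.0)
log      = (lambda x, y: -math.log(math.exp(-x) + math.exp(-y)),
            lambda x, y: x + y,         math.inf, 0.0)
tropical = (lambda x, y: min(x, y),     lambda x, y: x + y,   math.inf, 0.0)

# Example: log-semiring "plus" of two negative-log probabilities
plus, times, zero, one = log
print(plus(-math.log(0.4), -math.log(0.6)))   # ≈ -log(0.4 + 0.6) = 0.0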

(12)

Weighted Finite State Transducer

• Defined by a set of nodes, arcs, input/output arc labels, and arc weights

• An extension of finite state acceptor

• Transduces an input string to an output string, assigning a weight

(Figure: an example WFST arc, showing its input symbol, output symbol, and weight.)

(13)

Example

(Figure: a WFST with an initial state and a final state; its arcs include a:B/0.5 and b:C/0.1.)

It transduces the input "a, a, a, b" to the output "A, B, B, B" with weight 1.4.

(14)

Null Symbol ε

• Input symbol is ε: No input

• Output symbol is ε: No output

(Figure: example arcs using ε labels: ε:a/0.5, C:ε/0.2, C:c/0.3, B:b/1.0, A:a/1.0.)

(15)

Inversion of WFST

• Swap input symbol and output symbol

Original WFST:  a1:b1/0.2, a1:b2/0.8, a2:b1/0.3, a2:b2/0.7

Inverted WFST:  b1:a1/0.2, b2:a1/0.8, b1:a2/0.3, b2:a2/0.7
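A minimal sketch of inversion in Python, using the arcs from this slide (all arcs assumed, for simplicity, to go from one start state to one final state):

# Inversion: swap the input and output label on every arc.
# Each arc is (src, input, output, weight, dst); values taken from the slide.
arcs = [
    (0, "a1", "b1", 0.2, 1),
    (0, "a1", "b2", 0.8, 1),
    (0, "a2", "b1", 0.3, 1),
    (0, "a2", "b2", 0.7, 1),
]

inverted = [(src, o, i, w, dst) for (src, i, o, w, dst) in arcs]
print(inverted)   # [(0, 'b1', 'a1', 0.2, 1), (0, 'b2', 'a1', 0.8, 1), ...]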

(16)

Equivalence of WFST

• Two transducers are equivalent if for each input string, they produce the same output strings with the same weight

(Figure: two equivalent WFSTs: one with a single arc a:A/0.5, the other with an arc a:ε/0.2 followed by ε:A/0.3.)

(17)

Determinization of WFST

• Produces an equivalent WFST in which no state has two outgoing transitions with the same input label

(Figure: an example WFST (A), with arcs such as a:ε/1.0.)

(18)

Exercise 6.1

(19)

Composition of WFSTs

• WFST1 transduces string x to string y with weight w1

• WFST2 transduces string y to string z with weight w2

• Composition of WFST1 and WFST2 (WFST1・WFST2) makes a WFST that transduces x to z with weight w1 ⊗ w2

(Figure: WFST (A) with arcs a:A/0.4, ε:ε/0.6, b:B/1.0; WFST (B) with arc B:い/0.2; their composition WFST (C) with arcs such as b:い/1.8.)
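A simplified composition sketch in Python for epsilon-free WFSTs (illustrative only; real composition, e.g. in OpenFST, also has to handle ε labels such as the ε:ε/0.6 arc in WFST (A) above). The toy transducers below are hypothetical, and weights combine by addition as in the tropical weight:

# Compose two epsilon-free WFSTs given as lists of arcs (src, in, out, weight, dst).
def compose(arcs1, arcs2):
    out = []
    for (s1, i1, o1, w1, d1) in arcs1:
        for (s2, i2, o2, w2, d2) in arcs2:
            if o1 == i2:   # output of WFST1 must match input of WFST2
                out.append(((s1, s2), i1, o2, w1 + w2, (d1, d2)))
    return out

# Hypothetical toy transducers
wfst1 = [(0, "a", "A", 0.4, 1), (0, "b", "B", 1.0, 1)]
wfst2 = [(0, "B", "x", 0.2, 1)]
print(compose(wfst1, wfst2))
# [((0, 0), 'b', 'x', 1.2, (1, 1))]

The same idea, extended with ε handling, could also be used to check Exercise 6.2 on the next slide by enumerating inputs.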

(20)

Exercise 6.2

• Check that WFST (C) in the previous slide is equivalent to WFST (A)・WFST (B) by enumerating all possible inputs

(21)

Probability Distribution and WFST

Random variables A and B:

P(A):      A=a1: 0.4   A=a2: 0.6

P(B|A):          B=b1   B=b2
        A=a1      0.2    0.8
        A=a2      0.3    0.7

wfstA (represents P(A)):   arcs a1:a1/0.4, a2:a2/0.6
wfstB (represents P(B|A)): arcs a1:b1/0.2, a1:b2/0.8, a2:b1/0.3, a2:b2/0.7

(22)

Composition and Joint Probability

• Composing WFST A (representing P(A), with arcs a1:a1/0.4 and a2:a2/0.6) with WFST B (representing P(B|A)) gives WFST A・B, whose path weights represent the joint probability P(A,B)

• For a fixed input value A=a, the weights of A・B are proportional to P(B|A=a)
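A numeric sketch (not from the slides) checking that the composed weights behave like the joint probability, using the table values from the previous slide; here the weights are plain probabilities, so "times" is multiplication:

# Joint probability from composition: weight(a, b) = P(A=a) * P(B=b | A=a).
p_a = {"a1": 0.4, "a2": 0.6}
p_b_given_a = {("a1", "b1"): 0.2, ("a1", "b2"): 0.8,
               ("a2", "b1"): 0.3, ("a2", "b2"): 0.7}

joint = {(a, b): p_a[a] * p_b_given_a[(a, b)] for (a, b) in p_b_given_a}
print(joint)                  # P(A,B) for every pair (a, b)
print(sum(joint.values()))    # 1.0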

(23)
(24)

WFST Representation of N-gram

Uni-gram P(w)

(Figure: a WFST representing the uni-gram, with arcs labeled like Start:wN/P(wN|Start).)

(25)

WFST Representation of HMM

(Figure: a WFST representing the HMM for phone "a", with states s0 through s5 and arcs such as a_s1:a, a_s1:ε/-log(0.8), a_s2:ε/-log(0.7), a_s3:ε/-log(0.9), and ε:ε.)
(26)

Exercise 6.3

• Make a WFST that represents the following bi-gram

P(CW | previous word), with rows = previous word and columns = current word CW:

           today   is    End
  Start     0.6    0.3   0.1
  today     0.2    0.5   0.3
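One possible construction, sketched in Python (an illustration, not the official answer to the exercise): use one state per history word and an arc w:w/P(w | history) for every table entry, with "End" leading to the final state:

# Bi-gram table: P(current word | previous word), values from the exercise.
bigram = {
    "Start": {"today": 0.6, "is": 0.3, "End": 0.1},
    "today": {"today": 0.2, "is": 0.5, "End": 0.3},
}

# Arcs (history state, input word, output word, weight, next state).
arcs = []
for history, row in bigram.items():
    for word, prob in row.items():
        arcs.append((history, word, word, prob, word))

for arc in arcs:
    print(arc)   # e.g. ('Start', 'today', 'today', 0.6, 'today')
# In a real WFST the weights would usually be stored as -log probabilities,
# and the "End" state would be marked as final.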

(27)

WFST To Recognize Phone Sequence

Input: Sequence of HMM state names

a_s1:a/1.0

a_s2:ε/0.2 a_s1:ε/0.8

a_s3:ε/0.3 ε:ε/0.1

a_s2:ε/0.7 a_s3:ε/0.9

End i_s1:i/1.0 i_s2:ε/0.2

i_s1:ε/0.8

i_s3:ε/0.3 ε:ε/0.1

i_s2:ε/0.7 i_s3:ε/0.9

u_s1:u/1.0

u_s2:ε/0.2 u_s1:ε/0.8

u_s3:ε/0.3 ε:ε/0.1

(28)

Construction of a Speech Recognition System

• Prepare:

• WFST representing phone HMMs: H

• WFST representing pronunciation dictionary: L

• WFST representing word network or N-gram: G

• Compose:

• WFST(Recognition system) =H ・ L ・ G

• The WFST H・L・G has input labels indicating an HMM state name (= an emission distribution), output labels that are a word or ε, and a weight on each arc

(29)

Speech Decoding based on WFST

• Starting from the initial state, the t-th transition with a non-epsilon input label si corresponds to the t-th time frame xt in the input acoustic feature sequence

• By computing the acoustic probability p(xt|si) and adding it to the language probability, a WFSA is obtained

(30)

Search the Best Recognition Hypothesis

• By performing a minimum-cost path search on the obtained WFSA, the word sequence that best matches the input sound is obtained as the sequence of output labels
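A minimal sketch (not from the slides) of a minimum-cost path search over a small, made-up decoding WFSA; arc costs stand in for combined acoustic and language costs, and output labels are words:

import heapq

# Arc: (destination state, cost, output word or None for an epsilon output).
graph = {
    0: [(1, 1.2, "today"), (2, 0.9, "two")],
    1: [(3, 0.5, "is")],
    2: [(3, 1.4, "days")],
    3: [],
}
initial, final = 0, 3

def best_path(graph, initial, final):
    """Dijkstra-style search; returns (total cost, output word sequence)."""
    heap = [(0.0, initial, [])]
    seen = {}
    while heap:
        cost, state, words = heapq.heappop(heap)
        if state == final:
            return cost, words
        if state in seen and seen[state] <= cost:
            continue
        seen[state] = cost
        for dst, w, out in graph[state]:
            heapq.heappush(heap, (cost + w, dst, words + ([out] if out else [])))
    return None

print(best_path(graph, initial, final))   # (1.7, ['today', 'is'])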

(31)
(32)

Tools for WFST

• OpenFST

• http://www.openfst.org

• Graphviz

(33)

Example

wfst1.txt:
0 1 <eps> a 0.5
0 1 C c 0.3
0 2 C <eps> 0.2
1 2 B b 1.0
2 3 A a
3

isym.txt:
<eps> 0
A 1
B 2
C 3

osym.txt:
<eps> 0
a 1
b 2
c 3

$ fstcompile --isymbols=isym.txt --osymbols=osym.txt --keep_isymbols --keep_osymbols wfst1.txt wfst1.fst
