
A STATISTICAL LEXICON BASED ON HMMS

Rainer Gruhn, Satoshi Nakamura

ATR Spoken Language Translation Res. Labs.

2-2-2 Hikaridai, Keihanna Gakkentoshi, Kyoto 619-0288, Japan

[email protected]

ABSTRACT

This paper introduces a novel approach to pronunciation modeling for pronunciation rescoring. Rather than explicitly representing pronunciation variations, a discrete HMM is provided for each word, modeling seen and allowing unseen pronunciation variations. Phone substitutions, deletions and insertions are equally covered. The approach is evaluated on a non-native speaker speech recognition task.

1. INTRODUCTION

A lot of work has been reported on pronunciation modeling [1]. Many approaches follow a similar basic scheme of comparing manually or automatically generated phoneme transcriptions to some baseline transcription. Variation information can be extracted from the differences. Typically it is represented in the form of rules, which can be weighted based on occurrence frequency, likelihood, confusability or other measures. These rules are applied to a baseline lexicon in order to generate some adapted lexicon or to optimize an acoustic model. Unfortunately, this approach usually brings only little improvement.

In this research, we suggest a new data-driven approach to deal with pronunciation variations. It is based on word-level pronunciation HMMs, which are applied to rescore n-best hypotheses. Our target is to improve the performance of a continuous speech recognition system on a challenging speaker group such as non-native speakers.

Similar to the standard approach, we generate a phonetic transcription with a phoneme recognizer. These phoneme sequences are used as training data for discrete word HMMs; one HMM for each word. There is no attempt to explicitly represent the phoneme variations. Even variations unseen in the training data are allowed, as a certain floor probability exists for all possible phoneme sequences for each word. The HMM training process will implicitly take care of all variation and likelihood issues; unlike in other approaches, e.g. rule firing frequencies do not have to be calculated.

2. WORD HMMS

As illustrated in Fig. 1, two levels of HMM-based recognition are involved in this approach:

- Acoustic level: phoneme recognition to generate the phoneme sequence S_i from the acoustic features O_i.

- Phoneme label level: For training, the phoneme sequences S_i are considered as input. For all words, a discrete word HMM is trained on all instances of that word in the training data. The models are applied for rescoring, generating a pronunciation score given the observed phoneme sequence S_i and the word sequence.

[Figure 1 diagram: acoustic feature vectors o_1 ... o_12 -> phoneme recognition to generate phoneme sequences -> phonemes s_1 ... s_4 -> train discrete HMM for each word on all instances of that word -> words w_i1, w_i2, w_i3]

Figure 1: Two layers of HMMs are required to generate pronunciation variants and their likelihoods: an acoustic level for phoneme recognition and the phoneme label level for word model training.

The first step requires a standard HMM acoustic model, and preferably some phoneme bigram language model as phonotactic constraint. The continuous training speech data is segmented into word chunks based on time information generated by Viterbi alignment. Acoustic feature vectors are decoded to a 1-best sequence of phonemes.

For each word in the vocabulary, one discrete untied HMM is generated. Figure 2 shows an example for the word "and".

[Figure 2 diagram: Enter -> state 1 (ae .49, ax .49, ...) -> state 2 (n .99, ...) -> state 3 (d .99, ...) -> Exit]

Figure 2: An example discrete word HMM for the word "and", initialized with two pronunciation variations for the first phoneme.

The models are initialized on the phoneme sequence in some baseline pronunciation lexicon. The number of states for a word model is set to be the number of phonemes in the baseline pronunciation, plus enter and exit states. Each state has a discrete probability distribution over all phonemes, giving the baseline phoneme a high probability and all other phonemes some low but non-zero value. Forward transition between all states is allowed, with initial transition probabilities favouring a path that hits each state once.

4L-1, 66th National Convention of the Information Processing Society of Japan (IPSJ)

The probabilities are reestimated on the phoneme sequences of the training data. For each word, all instances in the training data are collected and analyzed. The number of states of each word model remains static. Phoneme deletions are covered by state skip transitions; phoneme insertions are modeled by state self-loop transitions.

Data sparseness is a common problem for automatically trained pronunciation modeling algorithms. In this approach, pronunciations for words that appear sufficiently frequently in the training data are generated in a data-driven manner. For rare words, the algorithm falls back on baseline phoneme sequences from a given lexicon. This combination should make it more robust than, for example, an application of phoneme confusion rules on a lexicon (as e.g. in [2]) could.
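The fallback described above can be pictured with a small sketch. The paper does not state how "sufficiently frequent" is decided, so the count threshold `min_count` and the function name `choose_model_source` below are purely illustrative assumptions, not the authors' procedure.

```python
from collections import Counter

def choose_model_source(training_words, vocabulary, min_count=5):
    """Decide per word whether its pronunciation model is data-driven or baseline.

    min_count is a hypothetical threshold; the paper gives no value.
    """
    counts = Counter(training_words)
    return {
        word: "data-driven" if counts[word] >= min_count else "baseline"
        for word in vocabulary
    }

# Toy word stream standing in for the segmented training data.
stream = ["room"] * 7 + ["reservation"] * 2
sources = choose_model_source(stream, ["room", "reservation", "twin"])
```

In this toy run, only "room" would be treated as data-driven; "reservation" and the unseen "twin" would keep the lexicon pronunciation.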

3. EXPERIMENTS

3.1. Phoneme recognition

For evaluation, we used a non-native database collected at ATR and consisting of 11 Japanese speakers of English. About 12 minutes of read speech are available per speaker, which were divided into ten minutes for training and two minutes as test set. The task domain is hotel reservation.

The non-native training data set is segmented into single words based on time information acquired by Viterbi alignment. On these word chunks, phoneme recognition is performed. To achieve higher phoneme recognition accuracy than with monophones, a right-context biphone model is applied. In the resulting phoneme string, the context is not considered, though. The phoneme recognition accuracy for the non-native task is 34.68% relative to the canonical transcription. The biphone acoustic model in this experiment is trained on the Wall Street Journal (WSJ) read speech corpus [3]. The phoneme set consists of 43 phonemes plus silence. In the second level of processing, the rescoring, occurrences of silence are ignored. The HTK toolkit [4] is used for all training and decoding steps.

3.2. Word HMM initialization

The discrete probability distribution for each state is initialized depending on the "correct" phoneme sequence(s) as given in the lexicon. The correct phoneme has a probability of 0.99; if more than one pronunciation variant is included in the lexicon, the variations all have the same probability. All other phonemes are assigned some non-zero probability.

The transition probabilities depend on the number of succeeding phonemes in the baseline lexicon. The probability to skip k phonemes is initialized to 0.05^k. Insertions are allowed with a chance of 0.05. The transition to the next state therefore has a probability of slightly below 0.9.
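The initialization described in this section can be sketched as follows. This is a minimal illustration, not the HTK setup used in the experiments. The function name `init_word_hmm`, the toy six-phoneme inventory, the equal split of 0.99 among lexicon variants, and the equal split of the remaining mass among all other phonemes are assumptions; the paper only says the other phonemes get "some non-zero probability".

```python
import numpy as np

def init_word_hmm(baseline, phoneme_set, p_correct=0.99, p_skip=0.05, p_insert=0.05):
    """Build initial transition and emission matrices for one word.

    baseline: one set of lexicon phonemes per position, e.g. [{"ae", "ax"}, {"n"}, {"d"}].
    State 0 is the enter state, states 1..n emit phonemes, state n+1 is the exit state.
    """
    n = len(baseline)
    col = {p: j for j, p in enumerate(phoneme_set)}

    # Emission: the lexicon phoneme(s) share p_correct; all other phonemes
    # share the remainder equally (the equal split is an assumption).
    emit = np.empty((n, len(phoneme_set)))
    for s, allowed in enumerate(baseline):
        emit[s, :] = (1.0 - p_correct) / (len(phoneme_set) - len(allowed))
        for p in allowed:
            emit[s, col[p]] = p_correct / len(allowed)

    trans = np.zeros((n + 2, n + 2))
    for i in range(n + 1):                      # the exit state has no outgoing arcs
        if i >= 1:
            trans[i, i] = p_insert              # self-loop models an inserted phoneme
        for k in range(1, n + 1 - i):           # skipping k phonemes lands on state i+1+k
            trans[i, i + 1 + k] = p_skip ** k
        trans[i, i + 1] = 1.0 - trans[i].sum()  # remaining mass goes to the next state
    return trans, emit

# Toy inventory for illustration; the experiments use 43 phonemes plus silence.
phonemes = ["ae", "ax", "n", "d", "t", "ih"]
trans, emit = init_word_hmm([{"ae", "ax"}, {"n"}, {"d"}], phonemes)
```

For the word "and", the first emitting state then has a self-loop of 0.05, skips of 0.05 and 0.0025, and a next-state probability of 0.8975, in line with the "slightly below 0.9" stated above.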

3.3. Rescoring

The HMM pronunciation models are applied in the form of rescoring the n-best decoding result. On an utterance in the test data, both a 1-best phoneme recognition and a standard n-best recognition (on word level) are performed. For each of the n-best sequences, we apply a forced alignment using the discrete pronunciation models, with the phoneme sequence as input features and the word sequence as labels. The resulting score is the pronunciation score.
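To make the pronunciation score concrete, the sketch below computes a Viterbi log score for a single discrete word HMM with non-emitting enter and exit states. It is a simplified stand-in for the HTK forced alignment used in the paper: it handles one word rather than a concatenated word sequence, and the function name `viterbi_log_score` and the toy two-phoneme model are assumptions made for illustration.

```python
import numpy as np

def viterbi_log_score(obs, trans, emit):
    """Best-path log probability of a phoneme index sequence through one word HMM.

    trans: (n+2, n+2) matrix, state 0 = enter, states 1..n emit, state n+1 = exit.
    emit:  (n, V) discrete emission probabilities of the n emitting states.
    """
    n = emit.shape[0]
    with np.errstate(divide="ignore"):          # log(0) = -inf marks forbidden arcs
        log_t = np.log(trans)
        log_e = np.log(emit)
    # Leave the enter state and emit the first observed phoneme.
    delta = log_t[0, 1:n + 1] + log_e[:, obs[0]]
    for o in obs[1:]:
        # Best predecessor for every emitting state, then emit phoneme o.
        delta = np.max(delta[:, None] + log_t[1:n + 1, 1:n + 1], axis=0) + log_e[:, o]
    # Finish by moving into the exit state.
    return float(np.max(delta + log_t[1:n + 1, n + 1]))

# Toy word with baseline pronunciation [0, 1] over a three-phoneme inventory.
toy_trans = np.array([
    [0.0, 0.95, 0.05, 0.00],   # enter: next state, or skip the first phoneme
    [0.0, 0.05, 0.90, 0.05],   # state 1: self-loop, next state, skip to exit
    [0.0, 0.00, 0.05, 0.95],   # state 2: self-loop, exit
    [0.0, 0.00, 0.00, 0.00],   # exit
])
toy_emit = np.array([
    [0.990, 0.005, 0.005],
    [0.005, 0.990, 0.005],
])
canonical = viterbi_log_score([0, 1], toy_trans, toy_emit)
deletion = viterbi_log_score([1], toy_trans, toy_emit)          # first phoneme dropped
insertion = viterbi_log_score([0, 0, 1], toy_trans, toy_emit)   # first phoneme repeated
```

The canonical sequence scores highest, while the deletion and insertion variants still receive finite scores through the skip and self-loop transitions.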

The pronunciation score is combined with the weighted language model score for this hypothesis. The hypothesis achieving the highest total score among the n-best is selected as correct. Figure 3 shows the performance for various language model weights. The best performance is 29.04% word error rate (WER), compared to a baseline performance in this experiment of 32.54%.
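The selection step can be sketched as below, assuming both scores are log scores and are combined additively as pronunciation score plus weight times language model score. The additive form, the `Hypothesis` container and the example scores are illustrative assumptions; the paper does not give the exact combination formula.

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    words: tuple
    pron_score: float   # log score from the word HMM alignment
    lm_score: float     # log score from the language model

def rescore(nbest, lm_weight):
    """Return the hypothesis with the highest combined score (assumed additive)."""
    return max(nbest, key=lambda h: h.pron_score + lm_weight * h.lm_score)

# Made-up scores for two competing hypotheses.
nbest = [
    Hypothesis(("i", "want", "a", "room"), pron_score=-12.0, lm_score=-8.0),
    Hypothesis(("i", "want", "the", "room"), pron_score=-11.0, lm_score=-10.0),
]
best_pron_only = rescore(nbest, lm_weight=0.0)
best_combined = rescore(nbest, lm_weight=1.0)
```

With a weight of zero the pronunciation score alone decides; raising the weight lets the language model overrule it, which is the trade-off explored across weights in Figure 3.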

Figure 3: Word error rate for rescoring of n-best based on pronunciation score combined with weighted language model scores.

4. CONCLUSION

Word error rate could be improved by a relative 10.8% with pronunciation rescoring, showing the effectiveness of the approach for non-native speech. The full strength of the approach may not be achieved in this evaluation because of lack of non-native training data, which frequently forces word models to default to the standard pronunciations. Also, considering the acoustic score together with pronunciation and language model score could be a helpful extension.
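The relative improvement follows from the two word error rates reported in Section 3.3; the subscripted names are introduced here only for readability:

```latex
\[
\frac{\mathrm{WER}_{\mathrm{baseline}} - \mathrm{WER}_{\mathrm{rescored}}}{\mathrm{WER}_{\mathrm{baseline}}}
= \frac{32.54 - 29.04}{32.54}
= \frac{3.50}{32.54}
\approx 0.108
\]
```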

5. ACKNOWLEDGEMENT

The research reported here was supported in part by a contract with the Telecommunications Advancement Organization of Japan entitled, "A study of speech dialogue translation technology based on a large corpus".

6. REFERENCES

[1] Helmer Strik and Catia Cucchiarini, "Modeling pronunciation variation for ASR: A survey of the literature," Speech Communication, vol. 29, pp. 225-246, 1999.

[2] Rainer Gruhn, Konstantin Markov, and Satoshi Nakamura, "Probability sustaining phoneme substitution for non-native speech recognition," in Proc. Acoust. Soc. Jap., Fall 2002, pp. 195-196.

[3] D. B. Paul and J. M. Baker, "The design for the Wall Street Journal-based CSR corpus," in Proc. DARPA Workshop, Pacific Grove, CA, 1992, pp. 357-362.

[4] P. Woodland and S. Young, "The HTK tied-state continuous speech recognizer," in Proc. EuroSpeech, 1993, pp. 2207-2210.
