A Statistical Lexicon Based on HMMs


Rainer Gruhn, Satoshi Nakamura

ATR Spoken Language Translation Res. Labs.

2-2-2 Hikaridai, Keihanna Gakkentoshi, Kyoto 619-0288, Japan

[email protected]

ABSTRACT

This paper introduces a novel approach to pronunciation modeling for pronunciation rescoring. Rather than explicitly representing pronunciation variations, a discrete HMM is provided for each word, modeling seen pronunciation variations and allowing unseen ones. Phone substitutions, deletions and insertions are equally covered. The approach is evaluated on a non-native speech recognition task.

1. INTRODUCTION

A lot of work has been reported about pronunciation modeling [1]. Many approaches follow a similar basic scheme of comparing manually or automatically generated phoneme transcriptions to some baseline transcription. Variation information can be extracted from the differences. Typically it is represented in the form of rules, which can be weighted based on occurrence frequency, likelihood, confusability or other measures. These rules are applied to a baseline lexicon in order to generate an adapted lexicon or to optimize an acoustic model. Unfortunately, this approach usually brings only little improvement.

In this research, we suggest a new data-driven approach to deal with pronunciation variations. It is based on word-level pronunciation HMMs, which are applied to rescore n-best hypotheses. Our target is to improve the performance of a continuous speech recognition system on a challenging speaker group such as non-native speakers.

Similar to the standard approach, we generate a phonetic transcription with a phoneme recognizer. These phoneme sequences are used as training data for discrete word HMMs, one HMM for each word. There is no attempt to explicitly represent the phoneme variations. Even variations unseen in the training data are allowed, as a certain floor probability exists for all possible phoneme sequences for each word. The HMM training process implicitly takes care of all variation and likelihood issues; unlike in other approaches, e.g. rule firing frequencies do not have to be calculated.

2. WORD HMMS

As illustrated in Fig. 1, two levels of HMM-based recognition are involved in this approach:

Acoustic level: phoneme recognition to generate the phoneme sequence S_i from the acoustic features O_i.

Phoneme label level: For training, the phoneme sequences S_i are considered as input. For all words, a discrete word HMM is trained on all instances of that word in the training data. The models are applied for rescoring, generating a pronunciation score given the observed phoneme sequence S_i and the word sequence.

[Figure 1 diagram: acoustic feature vectors o_1 ... o_12 are decoded by phoneme recognition into phoneme sequences (s_1 s_2 s_1 s_3 s_4 s_3); a discrete HMM is trained for each word (w_i^1, w_i^2, w_i^3) on all instances of that word.]

Figure 1: Two layers of HMMs are required to generate pronunciation variants and their likelihoods: an acoustic level for phoneme recognition and the phoneme label level for word model training.

The first step requires a standard HMM acoustic model, and preferably some phoneme bigram language model as phonotactic constraint. The continuous training speech data is segmented into word chunks based on time information generated by Viterbi alignment. Acoustic feature vectors are decoded to a 1-best sequence of phonemes.

For each word in the vocabulary, one discrete untied HMM is generated. Figure 2 shows an example for the word "and".

[Figure 2 diagram: Enter -> state 1 (ae .49, ax .49, ...) -> state 2 (n .99, ...) -> state 3 (d .99, ...) -> Exit]

Figure 2: An example discrete word HMM for the word "and", initialized with two pronunciation variations for the first phoneme.

The models are initialized on the phoneme sequence in some baseline pronunciation lexicon. The number of states for a word model is set to be the number of phonemes in the baseline pronunciation, plus enter and exit states. Each state has a discrete probability distribution over all phonemes, giving the baseline phoneme a high probability and all other phonemes some low but non-zero value. Forward transition between all states is allowed, with initial transition probabilities favouring a path that hits each state once.

4L-1, 66th National Convention of the Information Processing Society of Japan (IPSJ), p. 2-37

The probabilities are reestimated on the phoneme sequences of the training data. For each word, all instances in the training data are collected and analyzed. The number of states of each word model remains static. Phoneme deletions are covered by state skip transitions; phoneme insertions are modeled by state self-loop transitions.
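The role of skip and self-loop transitions can be illustrated with a minimal sketch of the forward algorithm on a toy word HMM. It does not show the reestimation itself, and the function name `forward_prob`, the 4-symbol inventory and all probability values are illustrative assumptions, not the settings used in the experiments.

```python
import numpy as np

def forward_prob(obs, emis, trans):
    """Probability of a phoneme index sequence under one word HMM.

    State 0 is the non-emitting enter state, states 1..n emit phonemes,
    state n+1 is the non-emitting exit state.
    """
    n = emis.shape[0]
    inner = trans[1:n + 1, 1:n + 1]              # emitting -> emitting
    alpha = trans[0, 1:n + 1] * emis[:, obs[0]]  # leave the enter state
    for o in obs[1:]:
        alpha = (alpha @ inner) * emis[:, o]
    return float(alpha @ trans[1:n + 1, n + 1])  # reach the exit state

# Toy 3-phoneme word over a 4-symbol inventory (symbol 3 = "some other phoneme").
# All numbers below are assumed for illustration only.
emis = np.full((3, 4), 0.01 / 3)
emis[[0, 1, 2], [0, 1, 2]] = 0.99
trans = np.array([
    [0, 0.95, 0.05, 0.00, 0.00],   # enter
    [0, 0.05, 0.90, 0.05, 0.00],   # state 1: self-loop, next, skip
    [0, 0.00, 0.05, 0.90, 0.05],   # state 2
    [0, 0.00, 0.00, 0.05, 0.95],   # state 3
    [0, 0.00, 0.00, 0.00, 0.00],   # exit
])

p_canonical = forward_prob([0, 1, 2], emis, trans)
p_deletion = forward_prob([0, 2], emis, trans)          # middle phoneme dropped
p_insertion = forward_prob([0, 3, 1, 2], emis, trans)   # extra phoneme inserted
print(p_canonical, p_deletion, p_insertion)
```

In this sketch the shortened sequence is reachable only through a skip transition and the lengthened one only through a self-loop, yet both receive a non-zero probability.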

Data sparseness is a common problem for automatically trained pronunciation modeling algorithms. In this approach, pronunciations for words that appear sufficiently frequently in the training data are generated in a data-driven manner. For rare words, the algorithm falls back on baseline phoneme sequences from a given lexicon. This combination should make it more robust than, for example, an application of phoneme confusion rules on a lexicon (as e.g. in [2]) could be.
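One way to picture this fallback is a simple count threshold, sketched below. The paper does not state how "sufficiently frequent" is decided, so the function `split_vocabulary`, the threshold `min_count` and the toy word counts are all hypothetical.

```python
from collections import Counter

def split_vocabulary(training_words, vocabulary, min_count=10):
    """Split the vocabulary into words whose HMM is retrained on data
    and words that keep the baseline lexicon initialization.

    min_count is a hypothetical threshold, not a value from the paper.
    """
    counts = Counter(training_words)
    data_driven = {w for w in vocabulary if counts[w] >= min_count}
    fallback = set(vocabulary) - data_driven
    return data_driven, fallback

# Toy word tokens from a hotel reservation task (made-up counts).
tokens = ["and"] * 12 + ["room"] * 10 + ["suite"] * 2
data_driven, fallback = split_vocabulary(tokens, {"and", "room", "suite", "sauna"})
print(sorted(data_driven), sorted(fallback))
```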

3. EXPERIMENTS

3.1. Phoneme recognition

For evaluation, we used a non-native database collected at ATR and consisting of 11 Japanese speakers of English. About 12 minutes of read speech are available per speaker, which were divided into ten minutes for training and two minutes as test set. The task domain is hotel reservation.

The non-native training data set is segmented into single words based on time information acquired by Viterbi alignment. On these word chunks, phoneme recognition is performed. To achieve higher phoneme recognition accuracy than with monophones, a right-context biphone model is applied. In the resulting phoneme string, the context is not considered, though. The phoneme recognition accuracy for the non-native task is 34.68% relative to the canonical transcription. The biphone acoustic model in this experiment is trained on the Wall Street Journal (WSJ) read speech corpus [3]. The phoneme set consists of 43 phonemes plus silence. In the second level of processing, the rescoring, occurrences of silence are ignored. The HTK toolkit [4] is used for all training and decoding steps.
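As a rough illustration of how a phoneme accuracy relative to a canonical transcription can be computed, the sketch below uses the common definition (N - S - D - I) / N with a minimum edit distance alignment. That this is the exact measure behind the figure above is an assumption; the function name and the toy sequences are illustrative.

```python
def phoneme_accuracy(reference, hypothesis):
    """Accuracy = (N - S - D - I) / N, where S + D + I is the minimum
    number of substitutions, deletions and insertions (edit distance)."""
    n, m = len(reference), len(hypothesis)
    dist = list(range(m + 1))            # distances for the empty reference
    for i in range(1, n + 1):
        prev_diag, dist[0] = dist[0], i
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            prev_diag, dist[j] = dist[j], min(
                dist[j] + 1,             # deletion of a reference phoneme
                dist[j - 1] + 1,         # insertion of an extra phoneme
                prev_diag + cost,        # match or substitution
            )
    return (n - dist[m]) / n

# Canonical "and" versus a recognized sequence with one substitution
# and one deletion (toy example).
acc = phoneme_accuracy(["ae", "n", "d"], ["ax", "n"])
print(acc)
```

Note that this measure can become negative when many phonemes are inserted.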

3.2. Word HMM initialization

The discrete probability distribution for each state is initialized depending on the "correct" phoneme sequence(s) as given in the lexicon. The correct phoneme has a probability of 0.99; if more than one pronunciation variant is included in the lexicon, the variations all have the same probability. All other phonemes are assigned some non-zero probability.

The transition probabilities depend on the number of succeeding phonemes in the baseline lexicon. The probability to skip k phonemes is initialized to 0.05^k. Insertions are allowed with a chance of 0.05. The transition to the next state therefore has a probability of slightly below 0.9.
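The initialization described in this section can be sketched as follows. Several details are assumptions rather than statements from the text: the 0.99 mass is split evenly among lexicon variants, the remaining 0.01 is spread uniformly over all other phonemes, the enter state has no self-loop, and the toy phoneme inventory is much smaller than the 43-phoneme set used in the experiments.

```python
import numpy as np

def init_word_hmm(variants, phoneme_set, p_correct=0.99, p_skip=0.05, p_insert=0.05):
    """Initialize emission and transition matrices for one word.

    variants: one set of lexicon phonemes per baseline position.
    States: 0 = enter, 1..n = phoneme states, n+1 = exit.
    """
    n = len(variants)
    index = {p: i for i, p in enumerate(phoneme_set)}
    emis = np.zeros((n, len(phoneme_set)))
    for j, allowed in enumerate(variants):
        rest = len(phoneme_set) - len(allowed)
        emis[j, :] = (1.0 - p_correct) / rest       # assumed uniform floor
        for p in allowed:
            emis[j, index[p]] = p_correct / len(allowed)
    trans = np.zeros((n + 2, n + 2))
    for i in range(n + 1):
        if i > 0:
            trans[i, i] = p_insert                  # self-loop: insertion
        for k in range(1, n + 1 - i):               # skip k phonemes
            trans[i, i + 1 + k] = p_skip ** k
        trans[i, i + 1] = 1.0 - trans[i].sum()      # remainder: next state
    return emis, trans

PHONEMES = ["ae", "ax", "n", "d", "t", "ih"]        # toy inventory
emis, trans = init_word_hmm([{"ae", "ax"}, {"n"}, {"d"}], PHONEMES)
print(round(trans[1, 2], 4), round(emis[0, 0], 4))
```

For the first phoneme state of a three-phoneme word, the self-loop (0.05), the one-phoneme skip (0.05) and the two-phoneme skip (0.0025) leave 0.8975 for the transition to the next state, which matches "slightly below 0.9".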

3.3. Rescoring

The HMM pronunciation models are applied in the form of rescoring the n-best decoding result. On an utterance in the test data, both a 1-best phoneme recognition and a standard n-best recognition (on word level) are performed. For each of the n-best sequences, we apply a forced alignment using the discrete pronunciation models, the phoneme sequence as input features and the word sequence as labels. The resulting score is the pronunciation score.

The pronunciation score is combined with the weighted language model score for this hypothesis. The hypothesis achieving the highest total score among the n-best is selected as correct. Figure 3 shows the performance for various language model weights. The best performance is 29.04% word error rate (WER), compared to a baseline performance in this experiment of 32.54%.
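The selection step can be sketched as below, assuming both scores are log-probabilities that are combined additively with a single language model weight. The `Hypothesis` type, the function name and all score values are hypothetical and only illustrate how the weight shifts the decision.

```python
from typing import List, NamedTuple

class Hypothesis(NamedTuple):
    words: str
    pron_score: float   # log score from forced alignment with the word HMMs
    lm_score: float     # log score from the language model

def rescore_nbest(nbest: List[Hypothesis], lm_weight: float) -> Hypothesis:
    """Return the hypothesis with the highest combined score."""
    return max(nbest, key=lambda h: h.pron_score + lm_weight * h.lm_score)

# Made-up scores for a two-entry n-best list.
nbest = [
    Hypothesis("i would like a room", -42.0, -10.0),
    Hypothesis("i would like a loom", -40.0, -14.0),
]
best_pron_only = rescore_nbest(nbest, lm_weight=0.0)
best_combined = rescore_nbest(nbest, lm_weight=1.0)
print(best_pron_only.words, "|", best_combined.words)
```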

Figure 3: Word error rate for rescoring of n-best based on pronunciation score combined with weighted language model scores.

4. CONCLUSION

Word error rate could be improved by a relative 10.8% with pronunciation rescoring, showing the effectiveness of the approach for non-native speech. The full strength of the approach may not be achieved in this evaluation because of a lack of non-native training data, which frequently forces word models to default to the standard pronunciations. Also, considering the acoustic score together with the pronunciation and language model scores could be a helpful extension.
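The relative improvement quoted in this section follows from the two word error rates reported in Section 3.3; the symbol names below are chosen for illustration:

```latex
\[
\frac{\mathrm{WER}_{\mathrm{baseline}} - \mathrm{WER}_{\mathrm{rescored}}}{\mathrm{WER}_{\mathrm{baseline}}}
= \frac{32.54 - 29.04}{32.54}
= \frac{3.50}{32.54}
\approx 0.108
\]
```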

5. ACKNOWLEDGEMENT

The research reported here was supported in part by a contract with the Telecommunications Advancement Organization of Japan entitled, "A study of speech dialogue translation technology based on a large corpus".

6. REFERENCES

[1] Helmer Strik and Catia Cucchiarini, "Modeling pronunciation variation for ASR: A survey of the literature," Speech Communication, vol. 29, pp. 225-246, 1999.

[2] Rainer Gruhn, Konstantin Markov, and Satoshi Nakamura, "Probability sustaining phoneme substitution for non-native speech recognition," in Proc. Acoust. Soc. Jap., Fall 2002, pp. 195-196.

[3] D. B. Paul and J. M. Baker, "The design for the Wall Street Journal based CSR corpus," in Proc. DARPA Workshop, Pacific Grove, CA, 1992, pp. 357-362.

[4] P. Woodland and S. Young, "The HTK tied-state continuous speech recognizer," in Proc. EuroSpeech, 1993, pp. 2207-2210.
