A STATISTICAL LEXICON BASED ON HMMS
Rainer Gruhn, Satoshi Nakamura
ATR Spoken Language Translation Res. Labs.
2-2-2 Hikaridai, Keihanna Gakkentoshi, Kyoto 619-0288, Japan
ABSTRACT
This paper introduces a novel approach towards pronunciation modeling for pronunciation rescoring. Rather than explicitly representing pronunciation variations, a discrete HMM is provided for each word, modeling seen and allowing unseen pronunciation variations. Phone substitutions, deletions and insertions are equally covered. The approach is evaluated on a non-native speakers' speech recognition task.
1. INTRODUCTION
A lot of work has been reported about pronunciation modeling [1]. Many approaches follow a similar basic scheme of comparing manually or automatically generated phoneme transcriptions to some baseline transcription. Variation information can be extracted from the differences. Typically it is represented in the form of rules, which can be weighted based on occurrence frequency, likelihood, confusability or other measures. These rules are applied to a baseline lexicon in order to generate some adapted lexicon or to optimize an acoustic model. Unfortunately, this approach usually brings only little improvement.
In this research, we suggest a new data-driven approach to deal with pronunciation variations. It is based on word-level pronunciation HMMs, which are applied to rescore n-best hypotheses. Our target is to improve the performance of a continuous speech recognition system on a challenging speaker group such as non-native speakers.
Similar to the standard approach, we generate a phonetic transcription with a phoneme recognizer. These phoneme sequences are used as training data for discrete word HMMs; one HMM for each word. There is no attempt to explicitly represent the phoneme variations. Even variations unseen in the training data are allowed, as a certain floor probability exists for all possible phoneme sequences for each word. The HMM training process will implicitly take care of all variation and likelihood issues; unlike in other approaches, quantities such as rule firing frequencies do not have to be calculated.
2. WORD HMMS
As illustrated in Fig. 1, two levels of HMM-based recognition are involved in this approach:
- Acoustic level: phoneme recognition to generate the phoneme sequence S_i from the acoustic features O_i.
- Phoneme label level: For training, the phoneme sequences S_i are considered as input. For all words, a discrete word HMM is trained on all instances of that word in the training data. The models are applied for rescoring, generating a pronunciation score given the observed phoneme sequence S_i and the word sequence.
[Figure 1 diagram: acoustic feature vectors o_1 ... o_12 -> phoneme recognition to generate phoneme sequences -> phonemes s_1, s_2, ... -> train discrete HMM for each word on all instances of that word -> word models w_i1, w_i2, w_i3]
Figure 1: Two layers of HMMs are required to generate pronunciation variants and their likelihoods: an acoustic level for phoneme recognition and the phoneme label level for word model training.
The first step requires a standard HMM acoustic model, and preferably some phoneme bigram language model as phonotactic constraint. The continuous training speech data is segmented to word chunks based on time information generated by Viterbi alignment. Acoustic feature vectors are decoded to a 1-best sequence of phonemes.
For each word in the vocabulary, one discrete untied HMM is generated. Figure 2 shows an example for the word "and".
Enter -> [ae .49, ax .49, ...] -> [n .99, ...] -> [d .99, ...] -> Exit
Figure 2: An example discrete word HMM for the word "and", initialized with two pronunciation variations for the first phoneme.
The models are initialized on the phoneme sequence in some baseline pronunciation lexicon. The number of states for a word model is set to be the number of phonemes in the baseline pronunciation, plus enter and exit states. Each state has a discrete probability distribution of all phonemes, giving the baseline phoneme a high probability and all other phonemes some low but non-zero value. Forward transition between all states is allowed, with initial transition probabilities favouring a path that hits each state once.
4L-1, 66th National Convention of the Information Processing Society of Japan
The probabilities are reestimated on the phoneme sequences of the training data. For each word, all instances in the training data are collected and analyzed. The number of states of each word model remains static. Phoneme deletions are covered by state skip transitions; phoneme insertions are modeled by state self-loop transitions.
Data sparseness is a common problem for automatically trained pronunciation modeling algorithms. In this approach, pronunciations for words that appear sufficiently frequently in the training data are generated in a data-driven manner. For rare words, the algorithm falls back on baseline phoneme sequences from a given lexicon. This combination should make it more robust than, for example, an application of phoneme confusion rules on a lexicon (as e.g. in [2]) could be.
3. EXPERIMENTS
3.1. Phoneme recognition
For evaluation, we used a non-native database collected at ATR and consisting of 11 Japanese speakers of English. About 12 minutes of read speech are available per speaker, which were divided into ten minutes for training and two minutes as test set. The task domain is hotel reservation.
The non-native training data set is segmented into single words based on time information acquired by Viterbi alignment. On these word chunks, phoneme recognition is performed. To achieve higher phoneme recognition accuracy than with monophones, a right-context biphone model is applied. In the resulting phoneme string, the context is not considered, though. The phoneme recognition accuracy for the non-native task is 34.68% relative to the canonical transcription. The biphone acoustic model in this experiment is trained on the Wall Street Journal (WSJ) read speech corpus [3]. The phoneme set consists of 43 phonemes plus silence. In the second level of processing, the rescoring, occurrences of silence are ignored. The HTK toolkit [4] is used for all training and decoding steps.
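The word-chunk phoneme strings described above need to be grouped per word before word models can be trained. The following is a minimal sketch of that bookkeeping, not the authors' tooling (which is HTK-based). The function name `collect_instances`, the silence label `"sil"` and the example phoneme strings are hypothetical, and applying the silence filter while collecting training instances is an assumption; the text only states that silence is ignored at the rescoring level.

```python
from collections import defaultdict

def collect_instances(chunks, silence="sil"):
    """Group recognized phoneme strings by word, dropping silence labels.

    chunks: iterable of (word, phoneme_list) pairs, one per word chunk.
    The label "sil" is an assumed name for the silence model.
    """
    instances = defaultdict(list)
    for word, phonemes in chunks:
        instances[word].append([p for p in phonemes if p != silence])
    return dict(instances)

# Made-up recognizer output for three word chunks (illustration only).
chunks = [
    ("and", ["ae", "n", "d"]),
    ("hotel", ["sil", "hh", "ow", "t", "eh", "l"]),
    ("and", ["ax", "n", "sil"]),
]
instances = collect_instances(chunks)
```

Each entry of `instances` then holds all observed phoneme sequences of one word, which is the input a per-word discrete HMM would be trained on.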
3.2. Word HMM initialization
The discrete probability distribution for each state is initialized depending on the "correct" phoneme sequence(s) as given in the lexicon. The correct phoneme has a probability of 0.99; if more than one pronunciation variant is included in the lexicon, the variations all have the same probability. All other phonemes are assigned some non-zero probability.
The transition probabilities depend on the number of succeeding phonemes in the baseline lexicon. The probability to skip k phonemes is initialized to $0.05^k$. Insertions are allowed with a chance of 0.05. The transition to the next state therefore has a probability of slightly below 0.9.
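The initialization described in this section can be sketched in a few lines of Python. This is a sketch under stated assumptions rather than the authors' implementation: the function name `init_word_hmm` and its parameters are hypothetical, pronunciation variants are assumed to be given as one list of allowed phonemes per position, the remaining 0.01 emission mass is assumed to be spread uniformly over all other phonemes, and the non-emitting enter state is assumed to have no self-loop. The text specifies none of these details.

```python
import numpy as np

def init_word_hmm(positions, phoneme_set, p_correct=0.99, p_insert=0.05, p_skip=0.05):
    """Build initial emission and transition matrices for one word.

    positions: list of lists, the baseline phoneme(s) allowed at each position.
    phoneme_set: list of all phoneme labels.
    """
    n = len(positions)
    index = {p: i for i, p in enumerate(phoneme_set)}

    # Emissions: one row per emitting state, one column per phoneme.
    emit = np.zeros((n, len(phoneme_set)))
    for s, variants in enumerate(positions):
        others = len(phoneme_set) - len(variants)
        emit[s, :] = (1.0 - p_correct) / others        # assumed uniform floor
        for p in variants:
            emit[s, index[p]] = p_correct / len(variants)  # variants share 0.99

    # Transitions over n + 2 states: 0 = enter, 1..n = emitting, n + 1 = exit.
    trans = np.zeros((n + 2, n + 2))
    for s in range(n + 1):
        if s > 0:
            trans[s, s] = p_insert                     # self-loop: insertion
        for k in range(1, n + 1 - s):
            trans[s, s + 1 + k] = p_skip ** k          # skip k phonemes: deletion
        trans[s, s + 1] = 1.0 - trans[s].sum()         # remainder: next state
    return emit, trans

# Toy phoneme inventory (the paper uses 43 phonemes plus silence).
phones = ["ae", "ax", "n", "d", "t", "s"]
emit, trans = init_word_hmm([["ae", "ax"], ["n"], ["d"]], phones)
```

For a three-phoneme word such as "and", the first emitting state moves to the next state with probability 1 - 0.05 - 0.05 - 0.05^2 = 0.8975 under these assumptions, which matches the "slightly below 0.9" stated above.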
3.3. Rescoring
The HMM pronunciation models are applied in the form of rescoring the n-best decoding result. On an utterance in the test data, both a 1-best phoneme recognition and a standard n-best recognition (on word level) are performed. For each of the n-best sequences, we apply a forced alignment using the discrete pronunciation models, the phoneme sequence as input features and the word sequence as labels. The resulting score is the pronunciation score.
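To illustrate how such a pronunciation score can be computed, the sketch below runs a log-domain Viterbi pass of a phoneme sequence through a single discrete word HMM. It is a simplification: the paper aligns the whole phoneme sequence against the full word sequence using HTK, whereas this example scores one word only. The function name `viterbi_log_score`, the state layout (state 0 = enter, last state = exit) and the toy probabilities are assumptions made for illustration.

```python
import numpy as np

def viterbi_log_score(obs, emit, trans):
    """Best-path log probability of phoneme indices `obs` under one word HMM.

    emit:  (n, V) emission probabilities of the n emitting states.
    trans: (n + 2, n + 2) transitions; state 0 = enter, state n + 1 = exit.
    """
    n = emit.shape[0]
    with np.errstate(divide="ignore"):          # log(0) = -inf is intended
        log_e, log_t = np.log(emit), np.log(trans)
    if len(obs) == 0:
        return float(log_t[0, n + 1])           # all phonemes deleted
    delta = log_t[0, 1:n + 1] + log_e[:, obs[0]]
    for o in obs[1:]:
        # cand[r, s]: best path ending in state r, followed by a move to s
        cand = delta[:, None] + log_t[1:n + 1, 1:n + 1]
        delta = cand.max(axis=0) + log_e[:, o]
    return float((delta + log_t[1:n + 1, n + 1]).max())

# Toy three-state model over phonemes 0 = ae, 1 = n, 2 = d (made-up values).
emit = np.full((3, 3), 0.005)
np.fill_diagonal(emit, 0.99)
trans = np.array([
    [0.0, 0.95, 0.05, 0.00, 0.00],   # enter
    [0.0, 0.05, 0.90, 0.05, 0.00],   # state 1
    [0.0, 0.00, 0.05, 0.90, 0.05],   # state 2
    [0.0, 0.00, 0.00, 0.05, 0.95],   # state 3
    [0.0, 0.00, 0.00, 0.00, 0.00],   # exit
])
canonical = viterbi_log_score([0, 1, 2], emit, trans)
deleted_n = viterbi_log_score([0, 2], emit, trans)
```

In this toy model the canonical sequence scores higher than the one with the middle phoneme deleted, but the deletion still receives a finite score through the skip transition.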
The pronunciation score is then combined with the weighted language model score for this hypothesis. The hypothesis achieving the highest total score among the n-best is selected as correct. Figure 3 shows the performance for various language model weights. The best performance is a word error rate (WER) of 29.04%, compared to a baseline performance in this experiment of 32.54%.
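The selection step can be sketched as follows. This is a minimal illustration that assumes the two scores are log-domain values combined by a weighted sum; the text does not spell out the exact combination. The function name `rescore`, the tuple layout and all numbers are hypothetical and are not results from the experiment.

```python
def rescore(hypotheses, lm_weight):
    """Pick the n-best entry with the highest combined score.

    hypotheses: list of (words, pronunciation_log_score, lm_log_score).
    The weighted-sum combination is an assumption of this sketch.
    """
    return max(hypotheses, key=lambda h: h[1] + lm_weight * h[2])

# Made-up scores for two competing hypotheses (illustration only).
n_best = [
    ("i would like a room", -12.0, -20.0),
    ("i would like a loom", -10.0, -26.0),
]
best_pron_only = rescore(n_best, lm_weight=0.0)
best_combined = rescore(n_best, lm_weight=0.5)
```

With a weight of zero only the pronunciation score decides; raising the weight lets the language model override a slightly better pronunciation match, which is the trade-off explored over the language model weights in Figure 3.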
Figure 3: Word error rate for rescoring of n-best based on pronunciation score combined with weighted language model scores.
4. CONCLUSION
Word error rate could be improved by a relative 10.8% with pronunciation rescoring, showing the effectiveness of the approach for non-native speech. The full strength of the approach may not be achieved in this evaluation because of a lack of non-native training data, which frequently forces word models to default to the standard pronunciations. Also, considering the acoustic score together with the pronunciation and language model scores could be a helpful extension.
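As a quick check of the arithmetic, the relative improvement follows from the two word error rates reported in Section 3.3 (32.54% baseline, 29.04% after rescoring):

$$
\frac{32.54 - 29.04}{32.54} = \frac{3.50}{32.54} \approx 0.1076 \approx 10.8\%
$$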
5. ACKNOWLEDGEMENT
The research reported here was supported in part by a contract with the Telecommunications Advancement Organization of Japan entitled "A study of speech dialogue translation technology based on a large corpus".
6. REFERENCES
[1] Helmer Strik and Catia Cucchiarini, "Modeling pronunciation variation for ASR: A survey of the literature," Speech Communication, vol. 29, pp. 225-246, 1999.
[2] Rainer Gruhn, Konstantin Markov, and Satoshi Nakamura, "Probability sustaining phoneme substitution for non-native speech recognition," in Proc. Acoust. Soc. Jap., Fall 2002, pp. 195-196.
[3] D. B. Paul and J. M. Baker, "The design for the Wall Street Journal-based CSR corpus," in Proc. DARPA Workshop, Pacific Grove, CA, 1992, pp. 357-362.
[4] P. Woodland and S. Young, "The HTK tied-state continuous speech recognizer," in Proc. EuroSpeech, 1993, pp. 2207-2210.