An Evaluation of Discriminative Training for Hidden Markov Models in a Real-Environment Speech-Oriented Guidance System


Denis Babani*, Tomoki Toda*, Hiroshi Saruwatari*, Kiyohiro Shikano*
*Nara Institute of Science and Technology, Takayama 8916-5, Ikoma city, Nara Prefecture, Japan

Abstract: This paper presents experimental evaluations of Maximum Mutual Information discriminative training of the acoustic model in a real-environment speech-oriented guidance system, "Takemaru-kun".

I. INTRODUCTION

We have developed a real-environment speech-oriented guidance system, "Takemaru-kun" [1], based on SDS (Spoken Dialog System) techniques consisting of large-vocabulary continuous speech recognition (LVCSR), inquiry classification, and text-to-speech (TTS) components. Since the first day of operation, this system has been used as a source of spontaneous Japanese speech data.

In this work we investigate the benefits of using Maximum Mutual Information (MMI) training [2] on such real-environment, spontaneous speech data recorded by the "Takemaru-kun" system. The evaluation is done by changing various initial and training conditions and then comparing the word accuracy rate. The experimental results show that MMI training yields significant improvements in the word accuracy of the "Takemaru-kun" system compared with conventional Maximum Likelihood Estimation (MLE) training.

II. SPEECH-ORIENTED GUIDANCE SYSTEM "TAKEMARU-KUN"

"Takemaru-kun" is a speech-guidance system installed in the North community center of Ikoma city, Nara prefecture, Japan. The purpose of this system is to handle queries about the agent, general information, and the surrounding area. Since the installation day, the "Takemaru-kun" system has been recording user inquiries; however, so far only the first two years have been completely transcribed by humans. These utterances have been labeled as speech, noise, or partially speech, and have been subjectively classified into five groups related to the age of the speaker (i.e., preschool children, lower grade school children, higher grade school children, adults, and elderly persons).

Utterances recorded by "Takemaru-kun" are usually short. Because it is recorded in a real environment, the speech data is usually not clean: it contains background noise and even overlapping speech from multiple speakers. For this reason, this corpus is adequate for evaluating MMI performance in real-case scenarios.

III. EVALUATION

A. Experimental Conditions

Evaluations were conducted separately for each speaker group. A common dictionary with 58k words was created in order to have zero OOV (out-of-vocabulary) words for the training utterances. Acoustic models were built from scratch for each speaker group using their corresponding training utterances. All acoustic models consisted of 3-state left-to-right triphone HMMs, with GMMs as the output probability density. The acoustic feature vector was a 25-dimensional vector including ΔE (energy), 12 MFCC, and 12 ΔMFCC.

B. Evaluation Results

The first comparison was done by increasing the language scale factor, which gave significant improvements in word accuracy. This setting is responsible for including more acoustically competitive hypotheses in the word lattices and improving the discriminability of the acoustic model. During this evaluation we observed that a 2-gram language model yielded better word accuracy than a 1-gram one. These results are not consistent with [2]. The type of language model used in lattice generation is strictly connected with the generalization of MMI training. However, since the performance of the acoustic model in [...] constraints are still effective for improving the quality of the word lattices.

Changing some parameters of the model update, such as the I-smoothing parameter and the acoustic scale factor, showed no significant differences in word accuracy for all speaker groups. These results have also been observed in the other speaker groups.

Table 1 shows the increase in word accuracy from MLE training to MMI training in every speaker group. We can observe that MMI training yields word accuracy improvements of around 2% absolute in total.

TABLE 1. COMPARISON OF WORD ACCURACY [%] BETWEEN MLE AND MMI
(Speaker-group row labels are not recoverable; values are listed in source order.)
75.36, 77.45, 45.16, 45.62, 44.16, 45.46, 61.11, 63.82, 67.14, 68.75, 64.54

IV. CONCLUSIONS

Based on the results of this evaluation, we can say that MMI training gives significant improvements in word accuracy even on real-environment spontaneous speech data.

V. ACKNOWLEDGMENT

The authors are grateful to Dr. Erik McDermott for his invaluable advice concerning discriminative training techniques.

REFERENCES

[1] R. Nishimura, Y. Nishihara, R. Tsurumi, A. Lee, H. Saruwatari, and K. Shikano, "Takemaru-kun: Speech-oriented Information System for Real World Research Platform," International Workshop on Language Understanding and Agents for Real World Interaction, pp. 70-78, 2003.
[2] P.C. Woodland and D. Povey, "Large scale discriminative training of hidden Markov models for speech recognition," Computer Speech and Language, vol. 16, no. 1, pp. 25-47, 2002.

Proceedings of the Second APSIPA Annual Summit and Conference (Student Symposium), page 8, Biopolis, Singapore, 14-17 December 2010.
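
The evaluation above compares systems by word accuracy, but the paper does not spell out how that figure is computed. The sketch below assumes the commonly used definition, (N - S - D - I) / N, where N is the number of reference words and S, D, and I are the substitutions, deletions, and insertions in a minimum-edit alignment. The function names `edit_distance` and `word_accuracy` are hypothetical and are not taken from the authors' toolchain.

```python
def edit_distance(ref, hyp):
    """Minimum number of substitutions, deletions and insertions
    needed to turn the word list `ref` into the word list `hyp`."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference words
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_accuracy(pairs):
    """Word accuracy in percent over (reference, hypothesis) word lists.

    Assumes the definition (N - S - D - I) / N; this is an assumption,
    since the paper does not state which scoring rule was used.
    """
    total_words = sum(len(ref) for ref, _ in pairs)
    total_errors = sum(edit_distance(ref, hyp) for ref, hyp in pairs)
    return 100.0 * (total_words - total_errors) / total_words
```

Calling `word_accuracy` with one list of (reference, hypothesis) pairs per speaker group would give per-group figures comparable in kind to those in Table 1. Because insertions are counted as errors, the value can be negative when the hypothesis is much longer than the reference.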
