• 検索結果がありません。

Development of Preschool Children Subsystem for ASR and Q&A in a Real-Environment Speech-Oriented Guidance Task

N/A
N/A
Protected

Academic year: 2021

シェア "Development of Preschool Children Subsystem for ASR and Q&A in a Real-Environment Speech-Oriented Guidance Task"

Copied!
4
0
0

読み込み中.... (全文を見る)

全文

(1)ゆ. INTERSPEECH 2007. Development of Preschool Children Subsystem for ASR and Q&A in a Real-Environment Speech-oriented Guidance Task Tobias Cincarek, /zumi Shindo, Tomoki Toda, Hiroshi Saruwatari, Kiyohiro Shikano Graduate School of Information Science Nara Institute of Science and Technology, Japan. {cincar-t,shikano}@is.naist.jp Abstract. Table 1: Real-environment, spontaneous speech data collected with the speech-oriented guidance system Takemaru-kul1 were employed for preschool children experiments. Th巴 development of a module for speech recognition and an­ swer generation for preschool children for a speech-oriented guidance system is described. This topic requires extra treat­ ment because the performance is still disproportionally low to children of higher age, there is a growing business demand and only relatively few research on preschool children ASR has been carried out. This is especially true for building practical applications. A real-environm巴nt speech database with more than 12,000 utterances of Japanese preschool children and more than 60,000 utterances of school children are employed for sys­ tem development. The difference between preschool children's and standard pronunciation is narrow巴d by introducing uniform reference transcriptions and pronunciation modeling. Further­ more, language and acoustic model are optimized. Final evalua­ tion shows, that the speech-oriented guidance system's response accuracy can be improved by more than 12% absolute. Index Terms: preschool children, acoustic model, selectiv巴 training, pronunciation modeling, Q&A. |. Speaker Group / Data Child回n Data (P+L+H) (P)reschool Children (L)o岬er Grade Children (H)igher Grade Childr官n Evaluation (Preschool). tems for children using speech recognition technology, e・E・ [1, 10, 11]. Furthermore, only a few of previous investigations focus on spontaneous speech, e.g. [4], while most of the oth­ ers are restnct巴d to read speech, e.g. [2, 3, 6, 8]. Sp巴ech ma­ terial employed for system development in this paper are from preschool children. Real-environment, spontaneous speech data was collected with the public speech-oriented guidance system “Takemaru-kun " [11, 12]. After d巴scribing the baseline system built with just a few months of real-environment spontaneous children speech data, the system is adapted for preschool chil­ dren using real speech data which has been collected over two years. Besides more tolerant pronunciation modeling, optimiza­ tion of the acoustic and the language model are considered. 1. Introduction Children have already been using video game devices, comput­ ers and cellular phones for a long period of time. The group of potential users is b巴coming younger and speech recognition technology has be氾n started to be employed for such devices mr巴cent years. However, difficulties especially regarding the recognition performance remain. Th巴 younger the user, the more difficult is the recognition of speech. Previous research on children speech recognition has most often been mainly focusing on children of six or more years of age [1, 2, 3, 4], investigating speech characteristics [5, 6] and employing a method for vocal tract length normalization (VTLN) [7, 2, 3, 8, 9]. Differences between adult and children speech are shifts in fundamental and formant frequency which can be explained anatomically. VTLN is employed to reduce the difference between adult and children spe巴ch features for recognition with an adult acoustic model. If enough children speech data is available, construction of a child-speci日c acoustlc model would be a di仔erent way overcome this problem. How­ ever, the main reason for a lower recognition performance is the unstable speech production process (articulatio日) of speech sounds by children. Spectral and cepstral characteristics of dif­ ferent realizations of the same speech sound of during a certain realization vary remarkably [5]. Cons巴quently, in this paper as an additional approach, the number of possible pronunciations per word is increased in order to cope with different acoustic realizations of the same word. There are only a few examples for building practical sys-. 2. Real-Environment Speech Database Spontan巴ous Japanese children speech from th巴 Takemaru database is employed. Takemaru-kun [11] is a speech-oriented dialogue system intended to provide the user information on the weather, news, the surrounding environment, public trans­ portation system, Internet pages, etc. The system is very popu­ lar among children, b巴cause it is based on an animated char­ acter. It is a working system installed in a public place in Nara, Japan. The system collected more than 700,000 speech and noise inputs since November 2002. All data recorded during the 自rst two y巴ars are completely transcrib巴d, labeled with tags (巴 g. noise) and classi自ed subjectively into speaker groups: preschool children, elementary school children, junior high school children, adults and elderly persons目 Preschool and school children speech data employed for experiments is shown in Table 1. The evaluation data set was selected so that there is one randomly drawn utterance each for the most frequent 1,000 transcnpts.. 1469. -145 -. August 27-31, Antwerp, Be1gium.

(2) AM Construction LM Construction ASR Engir】E Acoustic恥10del Acoustic Features Language Model. HTK 3.2 [14] SRILM 1.5.0 [15] Julius 3.5 [16] PTM [17], 2,000 states, 8,256 Gaussians 12 MFCCムMFCC. Ó. E LR 2-gram, RL 3-gram,Kneser-Ney. 2.1. Preschool Children Data. The preschool children speech data is remarkably di仔erent from the speech data of lower grade and higher grade school children. In a preliminary evaluation experiment the recognition perfor­ mance was 76.9% for higher grade school children and 75.2% for lower grade school children, but only 45.5% for preschool children even when employing an acoustic model train巴d on collected data of each corresponding speaker group. Different pronunciation of words which appear randomly, a lower speak­ ing rate and differences in sentence construction and word us­ age are the three most pr巴valent characteristics. There are also differences among preschool children, since they usually grow up in a different environment (di仔"erent parents).. 3. Baseline System The m勾or components of the speech-oriented guidance system are a speech recognizer and a response generation module as shown in Figure 1. The acoustic model is a context-dependent phonetic-tied mixture (PTM) model. The initial model is con structed from the JNAS database [13] and then r巴trained with all available children speech data (cf. Table 1). The baseline lan­ guage model has been built from one million sentences of text data collected from the web, transcriptions of about 17,000 chiト dren utterances and about 6,000 questions devised by humans The vocabulary size is about 42k and the pronunciation dictio­ nary contains only one possible phoneme sequence per word Other experimental conditions are uniform. They are given in Table 2.. 4. System Optimization The baseline system has been built using relatively few data af­ ter the system has been installed in a public place. In the follow-. 2 3 4 5 6. (Preschool) (Lower Grade) (Higher Grade) (Pre)+(Lower) (Pre)+(Lower)+(High) (Preschool/Std). I Ik 51k 17k 62k 79k l lk. 7 8. (Pre/Std)+(Select) (Pre)+(Select). 21k 21k. y 日 二 、 一 日 -4,‘ Fi 円 万 K一 一. Table 2: Experimental Conditions.. Ba犯lin巴. C =5 一. Figure 1: Th巴 speech-oriented guidance system and its m勾or components. Mode印刷. 4 = 1 一 r j ia. 円百I I 0I. K L = L r 一 ツ 石 一 E 一 一 H u 司 j e 一 U= = 一品作. Table 3: Different Acoustic Models, Training Data Size and Recognition Performance. Word correctness (Corr.) and word accuracy (Acc.) in percent. 54.0. 49.0. 51.2 37.8 52.5 51.4. 44.2 28.6 45.9 44.9. 55.2. 49.8. 57.6. 51.9. 56.7. 50.3. ing it is shown how the system performance can be optimized for preschool children, when two years of real-environment children speech data b巴come available. 4.1. Acoustic ModeIing. Several acoustic models built by retraining the baseline model on different children speech data sets are considered. The re­ sult 凶given in Table 3. The best performance is naturally ob­ tained for the matched case (1), i.e. when training the initial model with preschool children data. When employing data from lower grade (2) or higher grade school children (3) the recog­ nition performance degrades by about 5% and 20% absolute, respectively. Adding lower (4) and higher (5) grade data to the preschool children data has also negative e仔ects. Training with preschool children data when using standardized transcriptions (6) improves the performance slightly in comparison to when using speech sound conform transcriptions. This is also natural, since the referenc巴 transcriptions were also normalized w.r.t. standard pronunciation. Instead of employing the whole children speech data p∞l for training, selection of speech data acoustically cJose to preschool children data is conducted 】n advance. For data selec­ tion a previously proposed greedy ML algorithm for utterance­ based selection of task-speci自c data is employed. The method has been successfully applied for a building a task-spec泊c acoustic models in supervised and unsupervised manner [18, 19]. The a1gorithm starts with the whole children speech data pool and successively tests each utterance for del巴tion. A train­ ing utterance is removed from the data p∞1, if its independent deletion improves the retrained model's likelihood for a sep­ arate data set of preschool children data. When iterating th巴 selection algorithm two times, 16% or about 10k utterances of the data pool (Iower and higher grade children data, 62k utter­ ances) are selected. Combining the selected and the preschool children data, performance improves to 51.9% (7), which is 23% higher than when using only preschool children data (6) and 5-7% higher than when employing the whole children speech data pool (5) for training 4.2. Language ModeIing. The baseline language model is trained on web data and tran­ scriptions of several thousand children utterances. The perfor­ mance of this model is compared to five language models which are trained on transcriptions of children utterances only. Acous­ tic model (7) from the previous section is employed for this. O UF n p j A吐 A斗 司14.

(3) Table 5: Examples for pronunciation mapping rules. Table 4・Performanc巴 for language models train巴d on different data sets when using acoustic model (7).. |ID I |1 I 2 I. Type sub sub. ModellData Web-based Model. 10S. (P陀) (P印)+(Lo岬) (P陀)+(Low)+(High) (Pre/Std) (Pre/Std)+(Lo岬). sub del sub. de de N N ba de. Source su ka su ka . ch i 1 sa ba s u. ka. de de N N ba de. Target sh u ka ka ch u N ch i sh a. .. q. ka. Fr巴q. 303 162 122 46 43 40. Table 6: Application of pronunciation conversion rules ranked by absolute or relative appearance frequency. Acoustic model (0) and language model (5) are employed for evaluation.. investigation. From Table 4 it is clear, that the transcription­ based language models (2)ー(6) perform better than the web­ based baseline model (1). It has a positive effect when adding lower grade (3) and higher grade school children data (4) to the preschool children data (2) for model training. The rea­ son is likely to be the fact, that in contrast to the lower and higher grade data, the preschool children speech transcripts con­ tain non-standard words. Consequently, a language model (5) trained on normalized preschool children data transcripts was examined finally which yielded the best performance. Here, adding lower grade (6) data to (5) did not improve the perfor­ mance further.. unfamiliar with the Japanese writing system, only phoneme se­ quences are shown. The source part of each mapping is compared with every word entry in the pronunciation dictionary. For each rule and word matching one extra pronunciation variant is added to the pronunciation dictionary. The effect of applying only the k highest ranked conversion rules is investigated in the following experiment. Conversion rules may be rank巴d either by their ab­ solute or relative appearance frequency. For performance evalu­ ation, acoustic model (0) and language model (5) are employed. The result is shown in Table 6. When ranking the pronunciation rules by their absolute (re卜 ative) appearance frequency, a peak in performance seems to be reached after 400 (600) pronunciation mappings have been applied. There is an absolute improvement of up to 3.5% in comparison to when using only a standard pronunciation dic­ tionary showing the eff,巴ctiveness of pronunciation modeling However, there was no significant increase in performance by editing the pronunciation dictionary when employing acoustic models (6,7). This is due to the fact that the preschool chil­ dren's idiosyncratic way of pronouncing words has already been “learned " by the acoustic model, since normalized utt巴rance transcriptions hav巴 been employed during their training. 4.3. Pronunciation恥fodeling. The purpose of pronunciation modeling is to be able to recog­ nize the normalized form of words although their pronuncia­ tion has been altered by the preschool children in an idiosyn­ cratic way. For example, preschool children pronounce 'Take­ maru' as 'Tachimaru', 'Takebaru' or 'Takemau'. Nevertheless, all three pronunciation variants should be recognized as 'Take­ maru'. Therefore, it is proposed to add these pronunciation vari­ ants for the word 'Takem訂正in the pronunciation dictionary used during recognition. In order to understand the idea of the proposed approach in more detail, the reader should be aware, that in the Japanese lan­ guage there is a one-to-one coπespondence between graphemes and speech sounds, i.e. a bijective mapping between Japanese characters and phoneme pairs can be established. Except a few Japanese characters for th巴 'N' sound and the自ve vowels 'a', 'i', 'u', 'e', '0', each character of the Hiragana/Katakana writ­ ing system is pronounced as a consonant followed by a vowel. Hiragana/Katakana are the Japanese “alphabets " by which any Japanese word can be written as it is spoken. (Besides from these “alphabetsヘthe Japanese language uses Chinese charac­ ter, the so-called Kanji to remove meaning ambiguities). In this paper pronunciation modeling is considered at the Hira­ gana/Katak加a level. The preschool children training data is transcribed in two ways. 4.4. Response Accuracy. Since the application consid巴red in this paper is a guidance sys­ tem, it is imperative to evaluate the system's response accu­ racy in order to see whether an improvement in speech recogni­ tion accuracy transfers to a increase in response accuracy. Re­ sponse generation is based on a question imd answer data base (QADB), which contains Q&A pairs for approx. 300 responses. The questions are taken from the transcripts of the collect巴d speech data, the corresponding answers are assigned manually by humans. Three QADBs ar巴 considered here: (Presch) based on the collect巴d preschool children data, (Child) built from all available children data and (Closed) which contains only the Q&A pairs co汀esponding to the transcription of the test data The evaluation result is shown in Table 7. Updating either the language or the acoustic model has equally high positive effects. 1. Transcription of how words were actually pronounced 2. Transcription of each word's standard pronunciation Context-sensitive pronunciation mappings are obtained from the forced alignment of the transcriptions (1) and (2) There a問substitution (3→3 characters), deletion (3→2 characters) and insertion (2→3 characters) rules. After dis­ carding rules with an absolute occurrence frequency of one, there were 1,295 substitution, 57 insertion and 35 deletion rules Extra substitution rules have been derived for word begin and end by ignoring either the left or the right context, respectively Several examples are given in Table 5. Since readers may b巴. ド. げ 幻 1.

(4) [5] S. Lee, A. Potamianos, and S. Narayanan, “'Analysis of Children's Speech: Duration, Pitch and Formants," in Proc. of Eurospeech, 1997, pp. 473-476.. Table 7: Response accuracy for different combinations of acoustic mod巴1,language model and pronunciation dictionary目. l. [6] Q. Li and M.J. Russel, “Why is Automatic Recognition of Children's Speech Difl白cult?," in Proc. of Eurospeech, 2∞1, pp. 2671-2674.. Baseline System. I AM(7) Update. [7] A. Potamianos, S. Narayanan, and S. Lee,“Automatic Sp巴巴:ch Recognition for Children:' in Proc. of Eurospeech, 1997, pp. 2371-2374.. LM(5) Update LM(5)+Dict. Update LM(5)+AM(7) Update LM(5)+AM(7)+Dict.. [8] A. Hagen, B. Pellom, S.Y. Vuuren, and R. Cole, “Ad­ vances in Children's Sp巴ech Recognition with an Interac­ tive Literacy Tutor," in HLT/NAACL, 2004.. in improving the system's response accuracy. When updating both models,the absolute improvement of their independent up­ date seem to add up. Editing the pronunciation dictior】ary has a positive effect, but it is 1巴ss promising than optimizing the acoustic model by training with normalized utterance transcrip­ tions and selecting additional training utt巴rances from lower and higher grade school children data. There were no further im­ provements when updating acoustic model and pronunciation dictionary at the same time. [9] D. Elenius and M. Blomberg, “Adaptation and Normal­ ization Experiments in Spee氾h Recognition for 4 to 8 Year Old Children," in Proc. of Eurospeech, 2∞5, pp. 27492752 [10] J. Gustafson and K. Sjölander,吋0】ce Transforrnation for Improving Children's Speech Recognition in a Publicly Available Dialogue System," in Proceedings of rhe In­ temational Conference on Spoken Language Processing,. 2002, pp. 297-300. [11] R. Nisimura, Y. Nisihara, R. Tsurumi, A. Lee, H. Saruwatari, and K. Shikano, “Takemaru-kun: Speech­ oriented Information System for Real World Research Platform," in Intemational Workshop on Language Un­ derstanding and Agents for Real '"伽ld Interaction, 2∞3, pp. 70-78.. 5. Conclusion In this paper automatic speech recognition and response gener­ ation for a real-environment speech-oriented guidance system has been investigated. For improving speech recognition accu­ racy of preschool children, acoustic, language and pronuncia­ tion modeling are considered. For language modeling, a model trained on the normalized transcriptions of more than ten thou­ sand preschool chiIdren utterances yielded the highest perfor­ mance. A previously proposed method for selective training of the acoustic model was applied successfully for augmenting the existing preschool children training data with an appropriate subs巴t of the school children data. Context-dependent pronunci­ ation modeling applied at the Japan巴s巴 'syllable' level was less promising. Finally, the response accuracy of the system was evaluat巴d. Improvements in r巴cognition accuracy transferred in the same degree to an improvement in respons巴 accuracy. In the end a response accuracy of mor巴 than 60% was achieved. [12] R. Nisimura, A. Lee, M. Yamada, and K. Shikano, “Op­ erating a public spoken guidance system in real environ­ ment," in Proc. of Eurospeech, 2∞5,pp. 845-848 [13] K. Ito, M. Yamamoto, K. Takeda, T. Takezawa, T. Mat­ suoka, T. Kobayashi,K. Shikano, and S. Itahashi,“JNAS: Japanese Speech Corpus for Large Vocabulary Continu­ ous Speech Recognition Research," The Joumal of the Acoustical Society of Japan, vol. 20, pp. 199-206, 1999 [14]“HTK. Speech. Recognition. http://htk.eng.cam.ac.uk/:. Toolkit,. [15] A. Stolcke,“SRILM・An Extensible Language Modeling Toolkit," in Proc. of ICSLP, 2∞2, pp. 901-904.. 6. Acknowledgments. [16]“Julius, an Open-Source Large Vocabulary CSR Engine -. A part of this work is supported by the MEXT e-Society pr句巴ct, Japan.. http://julius.sourceforge.jp/: [17] A. Lee,T. Kawahara,K. Takeda, and K. Shikano,“A New Phonetic Tied-Mixture Model for Efficient Decoding," in. 7. References. Intemational Conference on Acousrics, Speech, and Sig・. [1] s. Narayanan and A. Potamianos, “Creating Conversa­ tional Interfaces for Children," IEEE Trans. on Speech and Audio Processing, vol. 10, no. 2, pp. 65-78, 2002. nal Processing,. 2∞0, pp. 1269-1272.. [18] T. Cincarek, T. Toda, H. Saruwatari, and K. Shikano, “Selective EM Training of Acoustic Models based on Sufficient Statistics of Single Utterances," in Automatic Speech Recognirion and Understanding "ゐrkshop, 2005, pp. 168-173. [2] D. Giuliani and M. Gerosa,“Investigatir】g Recognition of Children's Spe配h," in Intemational Conference on Acoustics, Speech, and Siglwl Processing, 2003, vol. 2, pp. 137-140.. [19] T. Cincarek, T. Toda, H. Saruwatari, and K. Shikano, “Acoustic modeling for spoken dialogue systems bas巴d on unsupervised selective training," in Proc. of ICSLP, 2∞6, pp. I 722-1725.. [3] G. Stemmer, C. Hacker, S. Steidl, and E. Nöth, “Acous­ tic Normalization of Children Speech," in Proc. of Eu­ rospeech, 2003,pp. 1313-1316 [4] S.M. D' Arcy, L.P. Wong, and M.J. Russel, “Recognition of Read and Spontaneous Children's Speech Using Two New Corpora," in Proceedings of the Intemational Con­ ference on Spoken Language Processing, 2004. 1472. - 148 -.

(5)

Figure  1:  Th巴 speech-oriented guidance  system  and  its  m勾or components
Table  4・Performanc巴 for language models train巴d on different  data sets when using acoustic model (7)
Table  7:  Response  accuracy  for  different  combinations  of  acoustic mod巴1,language model and pronunciation dictionary目

参照

関連したドキュメント

A tendency toward dependence was seen in 15.9% of the total population of students, and was higher for 2nd and 3rd grade junior high school students and among girls. Children with

Information gathering from the mothers by the students was a basic learning tool for their future partaking in community health promotion activity. To be able to conduct

Regres- sion analyses of the sequence data for thermophilic, mesophilic and psychrophilic bacteria revealed good linear relationships between OGT and the dinucleotide com- positions

 The aims of this study were to explore the trends in research on support for the siblings of children with diseases/disabilities and discuss future challenges related to this topic.

The classical Ehresmann-Bruhat order describes the possible degenerations of a pair of flags in a linear space V under linear transformations of V ; or equivalently, it describes

We construct a Lax pair for the E 6 (1) q-Painlev´ e system from first principles by employing the general theory of semi-classical orthogonal polynomial systems characterised

This implies that a real function is realized by a stable map if and only if it is continuous, thus further leads to an admissible representation of the space of continuous

First we use explicit lower bounds for the proportion of cyclic matrices in GL n (q) (obtained in [9, 14, 20]) to determine a lower bound for the maximum size ω(GL n (q)) of a set