• 検索結果がありません。

Evaluation of Model Adaptation by HMM Decomposition on Telephone Speech Recognition

N/A
N/A
Protected

Academic year: 2021

シェア "Evaluation of Model Adaptation by HMM Decomposition on Telephone Speech Recognition"

Copied!
4
0
0

読み込み中.... (全文を見る)

全文

(1)1 g. a. h . c 怯 " u ny. a. . z ,K 中A "ug J 脇 "u • t ,也 中 4. EVA LUATION OF MODEL ADA PTATION BY HMM DECOl\在POSITION. 1. ON TELEPHONE SPEECH RECOGNITION Sato3hi Nakamural. Kiyohiro Shikanol. Ma3ato3hi Morúhima2. T03hihiro l3obe2. Graduate School ofInformation Science, Nara Institute of Science and Technology 8916・5, Takayama-cho, Ikoma-shi, Nara, 630-0101, JAPAN 2Laboratory for Information Technology, NTT DATA Corporation 66-2, Horikawa-cho, Saiwai-ku, Kawasaki-shi, Kanagawa, 210・0913, JAPAN E-mむ1: tetuy-t@is.aist-nara.ac.jp. ABSTRACT In this paper, we evaluate performance of model adap­ tation by the previously proposed HMM decomposition method[11 on telephone speech recognition. The HMM de­ composition method separates a composed HMM into a known phoneme HMM臼d姐 unknown noise and chan­ nel HMM by maximum likelih∞d (ML) estimation of the HMM parameters.. A transfer function (telephone chan­. Soundpr∞rr∞m. nel) HMM is estimated using adaptation speech data by applying the HMM decomposition twice in the linear spec・ tral domain for noise and in the cepstral domain for chan­ nel. The telephone speech data for evaluation are recorded through 10 kinds of ordinary analog telephone handsets and cordless telephone handsets.. The test resuIts show. that the average phrase aιcuracy with the clean speech. Figure 1: Recording condition of telephone speech. cord!ess telephone handsets in a soundproof room, through the public telephone network as shown in Figure 1.. HMMs is 60.9% for the ordinary 回alog telephone hand­. 2.. sets,日d 19.6% for the cordless telephone handsets. By the HMM decomposition method, the average phrase accuracy is improved to 78.1% for the ordinary analog telephone handsets, 回d 50.5% for the cordless telephone handsets.. TELEPHONE SPEECH DATA. Figure 1 shows the recording condition of the telephone. speech. Utterances írom 60 speakers in the ASJ(Acoustical Society of Japan) continuous speech database 訂e out­ putted through a mouth simulator, and inputted into 10. 1.. kinds of the ordinary日alog telephone handsets and the. INTRODUCTION. cordless telephone handsets in the soundproof room. Then,. Many methods have been proposed to cope with prob­ lems日used by additive noise祖d convolutional distortion in robust speech recognition.. Speech enhancement and. model compensation approaches are two co=on examples among them. For the speech enhancement approach, spec・ tral subtraction for additive noise and cepstral me回 nor­ malization(or signal bias removal) for convolutional distor­ tion have heen proposed (e・g., [2, 3, 4, 5, 61). For the model compensation approach, conventional muIti-template tech­ nique, model adaptation (eι[12, 131) as well as model. (de-)composition melhods (e.g., have been developed.. [7, 8, 9, 10, 11, 14, 15]). their speech are recorded through the public telephone net­ work. Ten kinds of telephone handsets are CANON (CF­ H1CL), KENWOOD (IS-W757), NEC (Speax23 CL), NTT (CP-D40), PANASONIC (VE-D67L-K), PIONEER (TF­ JP50), SANYO (TEL-L710), SHARP (CJ・H7・B), SONY (SPP-A600) 回d VICTOR (TN・DJ1・B). Each telephone handset consists of an ordinary analog telephone handset 姐d a cordless telephone handset. Figure 2 shows the log power spectrum of the clean speech and the telephone speech, which are digitized at回 8kHz sampling rate. In the case of the speech through the cord­ less telephone handset, the shape over 3kHz is distorted.. In our previous paper [11. we proposed臼 HMM decom­ position melhod. The HMM decomposition method deals with the model parameter instead of the series of the ob­. The SNRs of the ordinary 祖alog telephone handsets回d the cordless telephone handsets are 25.1dB and 20.3 dB, respectively. Their SNRs are calculated by. served signal, and estimates the HMM parameters based on maximum likelihood (ML) estimation. Its effectiveness. SNR. is confirmed by the word recognition experiments on the real distant-talking speech.. ( t). �. t )2 t 2::=1-' ' ­ ' 0(OIU 土 乞 二I n(t)2 '. 10101!.内. This paper reports the performance of the HMM decom­. wnere. position method on the telephone speech recognition. The. noise at time t, respectively.. telephone speech data for evaluation are recorded using 10. total frames of speech data and the number of total frames. kinds of the ordinary analog telephone handsets and the. of noise data, respectively.. ICSLP'98. o. andπ(t) denote the observed speech and the. 1 and. m. are the number of. 1507 123.

(2) w 川. 目 』UE a s D色 国 邑 F Z S 邑 百一E g 司一E 国. and the observed noise a.t frame m and frequency ω, re ­ spectively. H(ω;m) 回d H'(ω; 711) are transfer function. Accordingly, a. composed HMM of lhe observed speech io. Clean speeeh. tbe linear spectral doma.in is represented by. 'W・』叫r山『ム. ,ム・�. Frequency. <41くHz:). slral doma.in: λH..�. u_ 田町一. 品位」“. sine transform of the distribution function and logarithm. <4KHz). transform of the distribution function, respective1y. equation. mated in noisy environment. First1y, the HMM decompo­. sition method is applied in the linear spectral domaio to. estimate the telephone speech HMMs which are free from the inßuences of noises.. <4KHz:). the transfer function HMM. The proccdure is summarized in Figure 1.. N1仁川. 4.1.. About 7500 sentcnces from 25 males 回d 25 females are. used for the training. Five males阻d five females for tbe. speech. testing are not used in the training. Ea.ch testiog speahr. utters only one sentcnce for adaptation for ea.ch ba:ロdset.. Figure 3: Eovironmeol model for lelephone speech. We cb∞se 55 context independent phonemes as the c1eao. HMM DECOMPOSITION. speecb unils. Each phoneme is modeled by a single left-tcト. The HMM decomposilio口 mclbod separales a composed HMM iolo a knowo phooeme HMM回d日 unkoowo noise. 日d cba.noel HMM by ma.ximum likelibood (ML) eslima­ lion of lbe HMM p訂畑山sl1l.. Figu同3 s�ows ao envirooment model for tbe lelcpbone. speech. The observed speech O(ω;m) is町Jfesenled by. wh er巴 N(ω;m). right 3-state lied-mixture HMM with 3 self-lransitioo loop5. and wilbout sta.le skipping.. Sixtccn mel-frequeocy cep­. slral coefficienls(MFCC) witb tbeir日rst order differe日lials. (企MFCC), a:nd the firsl order differentials for normalized. logaritbmic �iteríù' (ð.powcr) are calculaledぉtbe obser. vation vector for each fra.mc.. There are total 256 Gaus­. sian mixture componenls wilh diagonal covaria.nce malri­. ces shared by all of the models for MFCC叩d ð.MFCC,. {S(ω;m) + N(BG)(ω; m)}' H(ω;m). respcctively.. There are 61 Gaussian mixturc compo目白l5. sbared by all of the models for ð.power.. +N(CH)(ω;m) . H'(ω;711) S(ω; m) ' H(ω; m) + N(ω;m),. For eovironment adaplation, a siogle Gaussian is employed to model the noise and lhc traosfer fUDclion. Ooly. mを却. vector is eslimated for the traロsfer function in tbis exper­ =. N(BG)(ω;m)'H(ω; m)+N(clI)(ω;日).H'(ω;m).. S(ω;m), N(BG)(ω;m), N(CH)(ω;m), and N(ω;m) de口ole lhe clea.n speech, lhe background noise, lhe channel ooise. imenl.. �. The phrase recognilion experiment is carried out using COD­ 7 tiouous seロtence specch. Each sentencc includes 6. 1508 124. Experimental condition. phone speech data which we described in the section 2. ー一、 一了. =. EXPERIMENTS AND RESULTS. All cxperimenls in this pa.per are conducted on tbe tele­. I H' I N{B GI下( --〆:、一一I「τ了寸 /十、 作一一一一ー 0 H Iーーイ ) � r1リ t 」一一_) Ob5e円ed 51耳目h s. =. Then, tbe. HMM decomposition method is applied again to estimale. 4.. 。(ω;m). Thc obta.ined telephone speech. HMMs are converled to the cepslral domain.. Tran5fer funcùon. The. shows that the HMM decomposition melhod. is applied twice in the linear spectral doma.in and in the. Figure 2: Log power spectrum /u/. Chll1nel noi�e. (2). cepstral doma.in, where the transfer function HMM is esti­. 1.........1._......1. Frcqucncy. e denotes a model. decomposition procedure. Cos-I and Log are inverse co­. ifvh旬、州内J刊i. 3.. (2). Cos-I {Log(λSH+N e入Nr;J)θλS.<�,. where cep and lin denote the cepstral doma.in and lbe. Cordlcss telephonc. Clell1. =. linear spectral doma.in, respectively.. Frcquency. 01日. (1). According to the equa.tion (1), the estimation equation of tbe transfer function HMM is writlen as follows in the cep­. Ordinary telephone. l由......�....w. ー .......,.ι. B.ckground. Exp{ Cos (Às,,� EÐ入Il,,�)} EÐλNII"'. where λ日d EÐ denote a set of model para.meters阻d a. model composition procedure, respeclively. Exp and Cos are exponential transform of the distribution function a.nd cosinc transform of tbe distribu tion function, respectively.. 国 o _;). 宝山. =. λSH+N.

(3) lnitial. HMM. Oc�p. Cepstral domain. ).SH ーイ).SH) =釘gmax P(λs;+N lλSH.λN) λSH : 一 Decomposition : of. .‘ーーーーーーーーーーー------,. Decomposition of. Power spectral " - Ñ- i ""edomain os. � ー: i、 HMM2J. ).H. ,. Cepstral domain. λH=紅g一日 P (λ�+H Iλs' ).H) ^H. Clean ‘. 1. L HMM As j. Figure 4: Parameter estimation by HMM decomposition. Model. l. HMM-TELE( ordi・. (ordinary tele.) nary and co吋less). Noise compensation Channel compensation. 72.7. Phrase accuracy. Model Noise compensation Channel compensation. 60.5. Phrase accuracy. phrases on average. ln this task, the ASJ database is di­. HMM-S and the noise HMM, is improved to 70.1% for. vided into 10 subsets. Each subset consists of 50 sentences. the ordinary analog telepbone handsets,回d 30.3% for the. except one subset which consists of 53 sentences. One typ­. cordless telephone handsets.. ica1 subset of this task is 323 phrases with a phrase per­. composition method twice in the linear spectra1 domain. plexity of 323 on average. Each speaker utters 3 subsets. 祖d in the cepstra1 domain, HMM-SHN, the phrase ac­ curacy is improved from 60.9% to 78.1% for the ordinary. through one telephone handset.. 4.2.. 回a10g telephone handsets, and from 19.6% to 50.5% for. Experimental results. the cordless telephone handsets with one adaptation sen­. The points to be investigated are: •. tence.. improvement of recognition rate by the HMM compo­ sition and the HMM decomposition method,. •. Table 1 and Table 2 a1so include the average phrase accu­ racy for 10 kinds of the telephone handsets in the matched. comparison with cepstra1 me回 subtraction(CMS).. condition.. The HMM phonemes, HMM・TELE(ordinary. tele.)むe trained by the sp田ch data through 10 kinds 。f the ordinary ana10g telephone handsets.. 回d •. By applying the HMM de­. comparison with matched condition.. speech data through 10 kinds of the cordless telephone. Table 1 and Table 2 show the average phrase accuracy[%J for 10 kinds of the ordinary analog telephone handsets組d the cordless telephone handsets, respectively. The phrase accuracy with the Clean HMMs(indicated as HMM・S) is 79.2% for the clean speech. The telepbone speecb, .how­ ever, decreases tbe pbrase accuracy to 60.9% for the ordi­ nary analog telephone handsets, and 19.6% for the cordless telephone handsets.. hl!.Ddsets.. The HMM phonemes, HMM-TELE(ordinary. 臼d cordless), are trained by the speech data through 10 kinds of the ordinary ana10g telephone handsets and the cordless telephone handsets.. The phrase accuracy with. the HMM-TELE(ordinarytele.) is 77.7% for the ordinary ana10g telephone handsets. The phrase accuracywith the HMM-TELE( cordless tele.) is 61.0% for the cordless tele­ phone handsets. On the other hand, the phrase accuracy with the HMM-TELE(ordinary and cordless) is decreased. The phrase accuracywith the HMM-SN, composed of the. ICSLP'98. The HMM. phonemes, HMM-TELE(cordless tele.), are trl!-ined by the. to 72.7% for the ordinary回a10g telephone handsets, and. 1509 125.

(4) would be necessary for the cordless telephone speech.. n -n ­ O 一O 一. a 一a -. nr -n y一 一 一吋 d. 回d 61.0% for the cordless telephone handsels. Therefore, the furtber improvement of the HMM a.dapta.tion method. 6.. REFERENCES. 1. T.Takiguchi, S.Na.kamura., Q.Huo, K.Shikano, UAdap­ lation of Model Pa.rameters by HMM Decomposition. in Noisy Reverbera.nt EnvironmentsヘESCA-NAT O Work3hop on Robu3t 5peech R ecognition for Uπkno凹n Communica.tion Cha.nnelJ, pp.155-158, Apr_1997.. 2. S.F.Boll,“Suppression of a.coustic noise in speech us­ ing spectral subtra.ctionぺIEEE TranJ. on A5SP, VoJ. ASSP・27, No.2, 1979_. 60.5% for the cordless telephone handsets. This is caused. by the mismatched condition between the ordinary 回a10g telephone handsels and the cordless telcphone handsets.. 3. B.Atal, “Effectiveness of linear prediction cha.racteris­ tics of the speech wave for automatic spea.ker identi­. fication and verification", J_ ACOUJt. 50c. A mer., VoJ.. Table 3 shows the comparison with CMS. Whcn lhe HMM. 55, pp_1304-1312, 1974.. decomposition method is applied once in the cepstral d仔 main(indicated as HMM-SH), the phrase accuracy is de­. 4. A.Acero,. ECE Depa.rtment, CMU, Sept. 1990.. sets. ln the CMS・based testing case, the HMM phonemes are trained by the CMS-proc民sed c1ean speech data. By subtracting each cepstral me 日 value from each tesling. 5. M.G_Ra.him 回d B.-H_Ju祖g, USignal bias removal by ma.ximum likelihood estimation for robust telephone. data(a.daptation 1), the phrase accuracy is 74.7% for the. speech recognitionぺ IEEE TranJ. on Speech I1nd 'A u­. ordinary 日a10g telephone handsets, 回d 42.0% for the cordless telephone handsets. To compa..re with the result of HMM-SH, we attempt to subtract the cepstral me問。f. dio ProceJJin 9 , Vol. 4, No. 1, pp.19・30, 1996.. 6. M.Morishima, T.Isobe, N_Koizumi, UPhonetica.lly BaI­ 回ced Cepstrum Mean Norma.1iza.tion", ACOUJt. 50c.. the same a.daptation data from the testing data(adaptation. Amencl1 I1nd Acout. 50c. Jl1pl1n Third Joint Meetin 9,. 2). The phrase accuracy is 72.6% for the ordina..ry analog. pp.ll05-1108, Dec. 1996_. telephone handsets, 回d 38.6% for the cordless telephone handsels. These results show lhat lhe result of CMS is bet­ ter than that of HMM-SH(without decomposition of noise. 7. S.Nakamura, T.Ta.kiguchi, K.Shikano, uNoise a.nd room acoustics distorted speech recognition by HMM compo­. HMM) in noisy telephone channeJ.. sition", Proc. ICA55P・96, 1996, pp.69・72.. Table 4 shows lhe comparison with the matched condilion. 8_ A.P.Varga., R.K.Moore, “Hidden Ma.rkov model decom­ p osition of speech and nois♂, Proc. ICA5SP・90, 1990,. for one ordina..ry.analog telephone handsel. ln lhe cぉe of. pp.845-848.. the HMM-TELE(malched handsel) which a..re lrained by the speech through only one kind of lhe ordinary analog telepbone handset, the performance is 86.6% for the same. 9. M.J_F.Gales,. accuracy with the HMM-SHN is 80.1% for tbe same阻a10g telephone handset with one adaptation sentence. There is. nOlseぺ Proc. ICA55P-9!!., 1992, pp.233-236_ 10. M.J.F.Gales, S.J.Young, “PMC for sp明ch recogni­ tion in additive and coovolutional noise",. the di恥rence of the phrase accuracy between the HMM decomposition a.nd the matched condition.. CUED-F­. lNFENG-TRI54, 1993.. 11. F.Marlin, K.Shikano, Y.Minami, URecognition of noisy. speech by composition of hidden Ma..rk ov modelsぺ. CONCLUSION. We have evaluated the performa.nce of the model adap­ ta.tion by the previously proposed HMM decomposition metbod[I) on the telephoロe speech recognition.. Proc. EUROSPEECH-93, 1993, pp.1031・1034. 12. A. S祖ka..r 回d C.-H. Lee,. The av­. HMMs is 60.9% for tbe ordinary analog telcphone hand­. uRobust speech recogni­. tion based on stochastic matching", Proc. ICAS5P-9S, 1995, pp_121-124.. erage phrase recognition accuracy with the clean speecb. 13. V.Abrash, A.Sa.nkar, H.Franco, M.Cohen, “Acoustic. sets,祖d 19.6% for the cordless telephooe handsets. The. adaptation using transforma.tions of HMM parame­. average pbrase recognition accuracy with the CMS HMMs. tersぺProc . ICASSP・96, 1996, pp.729-732.. is. 74.7%. for the ordinary analog telephone handsets, 回d. 42.0% for tbe cordless telepho口e handsets. By the HMM. 14. Y.Mina.mi, S.Furui, “A maximum likelihood procedure for a universa.l ada.ptation metbod based o n HMM com­. decomposition metbod, the average phrase recognition ac­ curacy is improved to. 78.1 %. for the ordina..ry analog tele­. pbone handsets,回d 50.5% for tbe cordless telepbone band­. p osition", Proc. ICASSP-9S, 1995, pp.129・132. 15. Y.Mina.mi,. sets. These results sbow tbe HMM decomposition method is able to improve the performance.. However,. in tbe. matcbed condition, tbe average pbrase recognition accu­ racy is. 77.7%. for the ordina..ry a.nalog lelephone handsels,. 15 10 126. S.J.Young, “An improved approacb to. the hidden Ma.rkov model decomposition of speech and. ordinary日a10g telephone handset. However, the phrase. S.. "Acoustical and Environmental Robustness. in Automatic Speech Recognition", Ph.D Dissertation,. creased to 68.6% for the ordinary analog telephone hand­. S.Furui, “Adaptation method based. 0目. HMM comp osition and EM a1gorithm", Proc. ICASSP96, 1996, pp.327-330..

(5)

Figure  1:  Recording  condition  of  telephone  speech
Figure  2:  Log  power  spectrum  /u/
Figure  4:  Parameter  estimation  by  HMM  decomposition

参照

関連したドキュメント

[18] , On nontrivial solutions of some homogeneous boundary value problems for the multidi- mensional hyperbolic Euler-Poisson-Darboux equation in an unbounded domain,

Kilbas; Conditions of the existence of a classical solution of a Cauchy type problem for the diffusion equation with the Riemann-Liouville partial derivative, Differential Equations,

Since the boundary integral equation is Fredholm, the solvability theorem follows from the uniqueness theorem, which is ensured for the Neumann problem in the case of the

Next, we prove bounds for the dimensions of p-adic MLV-spaces in Section 3, assuming results in Section 4, and make a conjecture about a special element in the motivic Galois group

Transirico, “Second order elliptic equations in weighted Sobolev spaces on unbounded domains,” Rendiconti della Accademia Nazionale delle Scienze detta dei XL.. Memorie di

To derive a weak formulation of (1.1)–(1.8), we first assume that the functions v, p, θ and c are a classical solution of our problem. 33]) and substitute the Neumann boundary

This paper gives a decomposition of the characteristic polynomial of the adjacency matrix of the tree T (d, k, r) , obtained by attaching copies of B(d, k) to the vertices of

While conducting an experiment regarding fetal move- ments as a result of Pulsed Wave Doppler (PWD) ultrasound, [8] we encountered the severe artifacts in the acquired image2.