
A Non-Iterative Model-Adaptive E-CMN/PMC Approach for Speech Recognition in Car Environments


Academic year: 2021


M. Shozakai, S. Nakamura and K. Shikano
Graduate School of Information Science, Nara Institute of Science and Technology
Ikoma, Nara, 630-01 Japan
E-mail: (m-shozak | nakamura | shikano)@is.aist-nara.ac.jp
ESCA. Eurospeech97. Rhodes, Greece. ISSN 1018-4074, Page 287

ABSTRACT

This paper investigates Cepstrum Mean Normalization (CMN), which has been widely acknowledged as useful for compensating multiplicative distortions. However, the performance of the usual CMN is limited because normalization by a single cepstrum mean vector is not enough to compensate the many factors of multiplicative distortion in real environments. To solve this problem, a new method, E-CMN, is proposed. The method estimates two cepstrum mean vectors for each speaker, one for speech and the other for non-speech, and subtracts them from an input cepstrum. This method is capable of compensating various kinds of multiplicative distortion collectively to normalize input spectra. Furthermore, a new model-adaptive approach, E-CMN/PMC, based on E-CMN and HMM composition, is proposed for environments with additive noise and multiplicative distortions. This method is simplified in the sense that speech models and an additive noise model can be added without any iterative operations. Matching gains for all frequency bands of the speech models to the noise model are uniquely estimated as a cepstrum mean vector for speech. The performance of E-CMN/PMC in adverse car environments is finally evaluated.

1. INTRODUCTION

The need for speech interfaces for facilities like car navigation systems, and for mobile computing devices like Personal Digital Assistants, is stimulating research and development into speech recognition technologies in adverse environments. The drastic drop in performance that occurs in real environments is widely acknowledged to be due to multiplicative distortion and additive noise. For additive noise, speech-enhancement approaches such as spectrum subtraction and model-adaptive approaches such as HMM composition[7][8] have been proposed. This paper proposes E-CMN (Exact CMN) for compensation of multiplicative distortions, and proposes E-CMN/PMC for compensation of both multiplicative distortions and an additive noise. E-CMN has two steps: an estimation step to calculate one cepstrum mean vector for speech frames for each speaker and another cepstrum mean vector for non-speech frames for each environment, and a normalization step to subtract the cepstrum mean vectors from the input cepstrum vectors. The new non-iterative model-adaptive E-CMN/PMC approach is realized by combining E-CMN and PMC. We present results obtained from using E-CMN and E-CMN/PMC for speech recognition tasks in car environments.

2. MODELING MULTIPLICATIVE DISTORTION AND ADDITIVE NOISE

The long-term average of the short-term spectra $S(\omega;t)$ at frequency $\omega$ and time $t$ in the speech frames is called the speaker personality and is defined as

$$H_{Person}(\omega) = \frac{1}{T}\sum_{t=1}^{T} S(\omega;t) \tag{1}$$

where $T$ is a sufficiently large natural number. The speaker personality may be considered to represent a frequency characteristic which depends on the speaker's vocal tract and vocal cords. The normalized speech spectra are defined as

$$S^{*}(\omega;t) = S(\omega;t) / H_{Person}(\omega) \tag{2}$$

The short-time spectra $S(\omega;t)$ are interpreted as the output generated when the normalized speech spectra $S^{*}(\omega;t)$ pass through a time-invariant filter of gain $H_{Person}(\omega)$, which is a multiplicative distortion to $S^{*}(\omega;t)$. In reality, we may find three kinds of multiplicative distortion for $S^{*}(\omega;t)$ in addition to $H_{Person}(\omega)$[1]:

(1) speaking style $H_{Style(N)}(\omega)$: frequency characteristics peculiar to speaking styles (speed, loudness, Lombard effect etc.) which are affected by an additive noise,
(2) acoustical transmission characteristics $H_{Trans}(\omega)$: spatial frequency characteristics from mouth to microphone, and
(3) microphone characteristics $H_{Mic}(\omega)$: frequency characteristics of the microphone.

If we assume that speech and noise are additive in the linear spectrum domain, the observed spectra $O(\omega;t)$ are modeled as

$$O(\omega;t) = H^{*}(\omega) \cdot S^{*}(\omega;t) + \tilde{N}(\omega;t) \tag{3}$$

$$H^{*}(\omega) = H_{Mic}(\omega) \cdot H_{Trans}(\omega) \cdot H_{Style(N)}(\omega) \cdot H_{Person}(\omega) \tag{4}$$

$$\tilde{N}(\omega;t) = H_{Mic}(\omega) \cdot N(\omega;t) \tag{5}$$

where $N(\omega;t)$ is an environmental additive noise.

3. E-CMN

The CMN[2] has been widely used for compensating microphone characteristics.
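As a minimal illustration of why subtracting a cepstrum mean compensates a time-invariant multiplicative distortion, the sketch below applies a fixed gain $H(\omega)$ to a synthetic spectrogram and compares the mean-normalized cepstra of the clean and distorted versions. It assumes no additive noise and takes the cepstrum to be a plain DCT of the log spectrum; the function names `cepstrum` and `cmn` and the random test data are hypothetical and not taken from the paper.

```python
import numpy as np

def cepstrum(spec):
    """Cepstrum as a DCT-II of the log spectrum; spec has shape (frames, bins)."""
    n_bins = spec.shape[1]
    k = np.arange(n_bins)[:, None]          # cepstral index
    n = np.arange(n_bins)[None, :]          # frequency-bin index
    dct = np.cos(np.pi * k * (2 * n + 1) / (2 * n_bins))  # (n_bins, n_bins)
    return np.log(spec) @ dct.T

def cmn(cep):
    """Conventional CMN: subtract one mean vector computed over all frames."""
    return cep - cep.mean(axis=0, keepdims=True)

rng = np.random.default_rng(0)
clean = rng.uniform(0.1, 1.0, size=(200, 16))   # synthetic S(w;t), strictly positive
gain = rng.uniform(0.5, 2.0, size=16)           # hypothetical time-invariant H(w)
distorted = clean * gain                        # O(w;t) = H(w) * S(w;t), no additive noise

# The gain becomes a constant offset in the cepstrum, which the mean removes.
max_diff = float(np.max(np.abs(cmn(cepstrum(clean)) - cmn(cepstrum(distorted)))))
```

Once additive noise is present, as in equ.(3), the log spectrum no longer splits into a constant term plus the speech term, so this exact cancellation does not hold.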

Recently, it has been suggested that calculating cepstrum mean vectors separately for speech and non-speech gives better performance than calculating one common cepstrum mean vector[3][4]. Equ.(3) justifies this conclusion, because the multiplicative distortion in speech frames and the multiplicative distortion in non-speech frames are different. At the same time, equs.(3) and (4) suggest that CMN can be interpreted as a method to normalize absolute speech spectra by the product of the four kinds of multiplicative distortion. It should be noted that the speaker personality cannot be isolated from the product of multiplicative distortions. Furthermore, the speaker personality estimated from short utterances may vary depending on the phoneme balance. For these reasons, a new CMN(Type 4) method, which calculates two cepstrum mean vectors for each speaker separately, one for speech frames in sufficiently long utterances and the other for non-speech frames, seems to give better performance. We classify four variations of CMN in Table 1. Utterance-based means that the cepstrum mean is calculated utterance by utterance. Speaker-based means that the cepstrum mean is calculated from sufficient lengths of each speaker's speech.

The recognition task is speaker-independent recognition of 520 Japanese words with 54 context-independent tied-mixture HMM models, which are derived from a speech database (ATR Database C set) for 40 speakers. The acoustic analysis uses 8 kHz sampling, 32 ms frame length and 2[illegible] ms frame shift. The parameters are 10 MFCCs (Mel Frequency Cepstrum Coefficients), 10 Delta MFCCs and Delta energy. The numbers of shared Gaussian distributions are 256, 256 and 64 respectively. Six impulse responses from the mouth of a dummy head (Head and Torso Simulator TYPE4128 by B&K Inc.) to an omni-directional microphone, measured in a car cabin by the TSP (Time Stretched Pulse) method[5], are convoluted with the evaluation data (2 males and 2 females) as $H_{Trans}(\omega)$. Impulse responses are measured for the 6 combinations of Table 2 in the car environment shown in Fig.1. Fig.2 shows those measured impulse responses, which are normalized so that the maximum amplitude for combination 1 is equal to 1.0. No additive noise was added to the evaluation data in this simulation. The speech and non-speech frames are detected by an enhanced voice activity detection algorithm based on [6].

Table 2 Combinations.

| Combination | Mic. Position | Seat Position | Distance (cm) |
|---|---|---|---|
| 1 | A | X | 26 |
| 2 | A | Y | 34 |
| 3 | A | Z | 42 |
| 4 | B | X | 66 |
| 5 | B | Y | 71 |
| 6 | B | Z | 76 |

Mic. Position A: sun-visor at driver's seat; B: sun-visor at seat beside driver. Seat Position X: front; Y: intermediate; Z: rear.

Fig. 1 Positions of microphone and driver's seat.

Fig. 2 Measured Impulse Responses.

The recognition performances of no CMN, CMN(Type 1) and CMN(Type 2) are shown in Table 3. The recognition performances for CMN(Type 3) and CMN(Type 4) are shown in Table 4 and Table 5 respectively. For CMN(Type 3) and CMN(Type 4), the cepstrum means were averaged over 520 words, 50 words, 10 words and 5 words. We found that:

(1) CMN(Type 4) was the most effective. The cepstrum mean as a product of various multiplicative distortions can be estimated accurately with around 10 words.
(2) With no additive noise, high recognition performance was obtained regardless of variation in seat and microphone positions.
(3) CMN(Type 3) gives slightly worse performance than CMN(Type 4) in this evaluation task, where all speech data used in this experiment equally have 250 ms non-speech intervals at the beginning and at the end of the speech intervals.
(4) The poor performance for CMN(Type 2) arises because the cepstrum mean calculated from a short word utterance with an unbalanced phoneme distribution varies a lot utterance by utterance.
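The CMN(Type 4) scheme described above, with one cepstrum mean for speech frames and another for non-speech frames, can be sketched as follows. This is a minimal sketch under stated assumptions rather than the authors' implementation: the speech/non-speech decisions are supplied as a ready-made boolean mask (the paper obtains them from a voice activity detector based on [6]), the means are estimated and applied on the same synthetic data for brevity, and the names `ecmn_estimate` and `ecmn_normalize` are hypothetical.

```python
import numpy as np

def ecmn_estimate(cep, is_speech):
    """Estimation step: one mean over speech frames, one over non-speech frames."""
    speech_mean = cep[is_speech].mean(axis=0)
    nonspeech_mean = cep[~is_speech].mean(axis=0)
    return speech_mean, nonspeech_mean

def ecmn_normalize(cep, is_speech, speech_mean, nonspeech_mean):
    """Normalization step: subtract the mean that matches each frame's class."""
    means = np.where(is_speech[:, None], speech_mean, nonspeech_mean)
    return cep - means

rng = np.random.default_rng(1)
cep = rng.normal(size=(100, 10))      # hypothetical cepstra: 100 frames x 10 coefficients
is_speech = np.zeros(100, dtype=bool)
is_speech[25:75] = True               # hypothetical voice-activity decisions
cep[is_speech] += 3.0                 # give the two frame classes different offsets
cep[~is_speech] -= 1.0

s_mean, n_mean = ecmn_estimate(cep, is_speech)
normalized = ecmn_normalize(cep, is_speech, s_mean, n_mean)
```

In practice the two means would be estimated once per speaker and per environment from enough speech (around 10 words in the experiments reported here) and then applied to new input frames.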

Table 3 Recognition performance for no CMN, CMN(Type 1) and CMN(Type 2).

| Combination | no CMN | CMN Type 1 | CMN Type 2 |
|---|---|---|---|
| 1 | 80.1 | 90.8 | 88.4 |
| 2 | 75.9 | 90.2 | 88.0 |
| 3 | 72.3 | 89.7 | 87.1 |
| 4 | 82.9 | 87.9 | 83.9 |
| 5 | 83.2 | 87.7 | 82.1 |
| 6 | 80.5 | 87.3 | 81.9 |
| average | 79.2 | 88.9 | 85.2 |

Table 4 Recognition Performances for CMN(Type 3).

| Combination | 5 words | 10 words | 50 words | 520 words |
|---|---|---|---|---|
| 1 | 92.6 | 92.4 | 93.0 | 93.3 |
| 2 | 92.3 | 92.2 | 92.8 | 92.9 |
| 3 | 91.0 | 91.5 | 91.8 | 91.7 |
| 4 | 89.8 | 89.5 | 90.5 | 90.6 |
| 5 | 89.3 | 89.5 | 89.4 | 89.6 |
| 6 | 88.2 | 88.3 | 88.3 | 88.5 |
| average | 90.5 | 90.6 | 91.0 | 91.1 |

Table 5 Recognition Performances for CMN(Type 4).

| Combination | 5 words | 10 words | 50 words | 520 words |
|---|---|---|---|---|
| 1 | 92.3 | 93.3 | 93.0 | 93.1 |
| 2 | 92.9 | 93.4 | 93.9 | 93.8 |
| 3 | 92.0 | 92.9 | 93.2 | 93.0 |
| 4 | 90.4 | 91.4 | 91.2 | 91.4 |
| 5 | 89.6 | 90.1 | 90.3 | 90.3 |
| 6 | 89.2 | 89.6 | 89.4 | 89.8 |
| average | 91.1 | 91.8 | 91.8 | 91.9 |

We rename CMN(Type 4) as E-CMN and summarize the algorithm as follows:

(Estimation Step): Two cepstrum mean vectors are calculated for each speaker. One, obtained from speech frames of sufficiently long utterances, is speaker-dependent. The other, obtained from non-speech frames, is environment-dependent.

(Normalization Step): The speaker-dependent cepstrum mean for speech is subtracted from the input cepstrum vector in speech frames. The environment-dependent cepstrum mean for non-speech is subtracted from the input cepstrum vector in non-speech frames.

4. E-CMN/PMC

Various model-adaptive approaches for multiplicative distortion and additive noise have been investigated. The most typical one is an HMM composition method such as NOVO[7] and PMC[8] or their derivatives. The HMM composition method uses equ.(6) to adapt clean HMM models to adverse environments with the estimated multiplicative distortion and the estimated additive noise model.

$$O(\omega;t) = H(\omega) \cdot S(\omega;t) + \tilde{N}(\omega;t) \tag{6}$$

We assume that the additive noise model can be estimated in advance. Then the multiplicative distortion is estimated by ML estimation using

$$\hat{H} = \arg\max_{H} P[O_T \mid H, M_S, M_N] \tag{7}$$

where $O_T$, $M_S$, $M_N$ are the observed linear spectra, the clean speech spectra model and the additive noise spectra model respectively. To solve equ.(7), a steepest descent method[9] and the EM algorithm[4][10] have been investigated. This paper proposes a non-iterative model-adaptive E-CMN/PMC method based on spectra normalization by E-CMN. By making three assumptions, (1) $M_S$ and $M_N$ can each be modeled as a one-state, one-Gaussian-distribution HMM, (2) the multiplicative distortion is independent of the variances of $M_S$ and $M_N$, and (3) order-dependent optimization of equ.(7) is feasible, we can estimate the multiplicative distortion by

$$\hat{H}(\omega) = \frac{\bar{O}(\omega) - \bar{N}(\omega)}{\bar{S}(\omega)} \tag{8}$$

where $\bar{O}(\omega)$, $\bar{N}(\omega)$, $\bar{S}(\omega)$ are the long-term average of the observed spectra, the mean vector of the single distribution in $M_N$, and the mean vector of the single distribution in $M_S$ respectively. Equs.(6) and (8) lead to the following:

$$O(\omega;t) = \hat{H}^{*}(\omega) \cdot S^{*}(\omega;t) + \tilde{N}(\omega;t) \tag{9}$$

$$\hat{H}^{*}(\omega) = \bar{O}(\omega) - \bar{N}(\omega) \tag{10}$$

$$S^{*}(\omega;t) = S(\omega;t) / \bar{S}(\omega) \tag{11}$$

$S^{*}(\omega;t)$ is equal to the normalized spectra in equ.(2). The best normalization for multiplicative distortion was realized by E-CMN as stated before. Equ.(9) suggests that once the HMM models are trained from the normalized cepstrum converted from the spectrum normalized by E-CMN, we can adapt the HMM models to any adverse environment using the estimated multiplicative distortion $\hat{H}^{*}(\omega)$ and the estimated additive noise model $M_N$. Fig.3 briefly describes the algorithm of the E-CMN/PMC method. We note here that the multiplicative distortion is obtained as the cepstrum mean vector for speech frames by E-CMN(Estimation Step). The advantages of this E-CMN/PMC method over other algorithms[4][9][10] are as follows:

(1) More accurate estimation of multiplicative distortion from around 10 words is possible by E-CMN(Estimation Step).
(2) No iterative operations are required to adapt the HMM models, which are derived from training with normalized speech. The matching gains (multiplicative distortions) for all frequency bands of the HMM models to the additive noise model are uniquely estimated as the speaker-dependent cepstrum mean vector by E-CMN(Estimation Step).
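The step from equs.(6) and (8) to equs.(9)-(11) can be checked numerically. The sketch below is only an illustration under assumptions that are not in the paper: the spectra are synthetic random arrays, and sample means over the same frames stand in for the mean vectors of the single-Gaussian models $M_S$ and $M_N$. All variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n_frames, n_bins = 500, 8

S = rng.uniform(0.5, 2.0, size=(n_frames, n_bins))   # clean speech spectra S(w;t)
H = rng.uniform(0.5, 2.0, size=n_bins)               # hypothetical true distortion H(w)
N = rng.uniform(0.0, 0.2, size=(n_frames, n_bins))   # additive noise spectra

O = H * S + N                                        # equ.(6)

# Long-term averages; sample means stand in for the model mean vectors (assumption).
O_bar = O.mean(axis=0)
N_bar = N.mean(axis=0)
S_bar = S.mean(axis=0)

H_hat = (O_bar - N_bar) / S_bar                      # equ.(8)
H_star = O_bar - N_bar                               # equ.(10)
S_star = S / S_bar                                   # equ.(11)
O_rebuilt = H_star * S_star + N                      # equ.(9)
```

Here `H_hat` recovers `H` exactly only because the averages are taken over the very frames that generated `O`; with separately trained models the estimate would be approximate. The identity `O_rebuilt == O` holds regardless, since $\hat{H}^{*} \cdot S^{*} = \hat{H} \cdot S$.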

Fig. 3 E-CMN/PMC method (Training Phase and Recognition Phase).

We investigate two variations of E-CMN/PMC.

(1) E-CMN(clean)/PMC: The cepstrum mean is calculated from 10 words without additive noise. Accurate estimation of multiplicative distortion is possible.
(2) E-CMN(noisy)/PMC: The cepstrum mean is calculated from 10 words with additive noise. No additive noise cancellation is done. The estimation of multiplicative distortion is contaminated by additive noise.

The recognition task is the same as the previous one. The impulse response of Combination 1 in Table 2 is convoluted with the evaluation data (2 males and 2 females) as a multiplicative distortion. Noise recorded in a car cabin was added to the evaluation data at SNRs of 29 dB, 22 dB, 15 dB and 8 dB. The recognition performances using only E-CMN and only PMC are shown in Fig.4. The recognition performances using E-CMN/PMC are shown in Fig.5. The recognition performance for no adaptation is also shown in Fig.4 and Fig.5. These results show that (1) E-CMN outperforms PMC at higher SNR, and (2) E-CMN(noisy)/PMC has worse performance than E-CMN(clean)/PMC at lower SNR.

Fig. 4 Comparison of E-CMN method and PMC method (horizontal axis: SNR in dB).

Fig. 5 Performance of E-CMN/PMC (horizontal axis: SNR in dB).

5. CONCLUSION

We have proposed E-CMN, consisting of two steps: an estimation step to calculate each speaker's cepstrum mean vectors for speech frames and non-speech frames separately, and a normalization step to subtract these vectors from the input cepstrum. Moreover, a new model-adaptive E-CMN/PMC approach is proposed and evaluated for a recognition task in car environments.

6. ACKNOWLEDGMENT

We would like to thank many individuals in the Speech and Acoustics Laboratory of Nara Institute of Science and Technology for useful discussions and suggestions.

7. REFERENCES

[1] A. Acero, "Acoustical and Environmental Robustness in Automatic Speech Recognition", Kluwer Academic Publishers, 1992.
[2] S. Furui, "Cepstral Analysis Technique for Automatic Speaker Verification", IEEE Trans. ASSP-29, pp.254-272.
[3] X. Huang, A. Acero, A. Alleva, M. Y. Hwang, L. Jiang and M. Mahajan, "Microsoft Windows Highly Intelligent Speech Recognizer: Whisper", Proc. ICASSP, Detroit, 1995.
[4] A. Sankar and C. H. Lee, "Robust Speech Recognition Based on Stochastic Matching", Proc. ICASSP, Detroit, 1995.
[5] N. Aoshima, "Computer-generated pulse signal applied for sound measurement", J. Acoust. Soc. Am. 69, pp.1484-1488, 1981.
[6] Recommendation GSM 06.32.
[7] F. Martin, K. Shikano and Y. Minami, "Recognition of Noisy Speech by Composition of Hidden Markov Models", Proc. Eurospeech, pp.1031-1034, 1993.
[8] M. Gales and S. Young, "Cepstrum Parameter Compensation for HMM Recognition", Speech Communication, vol.12, no.3, pp.231-239, 1993.
[9] Y. Minami and S. Furui, "A Maximum Likelihood Procedure for a Universal Adaptation Method Based on HMM Composition", Proc. ICASSP, pp.129-132, 1995.
[10] Y. Minami and S. Furui, "Adaptation Method Based on HMM Composition and EM Algorithm", Proc. ICASSP, pp.327-330, 1996.
