Speech Recognition For a Distant Moving Speaker Based on HMM Composition And Separation

T. Takiguchi†, S. Nakamura‡, K. Shikano‡

†IBM Tokyo Research Laboratory, 1623-14, Shimotsuruma, Yamato-shi, Kanagawa, 242-8502, Japan
‡Graduate School of Information Science, Nara Institute of Science and Technology, 8916-5, Takayama-cho, Ikoma-shi, Nara, 630-0101, Japan

0-7803-6293-4/00/$10.00 © 2000 IEEE

ABSTRACT

This paper describes a hands-free speech recognition method based on HMM composition and separation for speech contaminated not only by additive noise but also by an acoustic transfer function. The method realizes an improved user interface in which the user is not encumbered by microphone equipment in noisy and reverberant environments. In this approach, an attempt is made to model acoustic transfer functions by means of an ergodic HMM [1]. The states of this HMM correspond to different positions of the sound source, so it can represent the positions of sound sources even if the speaker moves. The HMM parameters of the acoustic transfer function are estimated by HMM separation [2], obtained by reversing the process of HMM composition: the model parameters are estimated by maximizing the likelihood of adaptation data uttered from an unknown position. Therefore, measurement of impulse responses is not required. We recorded the speech of a distant moving speaker in real environments, and the experimental results clarify the effectiveness of HMM composition and separation.

1. INTRODUCTION

In hands-free speech recognition, one of the key issues regarding practical use is the development of a technology that allows accurate recognition of noisy and reverberant speech. Many methods have been presented for solving the problems caused by additive noise and convolutional distortion in robust speech recognition. Two common families of such methods are the speech enhancement and model compensation approaches. For the speech enhancement approach, spectral subtraction for additive noise and cepstral mean normalization for convolutional distortion have been proposed (e.g., [3, 4]). For the model compensation approach, the conventional multi-template technique, model adaptation (e.g., [5, 6]), and model (de-)composition methods (e.g., [1, 7, 8, 9, 10]) have been developed.

We applied HMM composition to the recognition of speech contaminated not only by additive noise but also by the reverberation of the room [1]. We also proposed HMM separation for estimating the HMM parameters of an acoustic transfer function [2]. The model parameters are estimated by maximizing the likelihood of adaptation data uttered from an unknown position. This paper describes the performance of HMM composition and separation for recognition of the speech of a distant moving speaker. Such speech is recognized by using an ergodic HMM of acoustic transfer functions. Each state of this ergodic HMM corresponds to a position in a room, and all transitions among states are permitted; the ergodic HMM of acoustic transfer functions is therefore able to trace the positions of sound sources. First, we give a brief overview of HMM composition [1]. Following this, we describe a method for estimating the HMM parameters of the acoustic transfer function, based on HMM separation [2]. Finally, we report the performance of HMM composition and separation for the speech of a distant moving speaker.

2. HMM COMPOSITION FOR NOISY AND REVERBERANT SPEECH

The observed speech in a noisy and reverberant room is represented by

    O(ω; m) = S(ω; m) · H(ω; m) + N(ω; m),

where O(ω; m), S(ω; m), H(ω; m), and N(ω; m) are short-term linear spectra for the observed speech, the clean speech, an acoustic transfer function, and the noise in the analysis window m, respectively.
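To make the additivity behind this model concrete, the sketch below (mine, not the paper's; all spectra are invented toy values) composes an observed spectrum both directly as S·H + N and via the cepstral domain, where the logarithm turns the product S·H into a sum of cepstra:

```python
import math

# Toy numerical check of O = S*H + N (illustrative only, not from the paper):
# the product S*H becomes the sum S_cep + H_cep after the log and cosine
# (cepstral) transforms, while the noise stays additive in the linear domain.

def dct(x):
    """DCT-II: maps a log-spectrum to a cepstrum."""
    n = len(x)
    return [sum(x[i] * math.cos(math.pi * k * (i + 0.5) / n) for i in range(n))
            for k in range(n)]

def idct(c):
    """Exact inverse of dct() above (DCT-III with matching scaling)."""
    n = len(c)
    return [c[0] / n + (2.0 / n) * sum(c[k] * math.cos(math.pi * k * (i + 0.5) / n)
                                       for k in range(1, n))
            for i in range(n)]

def observe(s_lin, h_lin, n_lin):
    """Compose the observed linear spectrum via the cepstral route."""
    s_cep = dct([math.log(v) for v in s_lin])        # clean-speech cepstrum
    h_cep = dct([math.log(v) for v in h_lin])        # transfer-function cepstrum
    sh_lin = [math.exp(v) for v in idct([a + b for a, b in zip(s_cep, h_cep)])]
    return [sh + n for sh, n in zip(sh_lin, n_lin)]  # noise stays additive

s = [2.0, 1.0, 0.5, 0.25]   # toy clean spectrum
h = [0.8, 0.6, 0.4, 0.2]    # toy acoustic transfer function
n = [0.1, 0.1, 0.1, 0.1]    # toy noise spectrum
via_cepstrum = observe(s, h, n)
direct = [si * hi + ni for si, hi, ni in zip(s, h, n)]
print(all(abs(a - b) < 1e-9 for a, b in zip(via_cepstrum, direct)))  # True
```

Both routes agree because the cosine transform is linear; this same additivity is what allows HMM composition to combine cepstral-domain speech and transfer-function models before adding a linear-domain noise model.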

HMM composition is applicable if two stochastic information sources are additive. To apply HMM composition, the equation can be rewritten as follows:

    O(ω; m) = exp(cos(S_cep(t; m) + H_cep(t; m))) + N(ω; m),    (1)

where S_cep(t; m) and H_cep(t; m) are the cepstra of quefrency t in the analysis window m for the clean speech and the acoustic transfer function, respectively. Accordingly, a composed HMM of the observed speech in the cepstral domain is represented by

    λ_O,cep = Cos⁻¹[Log{Exp(Cos(λ_S,cep ⊕ λ_H,cep)) ⊕ k·λ_N,lin}],

where λ represents an associated HMM, and the suffixes cep and lin denote the cepstral domain and the linear-spectral domain, respectively. Cos, Log, and Exp are the cosine transform, logarithm transform, and exponential transform of the Gaussian pdf, respectively. To adjust the signal-to-noise ratio (SNR), a coefficient k is used, and ⊕ denotes the model composition procedure. The HMM recognizer decodes the observed speech on a trellis diagram by maximizing the log-likelihood; the decoded path finds an optimal combination of speech, noise, and the acoustic transfer function.

2.1. Modeling of the Acoustic Transfer Function

Figure 3 shows the acoustic transfer function HMM in the case of three states. Each state of the acoustic transfer function HMM corresponds to a position in a room, and all transitions among states are permitted. Therefore, the acoustic transfer function HMM is able to represent the positions of sound sources, even if the speaker moves.

3. ESTIMATION OF THE ACOUSTIC TRANSFER FUNCTION ON THE BASIS OF HMM SEPARATION

The model parameters are estimated in an ML manner by using the expectation-maximization (EM) algorithm, which maximizes the likelihood of the observed speech:

    λ̂_H = argmax_{λ_H} Pr(O | λ_S, λ_N, λ_H).

Here, λ denotes the set of HMM parameters, while the suffixes S, N, and H denote the clean speech, the noise, and the acoustic transfer function. The observed speech is represented by equation (1). Accordingly, the acoustic transfer function is represented by

    H_cep(t; m) = cos⁻¹[log{exp(cos(O_cep(t; m))) − N(ω; m)}] − S_cep(t; m).

The estimation equation of the acoustic transfer function HMM is written as

    λ_H,cep = Cos⁻¹[Log{Exp(Cos(λ_O,cep)) ⊖ λ_N,lin}] ⊖ λ_S,cep,

where the separation of HMMs is represented by the ⊖ operator. This equation shows that HMM separation is applied twice to noisy and reverberant speech. First, HMM separation is applied in the linear-spectral domain to estimate the distorted-speech HMMs by ML estimation. Then, the distorted-speech HMMs are converted to the cepstral domain, and HMM separation is applied again in the cepstral domain to estimate the acoustic transfer function HMM by ML estimation. Figure 1 illustrates HMM separation.

Figure 1: Illustration of model separation. The composed HMM is separated into a known HMM and an unknown HMM.

4. EXPERIMENTS AND RESULTS

4.1. Experimental Conditions

The recognition algorithm is based on tied-mixture diagonal-covariance HMMs. Each HMM has three states and three self-loops. The models of 55 context-independent phonemes are trained by using about 9,600 sentences uttered by 64 speakers, which are contained in the Acoustical Society of Japan (ASJ) continuous-speech database.
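The composition and separation chains above operate on full Gaussian HMM parameters via EM. As a deliberately simplified illustration (mine, not the paper's algorithm), the sketch below reduces every model to a single cepstral mean vector, so that ⊕ and ⊖ become vector addition and subtraction around the exp/log domain changes:

```python
import math

# Means-only sketch of HMM composition and separation (illustrative only):
# compose() follows lambda_O = Cos^-1[Log{Exp(Cos(lambda_S (+) lambda_H)) (+) k*lambda_N}],
# separate() runs the chain in reverse to recover the transfer-function model.
# Real HMM separation re-estimates full Gaussian parameters by ML/EM on
# adaptation data; here each model is just one cepstral mean vector.

def dct(x):
    """DCT-II: log-spectrum -> cepstrum."""
    n = len(x)
    return [sum(x[i] * math.cos(math.pi * k * (i + 0.5) / n) for i in range(n))
            for k in range(n)]

def idct(c):
    """Exact inverse of dct() above."""
    n = len(c)
    return [c[0] / n + (2.0 / n) * sum(c[k] * math.cos(math.pi * k * (i + 0.5) / n)
                                       for k in range(1, n))
            for i in range(n)]

def compose(mu_s_cep, mu_h_cep, mu_n_lin, k=1.0):
    """Cepstral means of S and H plus linear noise mean -> cepstral mean of O."""
    sh_lin = [math.exp(v) for v in idct([a + b for a, b in zip(mu_s_cep, mu_h_cep)])]
    o_lin = [sh + k * n for sh, n in zip(sh_lin, mu_n_lin)]
    return dct([math.log(v) for v in o_lin])

def separate(mu_o_cep, mu_s_cep, mu_n_lin, k=1.0):
    """Reverse of compose(): subtract noise in the linear domain, then the
    clean-speech cepstrum, leaving the transfer-function cepstral mean."""
    o_lin = [math.exp(v) for v in idct(mu_o_cep)]
    sh_lin = [o - k * n for o, n in zip(o_lin, mu_n_lin)]
    sh_cep = dct([math.log(v) for v in sh_lin])
    return [a - b for a, b in zip(sh_cep, mu_s_cep)]

mu_s = [1.0, 0.5, -0.3, 0.1]     # toy clean-speech cepstral mean
mu_h = [-0.2, 0.4, 0.0, -0.1]    # toy transfer-function cepstral mean
mu_n = [0.05, 0.05, 0.05, 0.05]  # toy noise mean (linear spectrum)
mu_o = compose(mu_s, mu_h, mu_n)
mu_h_hat = separate(mu_o, mu_s, mu_n)
print(all(abs(a - b) < 1e-9 for a, b in zip(mu_h_hat, mu_h)))  # True
```

In this means-only toy, separation recovers the transfer-function mean exactly; in the actual method the subtraction steps are replaced by ML estimation over Gaussian pdfs, which is what makes measured impulse responses unnecessary.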

The speech signal is sampled at 12 kHz and windowed with a 32-msec Hamming window every 8 msec. An FFT is then used to calculate 16-order MFCCs (mel-frequency cepstral coefficients) and power. In recognition, the power term is not used, because it is only needed to adjust the SNR in HMM composition. The 16-order MFCCs, their first-order differentials (ΔMFCC), and the first-order differentials of normalized logarithmic energy (Δpower) are calculated as the observation vector of each frame. There are 256 Gaussian mixture components with diagonal covariance matrices shared by all of the models for MFCC and for ΔMFCC, respectively, and 128 Gaussian mixture components shared by all of the models for Δpower. A single Gaussian is employed to model an acoustic transfer function.

Figure 2 shows the recording conditions for the speech of the distant moving speaker. One male speaker walks from the "starting position" shown in figure 2 and utters 31 sentences while moving. We also record the speech of a distant stationary speaker at the positions of sound sources g1, g2, and g3. One sentence is used for estimation of each acoustic transfer function. Figure 3 shows the composed ergodic HMM used in the experiments.

Figure 2: Recording conditions for the speech of a distant moving speaker.

Figure 3: Example of a composed ergodic HMM in experiments with a distant moving speaker.

4.2. Experimental Results

The points to be investigated are the performance of:

• Parallel models of acoustic transfer functions: composed HMMs for each acoustic transfer function (each position) are individually set; likelihood scores for the composed HMMs are calculated, and the composed HMM having the maximum likelihood is then selected.

• Ergodic models of acoustic transfer functions.

Table 2: Phrase accuracy [%] for a distant moving speaker.

    Models                      Phrase accuracy
    Clean-speech HMMs           63.3
    Parallel models             76.7
    Ergodic HMMs (g1, g2, g3)   82.3
    Ergodic HMMs (g1, g2)       78.6
    Ergodic HMMs (g1, g3)       76.3
    Ergodic HMMs (g2, g3)       80.0

A phrase recognition experiment was carried out for continuous-sentence speech, in which the sentences included 6 to 7 phrases on average. The task contained 306 phrases with a phrase perplexity of 306. The phrase accuracy for a close-talking microphone was 90.4%.

Table 1 shows the phrase accuracy for a distant stationary speaker. The phrase accuracy with clean-speech HMMs was 69.5%. Next, we compose the clean-speech HMMs with each of the acoustic transfer function HMMs g1, g2, and g3.
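A minimal sketch of how an ergodic transfer-function HMM with equal transition probabilities can trace a moving speaker's position is given below. The states, per-frame likelihood values, and Viterbi decoding are illustrative assumptions of mine, not the paper's recognizer:

```python
import math

# Toy ergodic acoustic-transfer-function HMM: one state per position
# (g1, g2, g3), every transition allowed with equal probability, and a
# Viterbi pass tracing the most likely position sequence. The per-frame
# state log-likelihoods are invented, not measured data.

STATES = ["g1", "g2", "g3"]
N = len(STATES)
log_trans = [[math.log(1.0 / N)] * N for _ in range(N)]  # equal transitions
log_init = [math.log(1.0 / N)] * N

def viterbi(log_obs):
    """log_obs[t][j] = log-likelihood of frame t under position state j."""
    delta = [log_init[j] + log_obs[0][j] for j in range(N)]
    back = []
    for t in range(1, len(log_obs)):
        ptr, new = [], []
        for j in range(N):
            best = max(range(N), key=lambda i: delta[i] + log_trans[i][j])
            ptr.append(best)
            new.append(delta[best] + log_trans[best][j] + log_obs[t][j])
        back.append(ptr)
        delta = new
    j = max(range(N), key=lambda i: delta[i])
    path = [j]
    for ptr in reversed(back):
        j = ptr[j]
        path.append(j)
    path.reverse()
    return [STATES[i] for i in path]

# Speaker walks from g1 toward g3: early frames favour g1, later frames g3.
toy_obs = [[-1.0, -2.0, -3.0],
           [-1.2, -1.8, -2.5],
           [-2.5, -1.5, -1.2],
           [-3.0, -2.0, -1.0]]
print(viterbi(toy_obs))  # prints ['g1', 'g1', 'g3', 'g3']
```

With equal transition probabilities the decoded path simply follows the per-frame likelihoods, which mirrors the paper's observation that an equally-weighted ergodic HMM can occasionally choose a wrong path; estimating the transition probabilities from data is what figure 4 illustrates.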

The performance of the parallel models, where the composed HMMs having the maximum likelihood are selected, is 76.5% on average. The performance of the composed ergodic HMMs (shown in figure 3) is 75.5% on average. Comparing this result with that for the parallel models shows a difference in performance of 1.0%. This is because all transition probabilities of acoustic transfer functions in the ergodic HMM are set equally, and a wrong path might be chosen. The table also indicates that the closest position, g3, gives the best performance: the greater the distance between the microphone and the position of the sound source, the lower the phrase accuracy, because the SNR decreases.

Table 2 shows the phrase accuracy for the distant moving speaker. The phrase accuracy with clean-speech HMMs is 63.3%. The performance of the parallel models, where the composed HMMs having the maximum likelihood are selected, is 76.7%. In comparison with the case of the distant stationary speaker, the performance for the distant moving speaker is slightly better, because little speech data was recorded while the distant moving speaker was in the vicinity of g1. The performance with the ergodic HMMs of the acoustic transfer functions at g1, g2, and g3 improves to 82.3%. These experimental results show the effectiveness of the ergodic HMMs for recognition of the speech of the distant moving speaker.

Figure 4 shows the estimated transition probabilities from the initial state to each state; these are estimated by maximizing the likelihood of one sentence of testing data every 0.8 msec. As the testing speaker walks from position g1 to position g3, the transition probability from the initial state to position g1 is highest in the first interval; as time elapses, the transition probability to position g3 increases.

Figure 4: Estimated transition probabilities from the initial state to each state.

5. CONCLUSION

This paper has detailed a robust speech recognition technique for acoustic model adaptation based on HMM composition and separation in noisy and reverberant environments, where a user speaks from a distance of 0.5 m to 3.0 m. The aim of the HMM composition and separation methods is to estimate the model parameters so as to adapt the model to a target environment by using a small amount of a user's speech. In this approach, an attempt is made to model the acoustic transfer functions by means of an ergodic HMM whose states correspond to different sound-source positions; this HMM can represent the positions of sound sources even if the speaker moves. We investigated the performance of HMM composition and separation for recognition of the speech of a distant moving speaker, which is recognized by using an ergodic HMM of acoustic transfer functions. The experimental results show that the ergodic HMM can improve the performance of speech recognition for a distant moving speaker. In future work, we will investigate how to choose the number of states in the ergodic HMM.

6. REFERENCES

[1] S. Nakamura, T. Takiguchi, and K. Shikano, "Noise and room acoustics distorted speech recognition by HMM composition," in Proc. ICASSP, pp. 69-72, 1996.
[2] T. Takiguchi, S. Nakamura, Q. Huo, and K. Shikano, "Adaptation of model parameters by HMM decomposition in noisy reverberant environments," in Proc. ESCA-NATO Workshop on Robust Speech Recognition for Unknown Communication Channels, pp. 155-158, 1997.
[3] S. F. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. ASSP-27, no. 2, 1979.
[4] A. Acero, Acoustical and Environmental Robustness in Automatic Speech Recognition, Ph.D. dissertation, ECE Department, CMU, Sep. 1990.
[5] A. Sankar and C.-H. Lee, "Robust speech recognition based on stochastic matching," in Proc. ICASSP, pp. 121-124, 1995.
[6] V. Abrash, A. Sankar, H. Franco, and M. Cohen, "Acoustic adaptation using transformations of HMM parameters," in Proc. ICASSP, pp. 729-732, 1996.
[7] A. P. Varga and R. K. Moore, "Hidden Markov model decomposition of speech and noise," in Proc. ICASSP, pp. 845-848, 1990.
[8] M. J. F. Gales and S. J. Young, "An improved approach to the hidden Markov model decomposition of speech and noise," in Proc. ICASSP, pp. 233-236, 1992.
[9] F. Martin, K. Shikano, and Y. Minami, "Recognition of noisy speech by composition of hidden Markov models," in Proc. EUROSPEECH, pp. 1031-1034, 1993.
[10] Y. Minami and S. Furui, "A maximum likelihood procedure for a universal adaptation method based on HMM composition," in Proc. ICASSP, pp. 129-132, 1995.

