Speech Recognition for a Distant Moving Speaker Based on HMM Composition and Separation
HMM composition is applicable if two stochastic information sources are additive. To apply HMM composition, the equation can be rewritten as follows:

  O(ω; m) = exp(cos(Scep(t; m) + Hcep(t; m))) + N(ω; m),  (1)

where Scep(t; m) and Hcep(t; m) are the cepstra for the clean speech and the acoustic transfer function at quefrency t in the analysis window m. Accordingly, a composed HMM of the observed speech in the cepstral domain is represented by

  λO,cep = Cos^-1[Log{Exp(Cos(λS,cep ⊕ λH,cep)) ⊕ k·λN,lin}],  (2)

where λ represents an associated HMM, and the suffixes cep and lin represent the cepstral domain and the linear-spectral domain, respectively. Cos, Log, and Exp are the cosine transform, logarithm transform, and exponential transform of the Gaussian pdf, respectively. To adjust the signal-to-noise ratio (SNR), a coefficient, k, is used, and ⊕ denotes the model composition procedure.

The HMM recognizer decodes observed speech on a trellis diagram by maximizing the log-likelihood. The decoded path will find an optimal combination of speech, noise, and the acoustic transfer function.

2.1. Modeling of the Acoustic Transfer Function

Figure 3 shows the acoustic transfer function HMM in the case of three states. Each state of the acoustic transfer function HMM corresponds to a position in a room, and all transitions among states are permitted. Therefore, the acoustic transfer function HMM is able to represent the positions of sound sources, even if the speaker moves.
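The composition procedure of equation (2) can be sketched numerically. The following is a minimal sketch for a single Gaussian state using mean vectors only (full composition also combines covariances and mixture weights); the function names and the 4-dimensional toy vectors are illustrative, not from the paper:

```python
import math

def cosine_matrix(n):
    # Orthonormal DCT-II matrix standing in for the cosine transform Cos;
    # because the rows are orthonormal, the transpose is the inverse Cos^-1.
    rows = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        rows.append([scale * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
                     for i in range(n)])
    return rows

def matvec(m, v):
    # Plain matrix-vector product.
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def compose_mean(s_cep, h_cep, n_lin, k=1.0):
    # Eq. (2) restricted to mean vectors: add the speech and transfer-function
    # cepstra, map to the linear spectrum, add the scaled noise spectrum,
    # and map back to the cepstral domain.
    n = len(s_cep)
    C = cosine_matrix(n)
    Ct = [list(col) for col in zip(*C)]                 # inverse cosine transform
    sh_cep = [a + b for a, b in zip(s_cep, h_cep)]      # S (+) H, cepstral domain
    sh_lin = [math.exp(x) for x in matvec(C, sh_cep)]   # to linear spectrum
    o_lin = [a + k * b for a, b in zip(sh_lin, n_lin)]  # (+) k * noise, linear domain
    return matvec(Ct, [math.log(x) for x in o_lin])     # back to cepstrum
```

With a zero noise spectrum the procedure reduces to plain cepstral addition, which is a convenient sanity check on the transform chain.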
3. ESTIMATION OF THE ACOUSTIC TRANSFER FUNCTION ON THE BASIS OF HMM SEPARATION

Here, λ denotes the set of HMM parameters, while the suffixes S, N, and H denote clean speech, noise, and the acoustic transfer function. The observed speech is now represented by equation (1). Accordingly, the acoustic transfer function is represented by

  Hcep(t; m) = cos^-1[log{exp(cos(Ocep(t; m))) - N(ω; m)}] - Scep(t; m).  (3)

Model parameters are estimated in an ML manner by using the expectation-maximization (EM) algorithm, which maximizes the likelihood of the observed speech:

  λ̂H = argmax over λH of Pr(O_S+H | λH, λS).  (4)

The estimation equation of the acoustic transfer function HMM is written as

  λH,cep = Cos^-1[Log{Exp(Cos(λO,cep)) ⊖ λN,lin}] ⊖ λS,cep,  (5)

where the separation of HMMs is represented by the ⊖ operator. This equation shows that HMM separation is applied twice to noisy and reverberant speech. First, HMM separation is applied in the linear-spectral domain to estimate the distorted-speech HMMs by ML estimation. Then, the distorted-speech HMMs are converted to the cepstral domain, and HMM separation is applied again in the cepstral domain to estimate the acoustic transfer function HMM by ML estimation. Figure 1 illustrates HMM separation.

Figure 1: Illustration of model separation. The composed HMM is separated into a known HMM and an unknown HMM.
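The separation step of equation (5) inverts the same chain of transforms. Below is a minimal means-only sketch under the same single-Gaussian simplification (the ML re-estimation over state alignments is omitted, and the helper names are illustrative):

```python
import math

def cosine_matrix(n):
    # Orthonormal DCT-II matrix for the cosine transform; transpose = inverse.
    rows = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        rows.append([scale * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
                     for i in range(n)])
    return rows

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def separate_mean(o_cep, n_lin, s_cep):
    # Eq. (5) restricted to mean vectors: remove the noise spectrum in the
    # linear-spectral domain, then remove the clean-speech cepstrum in the
    # cepstral domain; the remainder is the transfer-function mean.
    n = len(o_cep)
    C = cosine_matrix(n)
    Ct = [list(col) for col in zip(*C)]
    o_lin = [math.exp(x) for x in matvec(C, o_cep)]   # Exp(Cos(o_cep))
    sh_lin = [a - b for a, b in zip(o_lin, n_lin)]    # (-) noise, linear domain
    sh_cep = matvec(Ct, [math.log(x) for x in sh_lin])
    return [a - b for a, b in zip(sh_cep, s_cep)]     # (-) speech, cepstral domain
```

Composing known speech, transfer-function, and noise means and then separating them again recovers the transfer-function mean exactly, which mirrors the two-stage (linear-spectral, then cepstral) separation described above.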
4. EXPERIMENTS AND RESULTS

4.1. Experimental Conditions

The recognition algorithm is based on tied-mixture diagonal-covariance HMMs. Each HMM has three states and three self-loops. The models of 55 context-independent phonemes are trained by using about 9600 sentences uttered by 64 speakers, which are contained in the Acoustical Society of Japan (ASJ) continuous-speech database.

The speech signal is sampled at 12 kHz and windowed with a 32-msec Hamming window every 8 msec. Then FFT is used to calculate 16-order MFCCs (mel-frequency cepstral coefficients) and power. In recognition, the power term is not used, because it is only necessary to adjust the SNR in HMM composition. Sixteen-order MFCCs with their first-order differentials (ΔMFCC), and first-order differentials of normalized logarithmic energy (Δpower), are calculated as the observation vector of each frame. There are 256 Gaussian mixture components with diagonal covariance matrices shared by all of the models for MFCC and ΔMFCC, respectively. There are 128 Gaussian mixture components shared by all of the models for Δpower. A single Gaussian is employed to model an acoustic transfer function.

Figure 2 shows the recording conditions for the speech of the distant moving speaker. One male speaker walks from the "starting position" shown in figure 2 and utters 31 sentences while moving. We also record the speech of a distant stationary speaker from the positions of sound sources g1, g2, and g3. One sentence is used for estimation of each acoustic transfer function. Figure 3 shows the composed ergodic HMM used in the experiments.

Figure 2: Recording conditions for the speech of a distant moving speaker.

Figure 3: Example of a composed ergodic HMM in experiments with a distant moving speaker.
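The framing stage of this front end (32-msec Hamming windows every 8 msec at 12 kHz) can be sketched as follows; the FFT, mel filterbank, and cepstral steps that produce the MFCCs are omitted, and the function names are illustrative:

```python
import math

SAMPLE_RATE = 12000                      # 12-kHz sampling
FRAME_LEN = int(0.032 * SAMPLE_RATE)     # 32-msec window -> 384 samples
FRAME_SHIFT = int(0.008 * SAMPLE_RATE)   # 8-msec shift   -> 96 samples

def hamming(n):
    # Hamming window of length n.
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def windowed_frames(signal):
    # Slice the signal into overlapping frames and apply the window to each;
    # every returned frame feeds the (omitted) FFT/MFCC computation.
    win = hamming(FRAME_LEN)
    return [[w * x for w, x in zip(win, signal[start:start + FRAME_LEN])]
            for start in range(0, len(signal) - FRAME_LEN + 1, FRAME_SHIFT)]
```

One second of signal (12000 samples) yields 122 overlapping frames of 384 samples each at these settings.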
4.2. Experimental Results

The points to be investigated are the performance of:

• Parallel models of acoustic transfer functions: Composed HMMs for each acoustic transfer function (each position) are individually set. Likelihood scores for their composed HMMs are calculated, and the composed HMMs having the maximum likelihood are then selected.

• Ergodic models of acoustic transfer functions.

A phrase recognition experiment was carried out for continuous-sentence speech, in which the sentences included 6 to 7 phrases on average. The task contained 306 phrases with a phrase perplexity of 306. The phrase accuracy for a close-talking microphone was 90.4%.

Table 1 shows the phrase accuracy for a distant stationary speaker. The phrase accuracy with clean-speech HMMs was 69.5%. Next, we compose the clean-speech HMMs and each of the acoustic transfer function HMMs, g1, g2, and g3.

Table 2: Phrase accuracy [%] for a distant moving speaker.

  Models                       Phrase accuracy
  Clean-speech HMMs                 63.3
  Parallel models                   76.7
  Ergodic HMMs (g1, g2, g3)         82.3
  Ergodic HMMs (g1, g2)             78.6
  Ergodic HMMs (g1, g3)             76.3
  Ergodic HMMs (g2, g3)             80.0
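The parallel-model selection described in the first bullet can be sketched as follows. A toy diagonal-Gaussian score stands in for the full composed-HMM trellis likelihood, and the model dictionary keyed by position labels (g1, g2, g3) is an illustrative interface, not the paper's implementation:

```python
import math

def gaussian_loglik(obs, mean, var):
    # Diagonal-Gaussian log-likelihood of a frame sequence; a toy stand-in
    # for the trellis score of a full composed HMM.
    ll = 0.0
    for frame in obs:
        for x, m, v in zip(frame, mean, var):
            ll -= 0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
    return ll

def select_position(obs, models):
    # Score the utterance against the composed model for each position
    # and keep the one with the maximum likelihood.
    return max(models, key=lambda name: gaussian_loglik(obs, *models[name]))
```

An utterance whose frames lie closest to one position's model is assigned that position's composed HMM, which is exactly the argmax selection the parallel scheme performs.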
The performance of the parallel models, where composed HMMs having maximum likelihood are selected, is 76.5% on average. The performance of the composed ergodic HMMs (shown in figure 3) is 75.5% on average. Comparison of this result with that for the parallel models shows a difference in performance of 1.0%. This is because all transition probabilities of acoustic transfer functions in the ergodic HMM are set equally, and a wrong path might be chosen. This table also indicates that the closest position, g3, results in the best performance. The greater the distance between the microphone and the positions of the sound sources, the more the phrase accuracy decreases. This is because the SNR decreases.

Table 2 shows the phrase accuracy for the distant moving speaker. The phrase accuracy with clean-speech HMMs is 63.3%. The performance of the parallel models, where composed HMMs having maximum likelihood are selected, is 76.7%. In comparison with the case of the distant stationary speaker, the performance for the distant moving speaker is slightly better, because there were few speech data recorded while the distant moving speaker was in the vicinity of g1. The performance with the ergodic HMMs of the acoustic transfer functions at g1, g2, and g3 is improved to 82.3%. These experimental results show the effectiveness of the ergodic HMMs for recognition of the speech of the distant moving speaker.

Figure 4 shows the estimated transition probabilities from the initial state to each state; these are estimated by maximizing the likelihood of one sentence of testing data every 0.8 sec. As the testing speaker is walking from position g1 to position g3, the transition probability from the initial state to position g1 is highest in the first interval. The more time elapses, the more the transition probability to position g3 increases.

Figure 4: Estimated transition probabilities from the initial state to each state.

5. CONCLUSION

This paper has detailed a robust speech recognition technique for acoustic model adaptation based on HMM composition and separation in noisy and reverberant environments, where a user speaks from a distance of 0.5 m - 3.0 m. The aim of the HMM composition and separation methods is to estimate the model parameters so as to adapt the model to a target environment by using a small amount of a user's speech. In this approach, an attempt is made to model the acoustic transfer functions by means of an ergodic HMM. The states of the acoustic transfer function HMM correspond to different sound-source positions. This HMM can represent the positions of sound sources, even if the speaker moves.

This paper investigated the performance of HMM composition and separation for recognition of the speech of a distant moving speaker. Such speech is recognized by using an ergodic HMM of acoustic transfer functions. The experimental results show that the ergodic HMM can improve the performance of speech recognition for a distant moving speaker. In future work, we will investigate how to choose the number of states in the ergodic HMM.