Improvement of Acoustic Model for Hands-Free Speech Recognition Using Spatial Subtraction Array

2007 RISP International Workshop on Nonlinear Circuits and Signal Processing (NCSP'07), Shanghai Jiao Tong University, Shanghai, China, Mar. 3-6, 2007.

Ayase Takagi†, Yoshimitsu Mori†, Yu Takahashi†, Hiroshi Saruwatari† and Kiyohiro Shikano†

†Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma-shi, Nara 630-0192, Japan
Phone: +81-743-72-5287, Fax: +81-743-72-3321, Email: ayase-t@is.naist. j

Abstract

Noise and reverberation adaptation techniques are essential for robust speech recognition in real environments. In this paper, we first describe a new noise-robust speech recognition system that incorporates our previously proposed spatial subtraction array (SSA), which is effective for hands-free speech recognition. Next, we introduce an SSA-matched acoustic model into the SSA-based speech recognition system and assess the model's portability and robustness in various environments. The SSA-matched acoustic model can compensate for the effects of noise, reverberation, and the distortion caused by the signal processing of SSA. The experimental results show that the proposed SSA-matched model achieves the highest recognition performance and remains effective when the environment changes.

1. Introduction

Recent progress in speech recognition technology has encouraged various practical speech applications such as robot interfaces. However, requiring users to hold a microphone or wear a headset microphone spoils convenience and naturalness. Recognizing distant-talking speech with a far-field microphone may ease this problem, so a hands-free speech system is in demand. A serious problem of hands-free speech recognition in real environments is that reverberation and background noise often degrade recognition accuracy significantly.
A possible solution is noise suppression by microphone array signal processing. Many types of microphone arrays, e.g., the Delay-and-Sum array (DS) [1] and the Griffiths-Jim adaptive array (GJ) [2], have been proposed in past research. Although GJ can achieve better performance than the others, it requires a huge amount of computation to learn adaptive FIR filters with thousands or millions of taps. To resolve this problem, we newly propose an acoustic model for a spatial subtraction array (SSA) [3] and its improvement method to achieve robust hands-free speech recognition under noisy environments. In the proposed method, noise reduction is achieved by subtracting the estimated noise spectrum from the target speech spectrum to be enhanced in the mel-scale filter bank domain. Moreover, since the method operates in the mel-scale filter bank domain, the transform into mel-frequency cepstrum coefficients (MFCC) [4] becomes easier, which reduces the amount of computation in SSA compared with GJ. Experimental results obtained in a real environment show that the word accuracy of the proposed method is higher than those of DS and GJ.

Figure 1: Configuration of SSA.

In addition, we propose an SSA-matched acoustic model. Previous reports improved the acoustic model by superimposing known noise [5] or reverberation. Conventional studies have used a simple acoustic model trained under noise-free and distortion-free conditions. Spectral subtraction improves the signal-to-noise ratio (SNR), but the processed speech spectrum differs greatly from that of the original speech, so accurate speech recognition cannot be performed with such an acoustic model. We therefore create an acoustic model that takes into account the noise, the influence of reverberation, and the signal distortion that arise when speech is processed by SSA. Assuming real environments, we test it against several acoustic models.

2. Proposed method

2.1. SSA

Figure 1 shows the configuration of the proposed SSA. SSA includes mel-scale filter bank analysis and outputs mel-frequency cepstrum coefficients (MFCC). The triangular window $W(k; l)$ $(l = 1, \ldots, L)$ used for mel-scale filter bank analysis is defined as follows:

\[ W(k; l) = \begin{cases} \dfrac{k - k_{lo}(l)}{k_c(l) - k_{lo}(l)} & (k_{lo}(l) \le k \le k_c(l)), \\[2ex] \dfrac{k_{hi}(l) - k}{k_{hi}(l) - k_c(l)} & (k_c(l) \le k \le k_{hi}(l)), \end{cases} \tag{1} \]

where $k_{lo}(l)$, $k_c(l)$ and $k_{hi}(l)$ are the lower, center, and higher frequency bins of each triangular window, respectively. Adjacent windows satisfy the relation

\[ k_c(l) = k_{hi}(l - 1) = k_{lo}(l + 1). \tag{2} \]

Moreover, $k_c(l)$ is arranged at regular intervals on the mel-frequency scale. The mel-scale frequency $\mathrm{Mel}_{k_c(l)}$ for $k_c(l)$ is calculated as

\[ \mathrm{Mel}_{k_c(l)} = 2595 \log_{10} \left\{ 1 + \frac{k_c(l) \cdot f_s}{700 \cdot M} \right\}, \tag{3} \]

where $f_s$ is the sampling frequency and $M$ is the DFT size.

Figure 2: SSA-matched acoustic model with SSA.

In the proposed SSA, noise reduction is achieved by subtracting the estimated noise power spectrum from the target speech power spectrum to be enhanced in the mel-scale filter bank domain. This realizes error-robust noise reduction with low computational complexity because the parameters are optimized over only a small number of mel-scale filter bank channels. The procedure is given by

\[ m(l, \tau) = \begin{cases} \displaystyle \sum_{k = k_{lo}(l)}^{k_{hi}(l)} W(k; l) \left\{ |Y_{DS}(k, \tau)|^2 - \alpha(l) \cdot \beta \cdot |Z_{NBF}(k, \tau)|^2 \right\}^{1/2} & (\text{if } |Y_{DS}(k, \tau)|^2 - \alpha(l) \cdot \beta \cdot |Z_{NBF}(k, \tau)|^2 \ge 0), \\[2ex] \displaystyle \sum_{k = k_{lo}(l)}^{k_{hi}(l)} W(k; l) \left\{ \gamma \cdot |Y_{DS}(k, \tau)| \right\} & (\text{otherwise}), \end{cases} \tag{4} \]

where $k$ is the frequency bin, $k_{lo}(l)$ and $k_{hi}(l)$ are the lower and higher frequency bins of each triangular window, respectively, $l$ is the filter index, $\tau$ is the frame index, $m(l, \tau)$ is the output from the mel-scale filter bank, and $W(k; l)$ is the triangular window used for mel-scale filter bank analysis. $Y_{DS}(k, \tau)$ is the output signal from DS, i.e., the partly enhanced speech signal, and $Z_{NBF}(k, \tau)$ is the output signal from the null beamformer (NBF), in which the directional null is steered in the direction of arrival (DOA) of the user, i.e., the estimated noise signal.
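The filter bank construction of Eqs. (1)-(3) and the subtraction of Eq. (4) can be illustrated with a minimal, dependency-free Python sketch. It is not the authors' implementation. The function names (hz_to_mel, filter_edges, triangle, ssa_filterbank_frame) are hypothetical, the filter edges are assumed to be equally spaced on the mel scale from 0 Hz to f_s/2, and the sketch assumes one reading of Eq. (4): the square root of the subtracted power in the non-negative case, a floor of gamma * |Y_DS| otherwise, with the case chosen per frequency bin.

```python
import math

def hz_to_mel(f_hz):
    """Mel value of a frequency in Hz (the mapping used in Eq. (3))."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def filter_edges(num_filters, dft_size, fs):
    """Bin indices e[0..L+1], equally spaced on the mel scale up to fs/2.

    Filter l (1-based) uses k_lo = e[l-1], k_c = e[l], k_hi = e[l+1],
    so adjacent filters satisfy Eq. (2). The edge range is an assumption.
    """
    mel_max = hz_to_mel(fs / 2.0)
    edges = []
    for i in range(num_filters + 2):
        f_hz = mel_to_hz(mel_max * i / (num_filters + 1))
        edges.append(int(round(f_hz * dft_size / fs)))  # bin k <-> k * fs / M Hz
    return edges

def triangle(k, lo, c, hi):
    """Triangular window W(k; l) of Eq. (1) for one filter."""
    if k == c:
        return 1.0
    if lo <= k < c:
        return (k - lo) / (c - lo)
    if c < k <= hi:
        return (hi - k) / (hi - c)
    return 0.0

def ssa_filterbank_frame(y_ds_pow, z_nbf_pow, alpha, beta, gamma, edges):
    """Filter bank outputs m(l) for one frame.

    y_ds_pow[k] = |Y_DS(k)|^2 and z_nbf_pow[k] = |Z_NBF(k)|^2.
    alpha holds one value per filter; beta and gamma are scalars.
    """
    outputs = []
    for l in range(1, len(edges) - 1):
        lo, c, hi = edges[l - 1], edges[l], edges[l + 1]
        total = 0.0
        for k in range(lo, hi + 1):
            diff = y_ds_pow[k] - alpha[l - 1] * beta * z_nbf_pow[k]
            if diff >= 0.0:
                term = math.sqrt(diff)                   # subtraction branch
            else:
                term = gamma * math.sqrt(y_ds_pow[k])    # flooring branch
            total += triangle(k, lo, c, hi) * term
        outputs.append(total)
    return outputs

if __name__ == "__main__":
    L, M, fs = 24, 512, 16000          # illustrative values
    edges = filter_edges(L, M, fs)
    y = [1.0] * (M // 2 + 1)           # toy power spectra for one frame
    z = [0.5] * (M // 2 + 1)
    m = ssa_filterbank_frame(y, z, [1.0] * L, beta=1.0, gamma=0.2, edges=edges)
    print(len(m))
```

Because only one alpha value per filter is needed, the number of parameters to estimate equals the number of filter bank channels, regardless of the DFT size.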
The expression is switched between the two cases in Eq. (4) depending on the condition. In the first case, $m(l, \tau)$ is a function of the subtraction coefficient $\beta$ and the parameter $\alpha(l)$, which is determined during a speech break. If, on the other hand, the subtracted power spectrum takes a negative value, $m(l, \tau)$ is obtained by flooring, where $\gamma$ is the flooring coefficient.

Since common speech recognition is not very sensitive to phase information, the proposed SSA is applicable to speech recognition. GJ requires adaptive learning of FIR filters with thousands or millions of taps. In contrast, the number of filter bank channels $L$ is generally set to 24, so the proposed method optimizes only 24 parameters. Moreover, because the proposed method is performed in the mel-scale filter bank domain, the transform into MFCC is straightforward:

\[ \mathrm{MFCC}(i, \tau) = \sqrt{\frac{2}{L}} \sum_{l=1}^{L} \log\{m(l, \tau)\} \cos\left\{ \left(l - \frac{1}{2}\right) \frac{i\pi}{L} \right\}. \tag{5} \]

2.2. Acoustic model improvement for SSA

In the speech recognizer, conventional studies have used a simple acoustic model trained under noise-free and distortion-free conditions, the so-called clean model. However, the following problems arise, especially in SSA applications:
(a) SSA's noise reduction performance is limited, and consequently some residual noise components remain in the SSA output.
(b) SSA involves highly nonlinear signal processing, which introduces sound distortion in the enhanced target speech.
These deviations from ideal clean conditions cause a mismatch with the acoustic model in the speech recognizer, which heavily degrades the word recognition score. To solve this problem, we propose to construct a noise- and distortion-specific acoustic model that reflects the conditions expected in actual use of SSA. The acoustic model is trained by taking into account the characteristics of the sound distortion and residual noise, which is done by applying SSA to the input speech.
This strategy makes the distortions introduced by SSA at recognition time agree with the acoustic model

trained in advance. Figure 2 depicts an overview of the proposed method for creating the SSA-matched acoustic model. The concrete processing flow is as follows:
1. We convolve the speech database with an impulse response measured beforehand with the microphone array.
2. We superimpose a fixed amount of exterior noise on the speech data produced in step 1.
3. We apply SSA to the output of step 2, using the case expression in Eq. (4). After the subtraction, we likewise perform known-noise superimposition [5].
4. We train on the result with the EM algorithm to obtain the SSA-matched acoustic model.
5. We apply SSA to the input evaluation speech and perform known-noise superimposition after the spectral subtraction, again according to the case expression. We then perform speech recognition with the SSA-matched model obtained in step 4.

3. Experiments and results

In this section, we evaluate the proposed noise- and reverberation-robust speech recognition.

3.1. Conditions

Figures 3 and 4 show the layouts of the reverberant rooms used in the experiment, and Table 1 shows the experimental conditions.

Table 1: Experimental conditions

Database                   JNAS [6], 260 speakers (150 sentences / 1 speaker)
Task                       20k newspaper dictation
User angle                 0°
Noise                      Cleaner
Noise angle                60°
Noise superimposition      10 dB
Acoustic model             PTM [7] (2000 states, 64 mixtures): clean, matched, clean & reverberation, matched & reverberation
Decoder                    Julius ver. 3.4.2 [8]
Filter size                32 ms (512 taps)
Frame size                 25 ms (400 samples)
Sampling frequency         16 kHz
Known noise                Office room noise
Amount of superimposition  30 dB
β (SSA)                    0.5, 1.0, 2.0
γ (SSA)                    0.2

Figure 3: Layout of the reverberant room (reverberation time: 200 ms) used in the experiments.

Figure 4: Layout of the reverberant room (reverberation time: 260 ms, number of microphones: 8) used in the experiments.
In the experiment, we use the following signals as test data: the original speech convolved with impulse responses recorded in the actual environment, to which exterior noise and the office noise present in the actual environment are added. We compare a clean model, a reverberation and known-noise matched model, a reverberation and exterior-noise matched model, a Delay-and-Sum (DS) matched model, and an SSA-matched model. Each model is constructed under both the 260-ms and the 200-ms reverberation conditions. Julius is used as the decoder. The acoustic models are phonetic tied-mixture (PTM) [7] models built from the JNAS database. The evaluation task is the JNAS newspaper dictation task with a 20k vocabulary. The baseline speaker-independent acoustic models are trained on data from 260 training speakers in the JNAS speech database (39000 sentence utterances in total).
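The data construction described above (convolution with a measured impulse response, then noise superimposition at a fixed level) can be sketched as follows. This is a minimal illustration in dependency-free Python, not the authors' tooling. The function names are hypothetical, the signals are toy data, and the noise level is assumed to be specified as a signal-to-noise ratio in dB relative to the reverberant speech.

```python
import math

def convolve(signal, impulse_response):
    """Full linear convolution of a signal with a room impulse response."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out

def power(x):
    """Mean power of a sequence of samples."""
    return sum(v * v for v in x) / len(x)

def add_noise_at_snr(speech, noise, snr_db):
    """Scale the noise so that 10*log10(P_speech / P_noise) equals snr_db, then add it."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    segment = noise[:len(speech)]
    gain = math.sqrt(power(speech) / (power(segment) * 10.0 ** (snr_db / 10.0)))
    return [s + gain * v for s, v in zip(speech, segment)]

if __name__ == "__main__":
    # Toy stand-ins for a clean utterance, a measured impulse response,
    # and a recorded noise signal.
    clean = [math.sin(0.1 * n) for n in range(1000)]
    impulse_response = [1.0, 0.0, 0.4, 0.0, 0.1]
    exterior_noise = [math.cos(0.37 * n) for n in range(2000)]

    reverberant = convolve(clean, impulse_response)
    noisy = add_noise_at_snr(reverberant, exterior_noise, snr_db=10.0)
    print(len(noisy))
```

The same helper with snr_db=30.0 would illustrate the level listed for known-noise superimposition in Table 1, although in the proposed flow that step is applied after SSA rather than to the raw waveform.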

The test set consists of another 46 speakers from JNAS. Each test speaker utters 4 or 5 newspaper article sentences (200 test utterances in total). The distance between the microphone array and the loudspeakers is 1.0 m. The experimental conditions are summarized in Table 1.

3.2. Results

First, we compare acoustic models built for the same environment, namely the clean model, the reverberation and known-noise matched model, the reverberation and exterior-noise matched model, the Delay-and-Sum (DS) matched model, and the SSA-matched model, on the basis of word accuracy. Figure 5 shows the word accuracy of all acoustic models with SSA. The SSA-matched acoustic model shows higher recognition performance than the other models, which take residual noise and distortion effects into account less or not at all. These results suggest that word accuracy depends on how many of the characteristics arising in actual use of SSA are reflected in the model, and that the proposed SSA-matched acoustic model matches them well. The subtraction parameter of SSA made little difference.

Figure 6 shows the word accuracy of models under mismatched environments with SSA. The SSA-matched model gave the best result. In addition, the SSA-matched acoustic model remains a good model for recognition even if the acoustic environment changes. Again, the subtraction parameter of SSA made little difference.

Figure 5: Result of SSA matched model.

Figure 6: Result of SSA matched model and mismatched model.

(Legend entries recoverable from the plots: SSA Matched Model, β = 1.0; SSA Matched Model, β = 2.0.)

4. Conclusion

In this paper, to address the acoustic model problem in the nonlinear array signal processing of SSA, we proposed constructing an SSA-matched acoustic model. We evaluated the model experimentally and demonstrated its effectiveness: the SSA-matched acoustic model provided higher word accuracy than the other conventional models. We also showed its robustness against environment changes.

Acknowledgments

Part of this work is supported by the MEXT (Ministry of Education, Culture, Sports, Science and Technology) e-Society leading project.

References

[1] H. Saruwatari, S. Kurita, K. Takeda, F. Itakura, T. Nishikawa, K. Shikano, "Blind source separation combining independent component analysis and beamforming," EURASIP Journal on Applied Signal Processing, vol.2003, no.11, pp.1135-1146, 2003.
[2] L. J. Griffiths and C. W. Jim, "An alternative approach to linearly constrained adaptive beamforming," IEEE Trans. Antennas Propagation, vol.30, no.1, pp.27-34, 1982.
[3] Y. Ohashi, T. Nishikawa, H. Saruwatari, A. Lee, K. Shikano, "Noise-robust hands-free speech recognition based on spatial subtraction array and known noise superimposition," Proc. IEEE/RSJ International Conference on Intelligent Robots and Systems, pp.533-537, 2005.
[4] S. B. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Trans. Acoustics, Speech, Signal Proc., vol.ASSP-28, no.4, pp.357-366, 1982.
[5] S. Yamade, A. Lee, H. Saruwatari, and K. Shikano, "Unsupervised speaker adaptation based on HMM sufficient statistics in various noisy environments," Proc. EUROSPEECH, pp.II-1493-1496, 2003.
[6] K. Ito, M. Yamamoto, K. Takeda, T. Takezawa, T. Matsuoka, T. Kobayashi, K. Shikano, and S. Itahashi, "JNAS: Japanese Speech Corpus for Large Vocabulary Continuous Speech Recognition Research," Journal of the Acoustical Society of Japan (E), vol.20, pp.199-206, 1999.
[7] A. Lee, T. Kawahara, K. Takeda, K. Shikano, "A new phonetic tied-mixture model for efficient decoding," Proc. ICASSP, vol.III, pp.1269-1272, 2000.
[8] A. Lee, T. Kawahara, and K. Shikano, "Julius - An open source real-time large vocabulary recognition engine," Proc. EUROSPEECH, pp.1691-1694, 2001.
