Frequency Domain Semi-Blind Signal Separation: Application to the Rejection of Internal Noises


Jani Even, Hiroshi Saruwatari, Kiyohiro Shikano
Graduate School of Information Science, Nara Institute of Science and Technology, Ikoma, Nara, Japan

ABSTRACT

Recently, methods using blind signal separation were proposed to separate the signals received by a microphone array. In this paper, we propose a new frequency domain semi-blind source separation method that replaces the blind source separation method when additional information on some of the signals can be obtained. This is of particular interest in situations such as hands-free speech recognition, where the blind separation has to work on a limited amount of data in a challenging environment. The proposed method incorporates references to some of the signals that are obtained by additional sensors. Experimental results show that the proposed method incorporates the additional information efficiently and that the performance is improved in terms of SNR and word accuracy in a speech recognition task.

Index Terms: Semi-blind signal separation, speech processing

1. INTRODUCTION

Nowadays, communicating with machines is usually not natural and requires some adaptation or training. In order to improve the usability of these machines and reduce the burden for the users, it is important to recreate the natural human communication interface: speech. The most difficult task is to give machines the ability to listen. Speech recognition works well if we use a microphone close to the user's mouth, but this is not a natural interface and not a convenient one in many situations. For these reasons, the focus is now on hands-free speech recognition. In hands-free speech recognition, the user's voice is picked up at a distance by a microphone array, making a more natural interface with the machine. However, the cost is that noise and reverberation deteriorate the received speech quality. Hence it is necessary to improve the quality of the received speech before speech recognition is performed.

In order to deal with the noise, blind signal separation (BSS) based techniques are strong candidates for processing the multidimensional observation given by microphone arrays (see review paper [1]). The goal of BSS is to separate the observed signal into its different components. Ideally, receiving the user's speech contaminated with noise, we would recover the speech and the noise separately. The frequency domain approach, referred to as FD-BSS, is of especially great interest since the convolutive mixture modeling the reverberant environment can be efficiently processed in the frequency domain. However, this is still a challenging task in a real environment where the number of interfering noise signals is large and the amount of data is limited.

In this paper, we consider the case where some additional information is available. For example, consider a navigation system in a car with a hands-free speech recognition interface that uses FD-BSS to improve the received speech. If the driver listens to music, then the system should use information from the music player to obtain better performance. This is a semi-blind approach because a reference to one of the signals is available. Note that the system knows what music was emitted but still has to determine the received music in the observed speech.

Figure 1 illustrates the situation. The hands-free speech recognition system uses a microphone array that picks up the user's speech and noises. The noises are composed of the exterior noises and the interior noises. The interior noises are the noises for which references are available.

Fig. 1. General situation.

In a real environment, it seems necessary to exploit all available information to improve the separation. For this purpose, we propose a semi-blind signal separation method that operates in the frequency domain in order to replace the FD-BSS approach. After presenting the new method, its performance is compared to that of the blind approach in a realistic environment.

1-4244-1484-9/08/$25.00 ©2008 IEEE. ICASSP 2008.

2. FREQUENCY DOMAIN BLIND SIGNAL SEPARATION

In acoustics, the observed signals received by a microphone array in a reverberant environment are convolutive mixtures of some signals emitted from different locations. The goal of BSS is to recover the emitted signals knowing only the observed mixtures. In the frequency domain approach to BSS, a short time Fourier transform (STFT) is applied to the observed signals to get the frequency domain observations. Then the observed signals at the $f$th frequency bin are

$$V(f,t) = A(f)\,S(f,t) \tag{1}$$

where the $n \times n$ matrix $A(f)$ represents the mixture, $S(f,t)$ are the emitted signals at the $f$th frequency bin and $t$ denotes the frame index. Consequently, when using an $F$-point analysis frame for the STFT, the convolutive mixture is replaced by $F$ instantaneous mixtures and the goal is to estimate the components of the emitted signals $S(f,t)$ in each frequency bin. In the $f$th frequency bin, the estimates are obtained by applying an unmixing matrix $B(f)$ to the observed signals (see Fig. 2)

$$Y(f,t) = B(f)\,V(f,t) = B(f)\,A(f)\,S(f,t). \tag{2}$$

Fig. 2. Mixture and blind separation at frequency bin $f$.

A usual assumption in BSS is that the components of $S(f,t)$ are statistically independent in each frequency bin. Then the components of $Y(f,t)$ are statistically independent if and only if $B(f)$ is such that

$$Y(f,t) = P(f)\,\Lambda(f)\,S(f,t) \tag{3}$$

where $P(f)$ is an $n \times n$ permutation matrix and $\Lambda(f)$ is a diagonal $n \times n$ matrix [2]. As a consequence, in each frequency bin, it is possible to recover the components of $S(f,t)$ up to a scale and permutation indeterminacy by finding the unmixing matrix that gives statistically independent components. This problem is often referred to as independent component analysis (ICA). To complete the separation, it is necessary to match the components belonging to the same signal across all the frequency bins before applying the inverse STFT; otherwise the time domain signals are still mixtures of the desired signals.

Since our proposed semi-blind method is derived from the iterative INFOMAX method [3], we briefly present this method (see review [1] for references to other methods). In the frequency bin $f$ at the $k$th iteration, the separation equation is

$$Y^{(k)}(f,t) = B^{(k)}(f)\,V(f,t).$$

The mutual information of $Y^{(k)}(f,t)$ is minimized by updating the matrix $B(f)$ with the following rule (the frequency and frame indexes are dropped due to space limitation)

$$B^{(k+1)} = B^{(k)} + \mu \left( I - \langle \phi(Y^{(k)})\, Y^{(k)H} \rangle_t \right) B^{(k)} \tag{4}$$

where $\langle \cdot \rangle_t$ denotes frame averaging and $\phi(\cdot)$ denotes the vector of score functions. For $Y = [y_1, \ldots, y_p]^T$ this vector is defined by

$$\phi(Y) = [\phi(y_1), \ldots, \phi(y_p)]^T = \left[ -\frac{\partial}{\partial y_1} \log p_{y_1}(y_1), \ldots, -\frac{\partial}{\partial y_p} \log p_{y_p}(y_p) \right]^T$$

where $p_{y_i}(y_i)$ is the probability density function of $y_i$. In practice the score functions are unknown and should be estimated from the data, unless prior knowledge of the signal densities is available.

3. PROPOSED METHOD

3.1. Block structure

The goal of the proposed semi-blind approach is also to recover some unknown signals when only some mixtures of these signals are available. However, contrary to the fully blind separation case, we are also given additional information about the observed mixtures. We know that the mixing process has the following block structure

$$\begin{bmatrix} X(f,t) \\ R(f,t) \end{bmatrix} = \begin{bmatrix} A(f) & B(f) \\ 0 & C(f) \end{bmatrix} \begin{bmatrix} S(f,t) \\ N(f,t) \end{bmatrix}. \tag{5}$$

The observed signals and the sources are both partitioned into two vectors. The first part of the observations $X(f,t)$, of size $(p \times T)$ with $T$ the number of frames, is a mixture of both $S(f,t)$ $(p \times T)$ and $N(f,t)$ $(q \times T)$, whereas the second part of the observations $R(f,t)$ $(q \times T)$ is only a function of $N(f,t)$. This structure corresponds to the situation described in Fig. 1, with $p$ external signals and $q$ internal noises. In the following we use the term references for $R(f,t)$ and the term observations for $X(f,t)$. A diagram of the mixing is given in Fig. 3. The proposed demixer has a block structure of dimensions compatible with the matrices $A(f)$, $B(f)$ and $C(f)$

$$\begin{bmatrix} Y(f,t) \\ Q(f,t) \end{bmatrix} = \begin{bmatrix} W_1(f) & W_2(f) \\ 0 & W_3(f) \end{bmatrix} \begin{bmatrix} X(f,t) \\ R(f,t) \end{bmatrix}.$$

Compared to the blind problem of the same dimension, the number of coefficients to update is reduced. Using the results in [2] presented in Sect. 2, the components of $Y(f,t)$ and $Q(f,t)$ are all statistically independent if and only if the matrices $W_1(f)$, $W_2(f)$ and $W_3(f)$ are such that

$$\begin{bmatrix} W_1(f) & W_2(f) \\ 0 & W_3(f) \end{bmatrix} \begin{bmatrix} A(f) & B(f) \\ 0 & C(f) \end{bmatrix} = \begin{bmatrix} P_1(f)\Lambda_1(f) & 0 \\ 0 & P_2(f)\Lambda_2(f) \end{bmatrix}.$$
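As a worked step that is not spelled out in the text, the product of the block-triangular demixer and the block-triangular mixing matrix can be expanded with ordinary block-matrix algebra. The symbols are those of the block structure (5) and of the demixer; the frequency index is dropped for readability.

```latex
\begin{bmatrix} W_1 & W_2 \\ 0 & W_3 \end{bmatrix}
\begin{bmatrix} A & B \\ 0 & C \end{bmatrix}
=
\begin{bmatrix} W_1 A & W_1 B + W_2 C \\ 0 & W_3 C \end{bmatrix}
```

Read this way, the independence condition asks that $W_1 A$ and $W_3 C$ each reduce to a scaled permutation, and that $W_1 B + W_2 C = 0$, so that $W_2$ cancels the contribution of the internal noises in the upper block. The lower-left block is zero by construction, which is why only three matrices need to be updated.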

In the independence condition on $W_1(f)$, $W_2(f)$ and $W_3(f)$, $P_1(f)$ $(p \times p)$ and $P_2(f)$ $(q \times q)$ are permutation matrices and $\Lambda_1(f)$ $(p \times p)$ and $\Lambda_2(f)$ $(q \times q)$ are diagonal matrices. Consequently it is possible to estimate the components of $S(f,t)$ and $N(f,t)$ by updating $W_1(f)$, $W_2(f)$ and $W_3(f)$ until the components of $Y(f,t)$ and $Q(f,t)$ are all statistically independent. (Note that an echo canceler [4] cancels the contribution of $N(f,t)$ in $X(f,t)$ but does not recover $S(f,t)$.)

Fig. 3. Block structure of the mixture.

3.2. Proposed algorithm

The proposed semi-blind separation method uses the mutual information of $Y(f,t)$ and $Q(f,t)$ to measure the statistical independence of their components. The criterion is optimized by an iterative gradient descent on the matrices $W_1(f)$, $W_2(f)$ and $W_3(f)$. At iteration $k$, we have the following unmixing system

$$\begin{bmatrix} Y^{(k)}(f,t) \\ Q^{(k)}(f,t) \end{bmatrix} = \begin{bmatrix} W_1^{(k)}(f) & W_2^{(k)}(f) \\ 0 & W_3^{(k)}(f) \end{bmatrix} \begin{bmatrix} X(f,t) \\ R(f,t) \end{bmatrix}.$$

To obtain the update rules for these matrices, we rewrite the update rule of the blind case, eq. (4), with the proposed demixer structure

$$\begin{bmatrix} W_1^{(k+1)}(f) & W_2^{(k+1)}(f) \\ 0 & W_3^{(k+1)}(f) \end{bmatrix} = \begin{bmatrix} W_1^{(k)}(f) & W_2^{(k)}(f) \\ 0 & W_3^{(k)}(f) \end{bmatrix} + \mu \left( I_{p+q} - \left\langle \begin{bmatrix} \phi(Y^{(k)}(f,t)) \\ \phi(Q^{(k)}(f,t)) \end{bmatrix} \begin{bmatrix} Y^{(k)}(f,t) \\ Q^{(k)}(f,t) \end{bmatrix}^H \right\rangle_t \right) \begin{bmatrix} W_1^{(k)}(f) & W_2^{(k)}(f) \\ 0 & W_3^{(k)}(f) \end{bmatrix}.$$

Then the update rules for the matrices $W_1(f)$, $W_2(f)$ and $W_3(f)$ are extracted. (A semi-blind method for instantaneous mixtures in the time domain uses the same approach to get the update rules in [5].) The update rules for the matrices have the following form

$$W_i^{(k+1)}(f) = W_i^{(k)}(f) + \mu\, \Delta W_i^{(k)}(f)$$

where (dropping the frequency and frame indexes for $Y(f,t)$ and $Q(f,t)$)

$$\Delta W_1^{(k)}(f) = \left( I - \langle \phi(Y^{(k)})\, Y^{(k)H} \rangle_t \right) W_1^{(k)}(f)$$

$$\Delta W_2^{(k)}(f) = \left( I - \langle \phi(Y^{(k)})\, Y^{(k)H} \rangle_t \right) W_2^{(k)}(f) - \langle \phi(Y^{(k)})\, Q^{(k)H} \rangle_t\, W_3^{(k)}(f)$$

$$\Delta W_3^{(k)}(f) = \left( I - \langle \phi(Q^{(k)})\, Q^{(k)H} \rangle_t \right) W_3^{(k)}(f).$$

The frequency domain signals are approximately circular because they were obtained by an STFT. For a circular random variable $y = |y| e^{i \arg y}$ we have

$$\phi(y) = \phi(|y|)\, e^{i \arg y}.$$

Thus the unknown score functions can be estimated from the data using a kernel based estimate of the score function of their modulus. After the semi-blind separation is performed in all the frequency bins, the permutation resolution is also simplified because of the block structure.

4. EXPERIMENTAL RESULTS

To demonstrate the importance of the internal noise reference, we performed some experiments mixing the noise recorded in a train station as external noise and a synthetic non-stationary noise as internal noise. The impulse response of the train station hall was also measured for a speaker at 50 cm in front of a four-microphone array (inter-microphone spacing is 2.15 cm). 200 Japanese sentences of different lengths were used as speech signals (2 s to 14 s at 16 kHz from the JNAS database [6]).

The observed signals are obtained in two steps. First, a speech signal convolved with the impulse response is mixed with the recorded noise. The SNR in this mixture is called SNR ext. Then the mixed speech and external noise is mixed with the internal noise that is filtered by a low pass filter. The SNR for this second mixture is SNR int. We also filter the internal noise to obtain the reference.

In all experiments we compared the iterative INFOMAX approach (blind) to the proposed approach (semi-blind). The STFT is performed with a 512-point Hanning window with a 256-point overlap. The matrices $B(f)$ are initialized to identity in all frequency bins, then 200 iterations are performed with an adaptation step $\mu = 0.1$. The speech signal is selected out of the separated components in all the frequency bins using the same method for both approaches. The INFOMAX method considers the reference signal as a fifth observation, so both algorithms have the same amount of statistical information. The only difference is that the semi-blind approach knows that the mixture has the block structure shown in Fig. 3.

The estimation quality is measured in terms of the noise reduction rate (NRR).
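With the settings used in the experiments (identity initialization, step $\mu = 0.1$, four microphones and one reference), the update rules of Section 3.2 can be sketched for a single frequency bin as follows. This is a minimal illustration, not the authors' implementation: the function names `score` and `semi_blind_step` are hypothetical, and the `tanh` nonlinearity is an assumed stand-in for the kernel based score estimate described in the text.

```python
import numpy as np


def score(z):
    """Score for circular variables: phi(z) = phi(|z|) * exp(i * arg z).

    ASSUMPTION: phi(|z|) = tanh(|z|) replaces the kernel based estimate
    of the modulus score function used in the paper.
    """
    return np.tanh(np.abs(z)) * np.exp(1j * np.angle(z))


def semi_blind_step(W1, W2, W3, X, R, mu=0.1):
    """One iteration of the block-structured update in one frequency bin.

    X: (p, T) observations, R: (q, T) references, T = number of frames.
    Returns the updated (W1, W2, W3).
    """
    T = X.shape[1]
    Y = W1 @ X + W2 @ R   # estimates of speech and external noises
    Q = W3 @ R            # estimates of internal noises (references only)

    Ip = np.eye(W1.shape[0])
    Iq = np.eye(W3.shape[0])
    Cyy = score(Y) @ Y.conj().T / T   # <phi(Y) Y^H>_t
    Cyq = score(Y) @ Q.conj().T / T   # <phi(Y) Q^H>_t
    Cqq = score(Q) @ Q.conj().T / T   # <phi(Q) Q^H>_t

    dW1 = (Ip - Cyy) @ W1
    dW2 = (Ip - Cyy) @ W2 - Cyq @ W3
    dW3 = (Iq - Cqq) @ W3
    return W1 + mu * dW1, W2 + mu * dW2, W3 + mu * dW3


if __name__ == "__main__":
    # Synthetic single-bin example (all values are illustrative).
    rng = np.random.default_rng(0)
    p, q, T = 4, 1, 2000
    S = rng.laplace(size=(p, T)) + 1j * rng.laplace(size=(p, T))
    N = rng.laplace(size=(q, T)) + 1j * rng.laplace(size=(q, T))
    A = rng.standard_normal((p, p)) + 1j * rng.standard_normal((p, p))
    B = rng.standard_normal((p, q)) + 1j * rng.standard_normal((p, q))
    C = rng.standard_normal((q, q)) + 1j * rng.standard_normal((q, q))
    X, R = A @ S + B @ N, C @ N   # block-structured mixture

    W1 = np.eye(p, dtype=complex)
    W2 = np.zeros((p, q), dtype=complex)
    W3 = np.eye(q, dtype=complex)
    for _ in range(200):
        W1, W2, W3 = semi_blind_step(W1, W2, W3, X, R, mu=0.1)

    # Residual leakage of the internal noise into Y: W1 B + W2 C.
    print("leakage norm:", np.linalg.norm(W1 @ B + W2 @ C))
```

The lower-left block of the demixer is never formed, so the block structure is preserved by construction. The updates of `W1` and `W2` coincide with the upper blocks of the blind rule (4) applied to the stacked demixer, which is what makes the comparison with the blind method a fair one.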

The NRR is the difference between the SNR of the speech estimates (after processing) and the SNR of the observations (before processing). Consequently, a positive NRR means that the speech estimate quality is improved. Figures 4(a), (b) and (c) show the NRR for mixtures at different SNRs (averaged over the 200 test signals). The second measure of performance is the word accuracy for a continuous speech recognition task. The speech recognition conditions are given in Table 1 and the results in Figs. 4(d), (e) and (f).

Fig. 4. Noise reduction rate (NRR) and word accuracy for different SNRs. Panels (a), (b), (c): SNR ext. 10 dB, 15 dB, 20 dB. Panels (d), (e), (f): SNR ext. 10 dB, 15 dB, 20 dB.

The blind method is able to improve the speech signal, but using the block structure gives the advantage to the semi-blind method when the number of iterations is limited. The performance of the blind method would increase if the number of iterations were larger, but in a real situation computation time is limited. The performance difference is also larger for the shorter sentences.

Table 1. Conditions for speech recognition

Task                    | 20k word newspaper dictation
Acoustic model          | phonetic tied mixture, clean model [7]
Acoustic model training | 260 speakers (150 sentences/speaker)
Decoder                 | JULIUS ver 3.2 [7]

5. CONCLUSION

In this paper we proposed a semi-blind separation approach that operates in the frequency domain. The method easily incorporates the information given by additional sensors into the BSS based approach. Experiments showed that this can be very beneficial in a hands-free speech recognition scenario.

6. REFERENCES

[1] M.S. Pedersen et al., "A survey of convolutive blind source separation methods," Springer Handbook on Speech Communication, 2007.

[2] P. Comon, "Independent component analysis, a new concept?," Signal Processing, vol. 36, pp. 287-314, 1994.

[3] A. J. Bell and T. J. Sejnowski, "An information maximization approach to blind separation and blind deconvolution," Neural Computation, vol. 7, no. 6, pp. 1129-1159, 1995.

[4] J. Benesty et al., "A better understanding and an improved solution to the specific problems of stereophonic acoustic echo cancellation," IEEE Trans. Speech Audio Processing, vol. 6, pp. 156-165, 1998.

[5] M. Joho et al., "Combined blind/nonblind source separation based on the natural gradient," IEEE Signal Processing Letters, vol. 8, no. 8, pp. 236-238, 2001.

[6] K. Ito et al., "JNAS: Japanese speech corpus for large vocabulary continuous speech recognition research," The Journal of Acoustical Society of Japan, vol. 20, pp. 199-206, 1999.

[7] A. Lee et al., "Julius - an open source real-time large vocabulary recognition engine," EUROSPEECH, pp. 1691-1694, 2001.
