Noise-robust voice activity detection with state transition processes of both speech and noise


(1) IPSJ SIG Technical Reports, 2006.

A noise robust voice activity detection with state transition processes of speech and noise

Masakiyo FUJIMOTO†, Kentaro ISHIZUKA†, and Hiroko KATO†

† NTT Communication Science Laboratories, NTT Corp., 2-4, Hikaridai, Seika-cho, Souraku-gun, Kyoto, 619-0237, Japan
E-mail: {masakiyo, ishizuka, katohi}@cslab.kecl.ny.co.jp

Abstract. This paper proposes a noise robust voice activity detection method with state transition processes of speech and noise. The proposed method constructs a clean speech / silence state transition model beforehand and, as the observed signal is given, sequentially adapts the model to the noise environment by parallel non-linear Kalman filtering. Speech / non-speech discrimination is carried out by calculating the likelihood ratio of a speech (clean speech + noise) state to a non-speech (silence + noise) state with the adapted model. In addition, backward techniques, i.e., a parallel Kalman smoother and backward probability estimation, are used for the noise estimation and for the likelihood ratio calculation.

Key words. voice activity detection, non-stationary noise, state transition process, parallel non-linear Kalman filter / smoother, forward-backward estimation

[Body text garbled in extraction. Recoverable terms: Voice Activity Detection (VAD), Signal to Noise Ratio (SNR).]

(2) [Body text garbled in extraction. Recoverable items: section numbers 2, 2.1, and 2.2; repeated references to the VAD method of Sohn [5]; SNR; GMM (Gaussian Mixture Model); Ergodic.]

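(3) [Figure residue: block diagram with the labels "Noise model", "Composition", and "Noisy speech model".]

[Body text garbled in extraction. The equations below are the ones that can be recovered; equation numbers are given only where they are legible.]

State transition probability and output probability of the two hypotheses $H_0$ (non-speech) and $H_1$ (speech):

$$a_{i,j} = p(q_t = H_j \mid q_{t-1} = H_i), \qquad p(O_t \mid q_t = H_j) = b_j(O_t)$$

Likelihood ratio and decision rule:

$$R_t = \alpha_{1,t} / \alpha_{0,t}$$

$$H_0:\ R_t < \mathrm{Threshold}, \qquad H_1:\ R_t \ge \mathrm{Threshold}$$

3. / 3.1. [section titles illegible]

With the noise sequence $N_{0:t} = \{N_0, \ldots, N_t\}$:

$$p(q_t \mid O_{0:t}, N_{0:t}) = \frac{p(O_{0:t}, q_t, N_{0:t})}{p(O_{0:t}, N_{0:t})} \propto p(O_{0:t}, q_t, N_{0:t}) \qquad (5)$$

$$p(O_{0:t}, q_t, N_{0:t}) = \sum_{q_{t-1}} p(q_t \mid q_{t-1})\, p(N_t \mid N_{t-1})\, p(O_t \mid q_t, N_t)\, p(O_{0:t-1}, q_{t-1}, N_{0:t-1}) \qquad (7)$$

$$p(O_{0:t}, N_{0:t}) = p(N_t \mid N_{t-1})\, p(O_t \mid N_t)\, p(O_{0:t-1}, N_{0:t-1})$$

Writing $p(N_t \mid N_{t-1}) = c_{t,t-1}$, the forward probability becomes

$$\alpha_{j,t} = \Big[ \sum_{i} \alpha_{i,t-1}\, a_{i,j} \Big]\, b_{j,N_t}(O_t)\, c_{t,t-1} \qquad (8)$$

with $c_{t,t-1} = 1$.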
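As a minimal sketch of the forward recursion and the threshold test on the likelihood ratio $R_t = \alpha_{1,t} / \alpha_{0,t}$, the Python code below propagates the two state probabilities ($H_0$: non-speech, $H_1$: speech) frame by frame. It assumes that the per-frame output likelihoods of both states are already available, takes the noise transition term $c_{t,t-1}$ as 1, and rescales the forward probabilities at every frame for numerical stability. The function names, the initial values, the ordering of the transition matrix, and the toy likelihoods are illustrative assumptions, not taken from the paper.

```python
def forward_likelihood_ratios(emissions, trans, init=(1.0, 0.0)):
    """Return R_t = alpha_{1,t} / alpha_{0,t} for every frame.

    emissions: list of (b0, b1), likelihood of frame t under H0 and H1
               (assumed to be strictly positive).
    trans:     trans[i][j] = p(q_t = H_j | q_{t-1} = H_i).
    init:      (alpha_0, alpha_1) before the first frame (an assumption).
    """
    alpha = list(init)
    ratios = []
    for b in emissions:
        # alpha_{j,t} = [sum_i alpha_{i,t-1} a_{i,j}] * b_j(O_t), with c = 1
        new = [
            (alpha[0] * trans[0][j] + alpha[1] * trans[1][j]) * b[j]
            for j in (0, 1)
        ]
        scale = new[0] + new[1]  # common scale, cancels in the ratio
        alpha = [v / scale for v in new]
        ratios.append(alpha[1] / alpha[0])
    return ratios


def decide(ratios, threshold=1.0):
    """H1 (speech, 1) if R_t >= threshold, otherwise H0 (non-speech, 0)."""
    return [1 if r >= threshold else 0 for r in ratios]


# Toy example: transition values read row-wise as a00, a01, a10, a11 (assumed).
trans = [[0.8, 0.2], [0.1, 0.9]]
emissions = [(0.9, 0.1), (0.2, 0.8), (0.1, 0.9)]
ratios = forward_likelihood_ratios(emissions, trans)
print(decide(ratios))  # [0, 1, 1]
```

Because both forward probabilities of a frame are divided by the same scale factor, the rescaling leaves $R_t$ unchanged while preventing underflow on long signals.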

(4) [Equations (11) to (31) and the opening of Section 3.2 are garbled in extraction. Recoverable items follow.]

Relation between the observed signal $O_{t,i}$, the clean speech $S_{t,i}$, and the noise $N_{t,i}$:

$$O_{t,i} = S_{t,i} + \log\left(1 + \exp\left(N_{t,i} - S_{t,i}\right)\right) \qquad (13)$$

Other legible fragments: a partial derivative with respect to $N_t$; GMM $j$ with mixture components $k = 1, \ldots, K$; the term $(O_{t,i} - \mu_{O_{t|t-1},i,k(j)})$ in Eq. (22); the output probability $b_{j,N_t}(O_t)$; the word "Switching".
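The log-domain relation $O = S + \log(1 + \exp(N - S))$ between clean speech $S$, noise $N$, and the observation $O$ can be evaluated as in the sketch below. The function names are hypothetical. The derivative with respect to $N$ is included only under the assumption that the non-linear Kalman filter linearizes this relation to first order around the predicted noise; the garbled text shows a partial derivative with respect to $N_t$ but not its exact use.

```python
import math


def noisy_log_spectrum(s, n):
    """O = S + log(1 + exp(N - S)), computed as a stable log-sum-exp."""
    m = max(s, n)
    return m + math.log(math.exp(s - m) + math.exp(n - m))


def d_observation_d_noise(s, n):
    """dO/dN = exp(N - S) / (1 + exp(N - S)), a logistic function of N - S."""
    d = n - s
    if d >= 0:
        return 1.0 / (1.0 + math.exp(-d))
    e = math.exp(d)
    return e / (1.0 + e)


# When noise is far below speech the observation is close to the speech
# value and is almost insensitive to the noise.
print(noisy_log_spectrum(10.0, -5.0), d_observation_d_noise(10.0, -5.0))
```

The derivative lies between 0 and 1, so under this assumed linearization an observation dominated by speech carries little information about the noise, and one dominated by noise carries almost all of it.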

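(5) [Body text garbled in extraction. Recoverable content follows; equation numbers are given only where they are legible.]

[A parameter whose symbol is illegible] = 0.0001.

Backward estimation. The probability over the whole signal up to the final frame $T$ is factorized as

$$p(O_{0:t}, q_t, N_{0:t})\; p(O_{t+1:T}, N_{t+1:T} \mid q_t, N_t) \qquad (32)$$

where the second factor follows the backward recursion

$$p(O_{t+1:T}, N_{t+1:T} \mid q_t, N_t) = \sum_{q_{t+1}} p(q_{t+1} \mid q_t)\, p(N_{t+1} \mid N_t)\, p(O_{t+1} \mid q_{t+1}, N_{t+1})\, p(O_{t+2:T}, N_{t+2:T} \mid q_{t+1}, N_{t+1}) \qquad (33)$$

Defining $\beta_{j,t} = p(O_{t+1:T}, N_{t+1:T} \mid q_t = H_j, N_t)$ and $p(N_{t+1} \mid N_t) = c_{t+1,t}$, Eq. (33) is rewritten in the same way as Eq. (8):

$$\beta_{i,t} = \sum_{j} a_{i,j}\, b_{j,N_{t+1}}(O_{t+1})\, c_{t+1,t}\, \beta_{j,t+1} \qquad (34)$$

Since $p(O_{0:T}, q_t = H_j, N_{0:t}) = \alpha_{j,t} \cdot \beta_{j,t}$, the likelihood ratio becomes

$$R_t = \frac{p(O_{0:T}, q_t = H_1, N_{0:t})}{p(O_{0:T}, q_t = H_0, N_{0:t})} = \frac{\alpha_{1,t} \cdot \beta_{1,t}}{\alpha_{0,t} \cdot \beta_{0,t}} \qquad (35)$$

[Equations (36) to (39) are illegible; the surrounding fragments cite [9].]

4. / 4.1. [section titles illegible] Legible experimental details: the figure 2,292 (unit illegible); SNR values 0, 5, 10, 15; state transition probabilities $a_{i,j} = \{0.8, 0.2, 0.1, 0.9\}$. Performance is evaluated with the FAR (False Acceptance Rate) and the FRR (False Rejection Rate), and ROC (Receiver Operating Characteristics) curves are drawn by varying the threshold.

4.2 The proposed method (Proposed) is compared with the method of Sohn [5], LTSD (Long-Term Spectral Divergence) [3], ITU-T G.729 Annex B [10], ETSI ES 202 050 [11], and the previous method (Previous) [7].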
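The forward-backward likelihood ratio $R_t = (\alpha_{1,t} \beta_{1,t}) / (\alpha_{0,t} \beta_{0,t})$ can be sketched as follows. This is a plain two-state forward-backward pass under stated assumptions: the per-frame output likelihoods are given, the noise transition terms are taken as 1, the backward probabilities are initialized to 1 at the last frame, and both passes are rescaled per frame. The parallel Kalman smoother that the paper uses for the noise estimate is not modeled here, and all names and toy values are illustrative.

```python
def forward_backward_ratios(emissions, trans, init=(1.0, 0.0)):
    """Return R_t = (alpha_{1,t} beta_{1,t}) / (alpha_{0,t} beta_{0,t}).

    emissions: list of (b0, b1), likelihood of frame t under H0 and H1
               (assumed to be strictly positive).
    trans:     trans[i][j] = p(q_t = H_j | q_{t-1} = H_i).
    init:      (alpha_0, alpha_1) before the first frame (an assumption).
    """
    n_frames = len(emissions)

    # Forward pass with per-frame rescaling.
    alphas = []
    alpha = list(init)
    for b in emissions:
        new = [
            (alpha[0] * trans[0][j] + alpha[1] * trans[1][j]) * b[j]
            for j in (0, 1)
        ]
        scale = new[0] + new[1]
        alpha = [v / scale for v in new]
        alphas.append(alpha)

    # Backward pass: beta at the last frame is 1 for both states.
    betas = [[1.0, 1.0] for _ in range(n_frames)]
    for t in range(n_frames - 2, -1, -1):
        nxt = emissions[t + 1]
        new = [
            sum(trans[i][j] * nxt[j] * betas[t + 1][j] for j in (0, 1))
            for i in (0, 1)
        ]
        scale = new[0] + new[1]
        betas[t] = [v / scale for v in new]

    # Per-frame scale factors are shared by both states and cancel here.
    return [(a[1] * b[1]) / (a[0] * b[0]) for a, b in zip(alphas, betas)]


trans = [[0.8, 0.2], [0.1, 0.9]]  # read row-wise as a00, a01, a10, a11 (assumed)
emissions = [(0.9, 0.1), (0.2, 0.8), (0.1, 0.9)]
smoothed = forward_backward_ratios(emissions, trans)
print([1 if r >= 1.0 else 0 for r in smoothed])
```

In this toy example the backward term raises the ratio of the early frames because later frames favor speech, which illustrates how future observations can influence the decision for the current frame.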

(6) [Figure residue: ROC curves, horizontal axis "False acceptance rate", six panels: (a) Airport 0 dB, (b) Airport 5 dB, (c) Airport 10 dB, (d) Street 0 dB, (e) Street 5 dB, (f) Street 10 dB. Legend: Sohn, LTSD, G.729 Annex B, ETSI ES 202 050, Previous, Proposed. The caption is garbled in extraction.]

References

[1] Rabiner, L. R. and Sambur, M. R., "An algorithm for determining the endpoints of isolated utterances," The Bell System Technical Journal, Vol. 54, No. 2, pp. 297-315, Feb. 1975.
[2] Nemer, E., Goubran, R., and Mahmoud, S., "Robust voice activity detection using higher-order statistics in the LPC residual domain," IEEE Trans. on Speech and Audio Processing, Vol. 9, No. 3, pp. 217-231, March 2001.
[3] Ramirez, J., Segura, J. C., Benitez, C., de la Torre, A., and Rubio, A., "Efficient voice activity detection algorithm using long-term speech information," Speech Communication, Vol. 42, pp. 271-287, Apr. 2004.
[4] Ishizuka, K. and Kato, H., "A feature for voice activity detection derived from speech analysis with the exponential autoregressive model," Proc. of ICASSP '06, Toulouse, France, Vol. I, pp. 789-792, May 2006.
[5] Sohn, J., Kim, N. S., and Sung, W., "A statistical model-based voice activity detection," IEEE Signal Processing Letters, Vol. 6, No. 1, pp. 1-3, Jan. 1999.
[6] Ephraim, Y. and Malah, D., "Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator," IEEE Trans. on Acoust., Speech, Signal Processing, Vol. ASSP-32, pp. 1109-1121, Dec. 1984.
[7] [Authors and title illegible in extraction], 1-2-17, pp. 33-34, Sept. 2006.
[8] Arulampalam, M. S., Maskell, S., Gordon, N., and Clapp, T., "A Tutorial on Particle Filters for Online Nonlinear/Non-Gaussian Bayesian Tracking," IEEE Trans. SP, Vol. 50, No. 2, pp. 174-188, Feb. 2002.
[9] Balakrishnan, A. V., "Kalman Filtering Theory," Optimization Software, 1987.
[10] ITU-T Recommendation G.729 Annex B, "A silence compression scheme for G.729 optimized for terminals conforming to Recommendation V.70," Nov. 1996.
[11] ETSI standard document, "Speech processing, Transmission and Quality aspects (STQ), Advanced Distributed Speech Recognition; Front-end feature extraction algorithm; Compression algorithms," ETSI ES 202 050 v.1.1.4, Nov. 2005.
