Voice Activity Detection in Noisy Environments with State Transition Processes for Both Speech and Noise
(2) [...] Sohn [...] VAD [...] SNR [...] GMM (Gaussian Mixture Model) [...] 2. [...] 2.1. [...] VAD [5] [...] 2.2. [...] Sohn [...] Ergodic [...]
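(3)
[Figure: block diagram with the labels "Noise model", "Composition", and "Noisy speech model".]

[...] (1) [...] (2) [...] where $a_{i,j} = p(q_t = H_j \mid q_{t-1} = H_i)$ and $b_j(O_t) = p(O_t \mid q_t = H_j)$ [...] $\alpha_{0,0} = 1$, $\alpha_{1,0} = 0$ at $t = 0$ [...] (3) [...] (4)

2.3. [...]

With $O_{0:t} = \{O_0, \dots, O_t\}$ and $N_{0:t} = \{N_0, \dots, N_t\}$ [...]

$$p(q_t \mid O_{0:t}, N_{0:t}) = \frac{p(O_{0:t}, q_t, N_{0:t})}{p(O_{0:t}, N_{0:t})} \propto p(O_{0:t}, q_t, N_{0:t}) \tag{5}$$

$$p(O_{0:t}, N_{0:t}) = p(N_t \mid N_{t-1})\, p(O_t \mid N_t)\, p(O_{0:t-1}, N_{0:t-1})$$

$$p(O_{0:t}, q_t, N_{0:t}) = [\dots] \times p(O_{0:t-1}, q_{t-1}, N_{0:t-1}) \tag{6}$$

[...] $q_t$ [...] $N_t$ [...]

$$p(O_{0:t}, q_t, N_{0:t}) = \sum_{q_{t-1}} p(q_t \mid q_{t-1})\, p(N_t \mid N_{t-1})\, p(O_t \mid q_t, N_t)\, p(O_{0:t-1}, q_{t-1}, N_{0:t-1}) \tag{7}$$

[...] $p(N_t \mid N_{t-1}) = c_{t,t-1}$ [...]

$$\alpha_{j,t} = \Big[\sum_{i} \alpha_{i,t-1}\, a_{i,j}\Big]\, b_{j,N_t}(O_t)\, c_{t,t-1} \tag{8}$$

[...] $c_{t,t-1} = 1$ [...]

$$R_t = \frac{\alpha_{1,t}}{\alpha_{0,t}}, \qquad H_0:\ R_t < \text{Threshold}, \qquad H_1:\ R_t \ge \text{Threshold} \tag{9}$$

3. [...]

3.1. [...] (10)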
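The forward recursion (8) and the threshold test on $R_t$ in (9) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function name `forward_vad`, the argument layout, and the per-frame rescaling are hypothetical, the emission values $b_{j,N_t}(O_t)$ and noise transition values $c_{t,t-1}$ are taken as precomputed inputs, and the recursion is assumed to start in the non-speech state $H_0$.

```python
from typing import List, Sequence, Tuple


def forward_vad(
    emissions: Sequence[Sequence[float]],
    trans: Sequence[Sequence[float]],
    noise_trans: Sequence[float],
    threshold: float,
) -> Tuple[List[float], List[int]]:
    """Frame-wise likelihood ratios R_t and H0/H1 decisions.

    emissions[t][j] stands for b_{j,N_t}(O_t), trans[i][j] for a_{i,j},
    and noise_trans[t] for c_{t,t-1}. All values are assumed positive.
    """
    alpha = [1.0, 0.0]  # assumption: the recursion starts in H0 (non-speech)
    ratios: List[float] = []
    decisions: List[int] = []
    for b_t, c_t in zip(emissions, noise_trans):
        # Eq. (8): alpha_{j,t} = [sum_i alpha_{i,t-1} a_{i,j}] b_{j,N_t}(O_t) c_{t,t-1}
        new_alpha = [
            (alpha[0] * trans[0][j] + alpha[1] * trans[1][j]) * b_t[j] * c_t
            for j in range(2)
        ]
        # Rescale to avoid underflow; the common factor cancels in R_t.
        total = sum(new_alpha)
        alpha = [value / total for value in new_alpha]
        ratio = alpha[1] / alpha[0] if alpha[0] > 0.0 else float("inf")
        ratios.append(ratio)
        # Eq. (9): H1 (speech) if R_t >= threshold, otherwise H0 (non-speech)
        decisions.append(1 if ratio >= threshold else 0)
    return ratios, decisions
```

Because $c_{t,t-1}$ multiplies both states equally, it cancels in the ratio $R_t$, which is consistent with treating it as 1 in the forward-only decision.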
(4)
[...] (11) [...] $\partial O_t / \partial N_t$ [...] (12)

[...] GMM $j$ [...]

$$O_{t,i} = S_{t,i} + \log\big(1 + \exp(N_{t,i} - S_{t,i})\big) \tag{13}$$

[...] $O_{t,i}$ [...] $O_t$ [...] $S_{t,i}$, $N_{t,i}$ [...] (13) [...]

[...] $\sum_{k=1}^{K}$ [...] (14) [...] (15) [...]

[...] GMM $j$ ($j = 0$: [...] GMM, $j = $ [...]) [...] Switching [...] $b_{j,N_t}(O_t)$ [...] (17) (18) (19) (20)

3.2. [...] (7) [...] $(O_{t,i} - \mu_{O_{t|t-1},i,k,j})$ [...] (22) (23) (24) (25) (26) (27) (28) (29) (30) (31)
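As a small worked illustration of the composition in (13), the following Python sketch evaluates $O_{t,i} = S_{t,i} + \log(1 + \exp(N_{t,i} - S_{t,i}))$ element by element. The function names are hypothetical. The second function, the derivative of $O_{t,i}$ with respect to $N_{t,i}$, is an addition for illustration only (the kind of sensitivity term a linearized filter would need); it is not taken from the text.

```python
import math
from typing import List, Sequence


def noisy_log_mel(speech: Sequence[float], noise: Sequence[float]) -> List[float]:
    """Eq. (13) per element: O = S + log(1 + exp(N - S)).

    Written as max(S, N) + log1p(exp(-|S - N|)), which is algebraically
    the same value (log(exp(S) + exp(N))) but cannot overflow.
    """
    return [
        max(s, n) + math.log1p(math.exp(-abs(s - n)))
        for s, n in zip(speech, noise)
    ]


def noise_sensitivity(speech: Sequence[float], noise: Sequence[float]) -> List[float]:
    """dO/dN per element: exp(N - S) / (1 + exp(N - S)), a logistic function of N - S.

    Illustrative addition, not taken from the source text.
    """
    result = []
    for s, n in zip(speech, noise):
        d = n - s
        if d >= 0.0:
            result.append(1.0 / (1.0 + math.exp(-d)))
        else:
            e = math.exp(d)
            result.append(e / (1.0 + e))
    return result
```

When the speech term dominates ($S_{t,i} \gg N_{t,i}$) the observation is close to $S_{t,i}$ and the sensitivity to noise approaches 0; when the noise dominates, the observation is close to $N_{t,i}$ and the sensitivity approaches 1.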
(5)
[...] $= 0.0001$ [...]

$$p(O_{0:T}, q_t, N_{0:T}) = p(O_{0:t}, q_t, N_{0:t})\, p(O_{t+1:T}, N_{t+1:T} \mid q_t, N_t) \tag{32}$$

$$p(O_{t+1:T}, N_{t+1:T} \mid q_t, N_t) = [\dots] \times p(O_{t+2:T}, N_{t+2:T} \mid q_{t+1}, N_{t+1}) \tag{33}$$

With $\beta_{i,t} = p(O_{t+1:T}, N_{t+1:T} \mid q_t = H_i, N_t)$, (33) takes the same form as (8), where $p(N_{t+1} \mid N_t) = c_{t+1,t}$ [...] (34)

Since $p(O_{0:T}, q_t = H_j, N_{0:T}) = \alpha_{j,t} \cdot \beta_{j,t}$, the likelihood ratio becomes

$$R_t = \frac{p(O_{0:T}, q_t = H_1, N_{0:T})}{p(O_{0:T}, q_t = H_0, N_{0:T})} = \frac{\alpha_{1,t} \cdot \beta_{1,t}}{\alpha_{0,t} \cdot \beta_{0,t}} \tag{35}$$

[...] (36) [...] (37) [...] (38)

4. [...]

4.1 [...] 2,292 [...] (178 [...]) [...] 1118 [...] SNR [...] 0, 5, 10, 15 [...] $a_{i,j} = \{0.8, 0.2, 0.1, 0.9\}$ [...] Kalman filter [9] [...]

[...] FAR (False Acceptance Rate) and FRR (False Rejection Rate) [...] (39) [...] Threshold [...] ROC (Receiver Operating Characteristics) [...]

4.2 [...] Fig. 4 [...] the proposed method (Proposed) [...] Sohn [...], LTSD (Long-Term Spectral Divergence) [3], ITU-T G.729 Annex B [10], ETSI ES 202 050 [11], [...] (Previous) [7] [...] [10], [11] [...]

5. [...] VAD [...] SNR [...]
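The smoothed likelihood ratio in (35) and the FAR/FRR scoring can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation. The function names, the per-frame rescaling, and the start in state $H_0$ are hypothetical. The backward pass is the standard HMM backward recursion with the extra factor $c_{t+1,t}$, written by analogy with (8) because (33) and (34) are only partly legible. The FAR and FRR definitions used here are the common frame-level ones (falsely accepted non-speech frames over all non-speech frames, falsely rejected speech frames over all speech frames) and are assumed rather than taken from (39).

```python
from typing import List, Sequence, Tuple


def smoothed_ratios(
    emissions: Sequence[Sequence[float]],
    trans: Sequence[Sequence[float]],
    noise_trans: Sequence[float],
) -> List[float]:
    """R_t = alpha_{1,t} beta_{1,t} / (alpha_{0,t} beta_{0,t}) for every frame.

    emissions[t][j] stands for b_{j,N_t}(O_t), trans[i][j] for a_{i,j},
    noise_trans[t] for c_{t,t-1}. All values are assumed positive.
    """
    n = len(emissions)
    alphas: List[List[float]] = []
    prev = [1.0, 0.0]  # assumption: start in H0 (non-speech)
    for t in range(n):
        cur = [
            (prev[0] * trans[0][j] + prev[1] * trans[1][j])
            * emissions[t][j] * noise_trans[t]
            for j in range(2)
        ]
        total = sum(cur)
        prev = [v / total for v in cur]  # rescaling cancels in the ratio
        alphas.append(prev)
    betas = [[1.0, 1.0] for _ in range(n)]  # beta at the last frame is 1
    for t in range(n - 2, -1, -1):
        # Assumed backward step: beta_{i,t} = c_{t+1,t} sum_j a_{i,j} b_j(O_{t+1}) beta_{j,t+1}
        cur = [
            noise_trans[t + 1]
            * sum(trans[i][j] * emissions[t + 1][j] * betas[t + 1][j] for j in range(2))
            for i in range(2)
        ]
        total = sum(cur)
        betas[t] = [v / total for v in cur]
    ratios: List[float] = []
    for a, b in zip(alphas, betas):
        den = a[0] * b[0]
        ratios.append(a[1] * b[1] / den if den > 0.0 else float("inf"))
    return ratios


def far_frr(decisions: Sequence[int], labels: Sequence[int]) -> Tuple[float, float]:
    """Frame-level (FAR, FRR); 1 marks speech and 0 non-speech (assumed definitions)."""
    non_speech = sum(1 for y in labels if y == 0)
    speech = sum(1 for y in labels if y == 1)
    false_accepts = sum(1 for d, y in zip(decisions, labels) if d == 1 and y == 0)
    false_rejects = sum(1 for d, y in zip(decisions, labels) if d == 0 and y == 1)
    far = false_accepts / non_speech if non_speech else 0.0
    frr = false_rejects / speech if speech else 0.0
    return far, frr
```

Sweeping the threshold applied to the returned ratios and recording one (FAR, FRR) pair per threshold value yields the points of an ROC curve. Reading the transition values $\{0.8, 0.2, 0.1, 0.9\}$ as $a_{0,0}, a_{0,1}, a_{1,0}, a_{1,1}$ would give `trans = [[0.8, 0.2], [0.1, 0.9]]`; that ordering is an assumption.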
(6)
[Figure: ROC curves. Panels: (a) Airport 0 dB, (b) Airport 5 dB, (c) Airport 10 dB, (d) Street 0 dB, (e) Street 5 dB, (f) Street 10 dB. Horizontal axis: False acceptance rate. Legend: Sohn, LTSD, G.729 Annex B, ETSI ES 202 050, Previous, Proposed.]

[1] Rabiner, L. R. and Sambur, M. R., "An algorithm for determining the endpoints of isolated utterances," The Bell System Technical Journal, Vol. 54, No. 2, pp. 297-315, Feb. 1975.
[2] Nemer, E., Goubran, R., and Mahmoud, S., "Robust voice activity detection using higher-order statistics in the LPC residual domain," IEEE Trans. on Speech and Audio Processing, Vol. 9, No. 3, pp. 217-231, March 2001.
[3] Ramirez, J., Segura, J. C., Benitez, C., de la Torre, A., and Rubio, A., "Efficient voice activity detection algorithm using long-term speech information," Speech Communication, Vol. 42, pp. 271-287, Apr. 2004.
[4] Ishizuka, K. and Kato, H., "A feature for voice activity detection derived from speech analysis with the exponential autoregressive model," Proc. of ICASSP '06, Toulouse, France, Vol. I, pp. 789-792, May 2006.
[5] Sohn, J., Kim, N. S., and Sung, W., "A statistical model-based voice activity detection," IEEE Signal Processing Letters, Vol. 6, No. 1, pp. 1-3, Jan. 1999.
[6] Ephraim, Y. and Malah, D., "Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator," IEEE Trans. on Acoust., Speech, Signal Processing, Vol. ASSP-32, pp. 1109-1121, Dec. 1984.
[7] [...], 1-2-17, pp. 33-34, Sept. 2006.
[8] Arulampalam, M. S., Maskell, S., Gordon, N., and Clapp, T., "A Tutorial on Particle Filters for Online Nonlinear/Non-Gaussian Bayesian Tracking," IEEE Trans. SP, Vol. 50, No. 2, pp. 174-188, Feb. 2002.
[9] Balakrishnan, A. V., "Kalman Filtering Theory," Optimization Software, 1987.
[10] ITU-T Recommendation G.729 Annex B, "A silence compression scheme for G.729 optimized for terminals conforming to Recommendation V.70," Nov. 1996.
[11] ETSI standard document, "Speech processing, Transmission and Quality aspects (STQ), Advanced Distributed Speech Recognition; Front-end feature extraction algorithm; Compression algorithms," ETSI ES 202 050 v.1.1.4, Nov. 2005.