Source Adaptive Blind Signal Extraction Using Closed-Form ICA for Hands-Free Robot Spoken Dialogue System
... estimate it via an ML scheme [7]. However, we cannot know a priori information on the sources' PDFs in a BSS context. Also, ML-based activation function estimation requires a huge amount of additional computation because this method must iteratively update the form of the activation function along with ICA's iterations; this is a serious drawback for real-time applications. Hence typical use of ICA simply replaces the activation function with a fixed function, e.g., tanh(·) for a speech signal [6, 8]. This leads to a notable mismatch and poor convergence, especially when we are confronted with more general acoustic signals such as non-speech noises.

2.3. Blind spatial subtraction array [9]

In a hands-free system in a real environment, it is required to extract a target speech and reduce noises which cannot be regarded as point sources. Although the conventional ICA-based BSS works well for mixtures of point sources, it is difficult to apply ICA to the reduction of non-point-source noise. BSSA has been proposed to extract a target speech in such a case. In BSSA, ICA is partly utilized as a noise estimator because ICA is proficient in noise estimation rather than target estimation [9]. BSSA consists of two paths: a delay-and-sum (DS) array based primary path as the target speech enhancing part, and an ICA-based reference path as the noise estimation part (see Fig. 1; FD-ICA is frequency-domain HO-ICA and PB means the projection back operation using $W(f)^{-1}$). Based on the spectral subtraction method, the BSSA's output $Y_{\mathrm{BSSA}}(f,t)$ can be given by

$$Y_{\mathrm{BSSA}}(f,t) = \begin{cases} \left\{ |Y_{\mathrm{DS}}(f,t)|^{2} - \gamma\,|Z(f,t)|^{2} \right\}^{1/2} & (\text{if } |Y_{\mathrm{DS}}(f,t)|^{2} - \gamma\,|Z(f,t)|^{2} \ge 0), \\ \delta\,|Y_{\mathrm{DS}}(f,t)| & (\text{otherwise}), \end{cases} \tag{5}$$

where $Y_{\mathrm{DS}}(f,t)$ is the output signal from the primary path, $Z(f,t)$ is the output signal from the reference path, $\gamma$ represents the oversubtraction parameter, and $\delta$ denotes the flooring parameter.

Fig. 1. Block diagram of BSSA.

3. PROPOSED METHOD

3.1. Motivation

In a previous study, a closed-form solution of SO-ICA was proposed by one of the authors [4], who showed that simple algebraic calculations enable the separation of mixed signals without iterative filter updating. This finding has motivated us to combine closed-form SO-ICA and estimation of the sources' PDFs at a low computational cost.

Our algorithm consists of three stages, namely, closed-form SO-ICA for roughly separating the sources, kurtosis-based activation function estimation applied to the roughly separated signals, and post-HO-ICA with the optimized activation function to increase the separation accuracy. This strategy is reasonable because the preceding SO-ICA can separate the sources to some extent regardless of the sources' PDFs (there is no activation function in SO-ICA), and we can then identify the PDFs after SO-ICA. Hereinafter we describe the detailed algorithm.

3.2. First stage: closed-form SO-ICA

This subsection briefly describes the overview of signal processing in the closed-form SO-ICA (see Ref. [4] for more details). First, we obtain the correlation matrices at different time points as

$$R_{x_i}(f) = \left\langle x(f,t)\,x(f,t)^{\mathrm{H}} \right\rangle_{T_i}, \tag{6}$$

where $\langle\cdot\rangle_{T_i}$ denotes the time-averaging operator over a specific time duration $T_i$, and $i = 1, 2, \ldots$ represents the indices of the time-averaging blocks. Next, we apply the singular value decomposition (SVD) to a superposition of $R_{x_i}(f)$, which is represented as

$$\sum_i R_{x_i}(f) = U(f)\,\mathrm{diag}(\lambda_1, \lambda_2, \ldots)\,U(f)^{\mathrm{H}}, \tag{7}$$

where $\lambda_k$ are the eigenvalues, $\mathrm{diag}(\lambda_1, \ldots)$ denotes the diagonal matrix which includes the eigenvalues, and $U(f)$ is the matrix consisting of the eigenvectors. Then we obtain a full-rank decomposition for the pseudo-inverse of $\sum_i R_{x_i}(f)$ as

$$\Bigl[\sum_i R_{x_i}(f)\Bigr]^{+} = L(f)\,L(f)^{\mathrm{H}}, \tag{8}$$

$$L(f) = U(f)\,\mathrm{diag}\bigl(1/\sqrt{\lambda_1},\, 1/\sqrt{\lambda_2},\, \ldots\bigr). \tag{9}$$

It can be proved [4] that if the covariance of the sources $s(f,t)$ in $T_i$ is negligible, every $L(f)^{\mathrm{H}} R_{x_i}(f) L(f)$ for any $i$ shares the same eigenvectors, and this is given via the SVD form as

$$L(f)^{\mathrm{H}} R_{x_i}(f)\,L(f) = T(f)\,\mathrm{diag}\bigl(\sigma_1(T_i), \sigma_2(T_i), \ldots\bigr)\,T(f)^{\mathrm{H}}, \tag{10}$$

where $\sigma_k(T_i)$ are the eigenvalues for a specific time block $T_i$, and $T(f)$ denotes the matrix consisting of the shared eigenvectors, which are independent of the time-block index $i$. Therefore, for any $i$, the simultaneous diagonalization of $R_{x_i}(f)$ can be achieved as follows:

$$T(f)^{\mathrm{H}} L(f)^{\mathrm{H}} R_{x_i}(f)\,L(f)\,T(f) = \mathrm{diag}\bigl(\sigma_1(T_i), \sigma_2(T_i), \ldots\bigr), \tag{11}$$

and this means that the optimal separation filter matrix in the second-order sense is given by

$$W_{\mathrm{SO}}(f) = \bigl(L(f)\,T(f)\bigr)^{\mathrm{H}}. \tag{12}$$

Note that, for the calculation of $T(f)$ in (10), it is sufficient to apply a single SVD to an arbitrary single time block $T_i$ because of the eigenvector-sharing property. The total computation of closed-form SO-ICA is therefore very small, almost the same as that of one iteration in HO-ICA.

3.3. Second stage: kurtosis-based activation function estimation

After closed-form SO-ICA, the roughly separated sources can be obtained. Therefore, we can estimate the PDFs of the sources via the following generalized Gaussian distribution (GGD) [11] modeling and its kurtosis. The result of this PDF estimation will be utilized for HO-ICA in the next stage (third stage). Hereafter $y(t)$ means a real part of the separated signal.

GGD is a flexible family of PDF models with some variable parameters, and GGD can represent various types of well-known PDFs.
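The closed-form procedure of Section 3.2, which produces these roughly separated signals, can be sketched for a single frequency bin as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function name `closed_form_so_ica`, the equal-length block split, and the use of `numpy.linalg.eigh` in place of the SVD (the two coincide for Hermitian positive semidefinite matrices) are choices made here, and the summed correlation matrix is assumed to be full rank.

```python
import numpy as np

def closed_form_so_ica(x, n_blocks=2):
    """Sketch of Eqs. (6)-(12) for one frequency bin.

    x: complex array of shape (n_channels, n_frames), the observed
       time-frequency samples x(f, t) at a fixed frequency f.
    Returns W_SO(f) with shape (n_channels, n_channels).
    """
    # Eq. (6): one correlation matrix per time block T_i
    # (hypothetical choice: consecutive blocks of near-equal length).
    blocks = np.array_split(x, n_blocks, axis=1)
    R = [b @ b.conj().T / b.shape[1] for b in blocks]

    # Eq. (7): eigendecomposition of the superposed correlation matrices.
    lam, U = np.linalg.eigh(sum(R))

    # Eqs. (8)-(9): L = U diag(1/sqrt(lambda_k)); assumes all lambda_k > 0.
    L = U / np.sqrt(lam)

    # Eq. (10): shared eigenvectors T from a single, arbitrary block.
    M = L.conj().T @ R[0] @ L
    M = (M + M.conj().T) / 2          # enforce Hermitian symmetry numerically
    _, T = np.linalg.eigh(M)

    # Eq. (12): second-order separation matrix.
    return (L @ T).conj().T

if __name__ == "__main__":
    # Toy sources whose block-wise cross-correlation is exactly zero
    # and whose powers differ between the two blocks.
    s = np.array([[2, 2, 2, 2, 1, 1, 1, 1],
                  [1, -1, 1, -1, 3, -3, 3, -3]], dtype=complex)
    A = np.array([[1.0, 0.5 + 0.2j], [0.3 - 0.4j, 1.0]])  # hypothetical mixing
    W = closed_form_so_ica(A @ s)
    print(np.round(np.abs(W @ A), 6))  # one dominant entry per row
```

With real recordings the block-wise source covariance is only approximately zero, so the product of the separation and mixing matrices is only approximately a scaled permutation, which is why the later stages refine the result.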
The Gaussian and Laplacian distributions, for example, are members of the GGD family. The definition of GGD is as follows:

$$f_{\mathrm{GG}}(z;\alpha,\beta) = \frac{\beta}{2\alpha\,\Gamma(1/\beta)} \exp\!\left(-\left[\frac{|z-\bar{z}|}{\alpha}\right]^{\beta}\right), \tag{13}$$

where $\bar{z}$ is the mean of the variable $z$, $\Gamma(x) = \int_0^{\infty} t^{x-1}\exp(-t)\,dt$ is the gamma function, $\alpha$ is the scale parameter, and $\beta$ is the shape parameter of GGD. Figure 2 shows examples of different PDFs in GGD; note that $\beta = 2$ corresponds to the Gaussian PDF and that $\beta = 1$ corresponds to the Laplacian PDF.

Fig. 2. GGD with typical scale parameter $\alpha$ and shape parameter $\beta$.

If the source's PDF obeys GGD, we can easily derive the appropriate activation function, for a zero-mean signal, as

$$\varphi(z) = -\frac{\partial}{\partial z}\log f_{\mathrm{GG}}(z;\alpha,\beta) = \begin{cases} \dfrac{\beta}{\alpha^{\beta}}\, z^{\beta-1} & (z \ge 0), \\[2mm] -\dfrac{\beta}{\alpha^{\beta}}\, |z|^{\beta-1} & (z < 0). \end{cases} \tag{14}$$

Figure 3 depicts examples of activation functions for GGD.

Fig. 3. Activation function for GGD with typical shape parameter $\beta$.

The estimation of the shape parameter $\beta$ is the most important issue here, because $\beta$ dominantly determines the activation function (see (14)) whereas the scale parameter $\alpha$ becomes negligible through scale normalization.

We can introduce kurtosis to estimate $\beta$. In general, the kurtosis of a signal $y(t)$ is given by

$$\mathrm{kurt}(y(t)) = \frac{\langle y^{4}(t)\rangle_t}{\langle y^{2}(t)\rangle_t^{2}} - 3. \tag{15}$$

The $n$-th order moment of GGD (for even $n$) has the following useful relationship:

$$\langle z^{n} \rangle = \alpha^{n}\,\frac{\Gamma\bigl((n+1)/\beta\bigr)}{\Gamma(1/\beta)}. \tag{16}$$

From (15) and (16) we have

$$\mathrm{kurt}(y(t)) = \frac{\Gamma(5/\beta)\,\Gamma(1/\beta)}{\Gamma(3/\beta)^{2}} - 3. \tag{17}$$

This is a monotonically decreasing function of $\beta$. Thus, we can estimate the shape parameter $\beta$ by measuring the kurtosis and using the inverse relationship of (17) in a table-lookup manner.

In summary, we measure the kurtosis of each of the signals separated by closed-form SO-ICA, and then we can adaptively determine the corresponding activation function by (14) for each sound source. The only computation required is one kurtosis calculation just after SO-ICA, and consequently our method is computationally efficient in comparison with ML-based activation function estimation. A possible drawback of the proposed method is PDF estimation error due to poor separation accuracy in SO-ICA. Thus, the PDF estimation performance is highly related to the degree of ease of source separation, e.g., the reverberant conditions.

3.4. Third stage: nonclosed-form HO-ICA

The separation filter matrix $W_{\mathrm{SO}}(f)$ obtained by SO-ICA often provides insufficient source-separation performance. To polish up the separation filter matrix and gain further performance, we propose to apply the nonclosed-form HO-ICA after closed-form SO-ICA, employing the source-PDF-adapted activation function. This strategy regards the separation filter matrix $W_{\mathrm{SO}}(f)$ as an initial value for HO-ICA's iterative learning given by (3).

In general, HO-ICA suffers from poor and slow convergence of the nonlinear optimization. In the proposed method, however, the preceding closed-form SO-ICA can give a better initial state for HO-ICA, and the preceding PDF estimation enables HO-ICA to use a more appropriate activation function. This combination mitigates the drawback of poor convergence.

4. EXPERIMENTAL EVALUATION FOR ALGORITHM

To evaluate the efficacy of the proposed method, we carried out noise reduction experiments in a real reverberant room where two omnidirectional microphones are set. The reverberation time (RT) in this room is 200 ms. The target speech signal arrives from a fixed direction and is spoken by two male and two female speakers. As for the noise, a diffuse noise recorded in an actual railway station is emitted from 36 surrounding loudspeakers. The noise reduction rate (NRR) [8], defined as the output signal-to-noise ratio (SNR) in dB minus the input SNR in dB, is used as the objective indicator of separation performance.

Figure 4 shows the convergence curves of NRR for the noise estimation part in BSSA (batch processing for 3-second data). We compare simple ICA (the activation function is fixed to tanh(·)), ICA with ML-based activation function estimation, and the proposed method. The horizontal axis represents the total computational cost, which is almost equal to the number of iterations in the HO-ICA part multiplied by the number of frequency bins, including the additional computations to estimate the activation function. From the results, we can see that the proposed method converges both faster and to a higher NRR.

Figure 5 depicts the resultant NRR of speech extraction in the BSSA output, where this BSSA is implemented to run in real time [10]. To simulate a realistic spoken dialogue system, we made a test input consisting of noise-only periods (0-35 s and 55-100 s) and noise-speech mixing parts (35-55 s, where the speech DOA is -20°, and 100-120 s, where the speech DOA is 20°). We confirm that the proposed method remains highly effective in real-time operation.

5. HANDS-FREE ROBOT SPOKEN DIALOGUE SYSTEM

We have recently developed a hands-free robot spoken dialogue system using the proposed real-time BSSA, which is mainly used for railway station guidance in a noisy environment (see Fig. 6). To evaluate the system, a speech recognition test was conducted.
The test took place in a reverberant room where the RT is more than 400 ms. The target speech is uttered in front of a microphone array at a distance of 1.5 m. We use 5 speakers (250 words) as the target utterances. As for the noise, two noises were added simultaneously. The first noise is a diffuse noise recorded in an actual railway station and emitted from 8 surrounding loudspeakers (it simulates railway-station noise). The second noise is an interference speech located at 50 degrees in the right direction of the microphone array, at a distance of 2.0 m. An eight-element array with an interelement spacing of 2 cm is used.

Figure 7 gives a comparative assessment example from the viewpoint of the preprocessing microphone array methods, i.e., the conventional DS, ICA, or the proposed BSSA. The results reveal that both the word correct and the word accuracy of the proposed BSSA are obviously superior to those of the conventional DS and ICA, and our proposed system notably sustains a recognition accuracy of more than 80%. The demonstration movie of the robot dialogue system is available at the following URL. Readers can confirm the fluent conversation.

Demo video: http://spalab.naist.jp/database/Demo/rtbssa/

Fig. 4. Noise estimation performance where speech direction is (a) -40° and (b) 30°.

Fig. 5. Real-time implementation results, where fixed activation function is used in ICA and BSSA.

Fig. 6. Appearance of robot spoken dialogue system.

Fig. 7. Comparison of preprocessing methods (DS, ICA, BSSA) in (a) word correct, and (b) word accuracy.

6. CONCLUSION

In this paper, first, we proposed a new efficient BSS algorithm combining closed-form SO-ICA and source-PDF adaptive HO-ICA, where the activation function in HO-ICA is optimized by using information from closed-form ICA. This enables us to improve the separation accuracy while saving computational cost. Secondly, we demonstrated our recently developed hands-free robot spoken dialogue system and showed that the proposed system can work under an adverse condition such as a railway station environment.

7. REFERENCES

[1] K. Nakadai, et al., "Applying scattering theory to robot audition system: robust sound source localization and extraction," Proc. IROS-2003, pp. 1147-1152, 2003.
[2] R. Prasad, et al., "Robots that can hear, understand and talk," Advanced Robotics, vol. 18, pp. 533-564, 2004.
[3] P. Comon, "Independent component analysis, a new concept?," Signal Processing, vol. 36, pp. 287-314, 1994.
[4] A. Tanaka, et al., "Theoretical foundations of second-order statistics-based blind source separation for non-stationary sources," Proc. ICASSP, pp. III-600-III-603, 2006.
[5] P. Smaragdis, "Blind separation of convolved mixtures in the frequency domain," Neurocomputing, vol. 22, pp. 21-34, 1998.
[6] S. Ikeda and N. Murata, "A method of blind source separation based on temporal structure of signals," Proc. ICONIP, pp. 737-742, 1998.
[7] S. Haykin (ed.), Unsupervised Adaptive Filtering, Volume 1: Blind Source Separation, John Wiley & Sons, 2000.
[8] H. Saruwatari, et al., "Blind source separation based on a fast-convergence algorithm combining ICA and beamforming," IEEE Trans. Audio, Speech & Language Process., vol. 14, pp. 666-678, 2006.
[9] Y. Takahashi, et al., "Blind spatial subtraction array with independent component analysis for hands-free speech recognition," Proc. IWAENC, 2006.
[10] Y. Takahashi, et al., "Real-time implementation of blind spatial subtraction array for hands-free robot spoken dialogue system," Proc. IROS, pp. 1687-1692, 2008.
[11] G. Box, et al., Bayesian Inference in Statistical Analysis, Addison Wesley, Reading, Massachusetts, 1973.
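As a supplementary illustration of the second stage (Section 3.3), the kurtosis-based table lookup of Eq. (17) and the resulting activation function of Eq. (14) might be sketched as below. The grid of shape parameters (0.3 to 4.0 in steps of 0.01), the linear interpolation, the small floor `eps`, and all function names are assumptions made for this sketch; the paper states only that the inverse of (17) is evaluated in a table-lookup manner.

```python
import math
import numpy as np

def ggd_kurtosis(beta):
    """Eq. (17): kurtosis of a GGD as a function of the shape parameter."""
    lg = math.lgamma
    return math.exp(lg(5.0 / beta) + lg(1.0 / beta) - 2.0 * lg(3.0 / beta)) - 3.0

def sample_kurtosis(y):
    """Eq. (15): kurtosis of a (zero-mean) real-valued signal y(t)."""
    y = np.asarray(y, dtype=float)
    return np.mean(y ** 4) / np.mean(y ** 2) ** 2 - 3.0

# Lookup table (hypothetical grid). Eq. (17) decreases monotonically in beta,
# so the reversed table is increasing, as np.interp requires.
BETAS = np.linspace(0.3, 4.0, 371)
KURTS = np.array([ggd_kurtosis(b) for b in BETAS])

def estimate_beta(y):
    """Invert Eq. (17) by table lookup; values outside the table are clamped."""
    return float(np.interp(sample_kurtosis(y), KURTS[::-1], BETAS[::-1]))

def ggd_activation(z, beta, alpha=1.0, eps=1e-12):
    """Eq. (14) for a zero-mean GGD: (beta / alpha^beta) * sign(z) * |z|^(beta-1).

    eps floors |z| so that beta < 1 does not divide by zero at z = 0.
    """
    z = np.asarray(z, dtype=float)
    mag = np.maximum(np.abs(z), eps)
    return (beta / alpha ** beta) * np.sign(z) * mag ** (beta - 1.0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    y = rng.laplace(size=200_000)      # stand-in for a roughly separated signal
    beta_hat = estimate_beta(y)        # should come out close to 1 (Laplacian)
    print(beta_hat, ggd_activation([-1.0, 0.5], beta_hat))
```

Because only one kurtosis value per separated signal is needed, the cost of this stage is a single pass over the data plus one interpolation, which is consistent with the efficiency argument made in Section 3.3.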