Source Adaptive Blind Signal Extraction Using Closed-Form ICA for Hands-Free Robot Spoken Dialogue System

(1) SOURCE ADAPTIVE BLIND SIGNAL EXTRACTION USING CLOSED-FORM ICA FOR HANDS-FREE ROBOT SPOKEN DIALOGUE SYSTEM

Yu Takahashi†, Hiroshi Saruwatari†, Yuki Fujihara†, Kentaro Tachibana†, Yoshimitsu Mori†, Shigeki Miyabe†, Kiyohiro Shikano†, Akira Tanaka‡

†Nara Institute of Science and Technology, Ikoma, Nara, 630-0192, JAPAN
‡Hokkaido University, Kita-14, Nishi-9, Kita-ku, Sapporo, 060-0814, JAPAN

ABSTRACT

In this paper, we propose a new ICA-based BSS algorithm including estimation of the sources' probability density functions (PDFs) to adapt the nonlinear activation function to various noise conditions. In the proposed method, closed-form second-order ICA is introduced as computational-cost-efficient preprocessing to extract the sources' PDFs, which is beneficial for real-time application. Compared with various types of conventional ICA, e.g., the fixed activation-function type and the ML-based type, our proposed algorithm gives faster and higher convergence. Based on the proposed source-adaptive ICA, we show real-time noise reduction results under a realistic diffuse noise environment. We also demonstrate our recently developed hands-free robot spoken dialogue system via real-time ICA.

Index Terms: Separation, speech enhancement, acoustic signal processing, adaptive signal processing, robot

1. INTRODUCTION

Blind source separation (BSS) is the approach taken to estimate original source signals using only information of the mixed signals observed in each input channel. This technique is based on unsupervised filtering in that the source-separation procedure requires no training sequences and no a priori information on the directions-of-arrival (DOAs) of the sound sources. Owing to the attractive features of BSS, much attention has been paid to the BSS technique in many fields of signal processing. One promising example in acoustic signal processing is a humanoid robot auditory system [1], which constructs an indispensable basis for intelligent robot technology [2].

In this paper, we propose a new independent component analysis [3] (ICA)-based BSS algorithm including estimation of the sources' probability density functions (PDFs) to adapt the nonlinear activation function to various noise conditions. Our previously proposed closed-form second-order ICA (SO-ICA) [4] is introduced as computational-cost-efficient preprocessing to extract the sources' PDFs. This feature is beneficial for real-time implementation of BSS. Compared with various types of conventional ICA, e.g., the fixed activation function type [5, 6] and the maximum likelihood (ML)-based type [7], our proposed algorithm gives faster and higher convergence. Based on the proposed source-adaptive ICA, we show real-time noise reduction results under a diffuse noise environment. We also demonstrate our recently developed hands-free robot spoken dialogue system [10] for a railway-station guidance task via real-time ICA.

2. MIXING PROCESS AND CONVENTIONAL METHODS

2.1. Mixing process

In this study, the number of microphones is $K$ and the number of multiple sound sources is $L$, where we deal with the case of $K = L$. Multiple mixed signals are observed at the microphone array, and these signals are converted into discrete-time series via an A/D converter. By applying the short-time discrete-time Fourier transform framewise, we can express the observed signals, in which multiple source signals are linearly mixed, as follows in the time-frequency domain:

$$ x(f,t) = A(f)\,s(f,t), \tag{1} $$

where $f$ and $t$ represent the frequency bin number and the time index, respectively, $x(f,t) = [x_1(f,t), \ldots, x_K(f,t)]^{\mathrm T}$ is the observed signal vector, and $s(f,t) = [s_1(f,t), \ldots, s_L(f,t)]^{\mathrm T}$ is the source signal vector. Also, $A(f)$ is the mixing matrix, which is complex-valued because we introduce a model to deal with the relative time delays among the microphones and room reverberations.

2.2. ICA-based BSS

Next, we perform signal separation using the complex-valued unmixing matrix $W(f)$, so that the $L$ time-series outputs $y(f,t) = [y_1(f,t), \ldots, y_L(f,t)]^{\mathrm T}$ become mutually independent; this procedure can be given as

$$ y(f,t) = W(f)\,x(f,t). \tag{2} $$

We perform this procedure with respect to all frequency bins. Various ICA methods for optimizing $W(f)$ have been proposed. In the conventional frequency-domain higher-order ICA (HO-ICA) [5, 6, 8], the optimal $W_{\mathrm{HO}}(f)$ is obtained by the following iterative equation:

$$ W_{\mathrm{HO}}^{[m+1]}(f) = \mu \left[ I - \left\langle \varphi(y(f,t))\, y^{\mathrm H}(f,t) \right\rangle_t \right] W_{\mathrm{HO}}^{[m]}(f) + W_{\mathrm{HO}}^{[m]}(f), \tag{3} $$

where $X^{\mathrm H}$ denotes the Hermitian transpose of matrix $X$, $\mu$ is the step-size parameter, $I$ is an identity matrix, $[m]$ is used to express the value of the $m$-th step in the iteration, and $\langle \cdot \rangle_t$ denotes a time-averaging operator. Here $\varphi(y(f,t))$ is the appropriate nonlinear vector function, a.k.a. the activation function. Basically, this function's element corresponding to each source, $\varphi_l(\cdot)$, should be determined as

$$ \varphi_l(s_l(f,t)) = -\frac{\partial}{\partial s_l(f,t)} \log p_{s_l}(s_l(f,t)), \tag{4} $$

where $p_{s_l}$ is the PDF of the source $s_l(f,t)$. Thus, generally speaking, in HO-ICA we need to determine the activation function in advance, or

This work was partly supported by the NEDO project for strategic development of advanced robotics elemental technologies, and the MIC Strategic Information and Communications R&D Promotion Programme in Japan.

978-1-4244-2354-5/09/$25.00 ©2009 IEEE. 3681. ICASSP 2009.
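As a rough illustration of the iterative rule (3), the following pure-Python sketch performs one HO-ICA update at a single frequency bin. The names `polar_tanh` and `ho_ica_step`, the default step size, and the polar-coordinate form of the tanh activation are assumptions made for this example only; the paper itself states just that a fixed tanh(·) is a typical choice for speech.

```python
import cmath
import math


def polar_tanh(y):
    """Assumed fixed activation: tanh on the magnitude, phase left unchanged."""
    return math.tanh(abs(y)) * cmath.exp(1j * cmath.phase(y))


def ho_ica_step(W, X, mu=0.1, phi=polar_tanh):
    """One iteration of (3) for one frequency bin.

    W   : K x K unmixing matrix as a list of rows.
    X   : list of observation vectors x(f, t), one per time frame.
    mu  : step-size parameter (the default is an arbitrary illustrative value).
    phi : scalar activation function applied to each output element.
    """
    K = len(W)
    n_frames = len(X)
    # y(f, t) = W(f) x(f, t), as in (2)
    Y = [[sum(W[i][k] * x[k] for k in range(K)) for i in range(K)] for x in X]
    # Time average <phi(y) y^H>_t
    C = [[sum(phi(y[i]) * y[j].conjugate() for y in Y) / n_frames
          for j in range(K)] for i in range(K)]
    # G = I - <phi(y) y^H>_t
    G = [[(1.0 if i == j else 0.0) - C[i][j] for j in range(K)]
         for i in range(K)]
    # W <- mu * G W + W
    return [[mu * sum(G[i][k] * W[k][j] for k in range(K)) + W[i][j]
             for j in range(K)] for i in range(K)]
```

Repeating `ho_ica_step` until `W` stops changing gives the batch update for that bin; the same loop would be run independently for every frequency bin.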

(2) estimate it via an ML scheme [7]. However, we cannot know a priori information on the sources' PDFs in a BSS context. Also, ML-based activation function estimation requires huge additional computations because this method must iteratively update the form of the activation function along with ICA's iterations; this is a great drawback for real-time application. Hence typical use of ICA simply substitutes a fixed function for the activation function, e.g., tanh(·) for speech signals [6, 8]. This leads to a notable mismatch and bad convergence, especially when we are confronted with more general acoustic signals such as non-speech noises.

2.3. Blind spatial subtraction array [9]

In a hands-free system in a real environment, it is required to extract a target speech and reduce noises that cannot be regarded as point sources. Although conventional ICA-based BSS can work, especially for mixtures of point sources, it is difficult to apply ICA to non-point-source noise reduction. BSSA has been proposed to extract a target speech in such a case. In BSSA, ICA is partly utilized as a noise estimator because ICA is proficient in noise estimation rather than target estimation [9]. BSSA consists of two paths: a delay-and-sum (DS) array based primary path as the target speech enhancing part, and an ICA-based reference path as the noise estimation part (see Fig. 1; FD-ICA is frequency-domain HO-ICA and PB means the projection back operation using $W(f)^{-1}$). Based on the spectral subtraction method, the BSSA's output $y_{\mathrm{BSSA}}(f,t)$ can be given by

$$ y_{\mathrm{BSSA}}(f,t) = \begin{cases} \left\{ |y_{\mathrm{DS}}(f,t)|^2 - \gamma\,|z(f,t)|^2 \right\}^{1/2} & (\text{if } |y_{\mathrm{DS}}(f,t)|^2 - \gamma\,|z(f,t)|^2 \ge 0), \\ \delta\,|y_{\mathrm{DS}}(f,t)| & (\text{otherwise}), \end{cases} \tag{5} $$

where $y_{\mathrm{DS}}(f,t)$ is the output signal from the primary path, $z(f,t)$ is the output signal from the reference path, $\gamma$ represents the over-subtraction parameter, and $\delta$ denotes the flooring parameter.

Fig. 1. Block diagram of BSSA.

3. PROPOSED METHOD

3.1. Motivation

In a previous study, a closed-form solution of SO-ICA was proposed by one of the authors [4], who showed that simple algebraic calculations enable the separation of mixed signals without iterative filter updating. This finding has motivated us to combine closed-form SO-ICA and estimation of the sources' PDFs at a small computational cost.

Our algorithm consists of three stages, namely, closed-form SO-ICA for roughly separating the sources, kurtosis-based activation function estimation applied to the roughly separated signals, and post-HO-ICA with the optimized activation function to increase the separation accuracy. This strategy is very reasonable because the preceding SO-ICA can separate the sources to some extent regardless of the sources' PDFs (there is no activation function in SO-ICA), and then we can identify the PDFs after SO-ICA. Hereinafter we describe the detailed algorithm.

3.2. First stage: closed-form SO-ICA

This subsection briefly describes the overview of signal processing in the closed-form SO-ICA (see Ref. [4] for more details). First, we obtain the correlation matrices with different time points as

$$ R_{\tau_i}(f) = \left\langle x(f,t)\, x(f,t)^{\mathrm H} \right\rangle_{t \in \tau_i}, \tag{6} $$

where $\langle \cdot \rangle_{t \in \tau_i}$ denotes the time-averaging operator over a specific time duration $\tau_i$, and $i = 1, 2, \ldots$ represent indices of the time-averaging blocks. Next, we apply the singular value decomposition (SVD) to a superposition of $R_{\tau_i}(f)$, which is represented as

$$ \sum_i R_{\tau_i}(f) = U(f)\,\mathrm{diag}(\lambda_1, \lambda_2, \ldots)\,U(f)^{\mathrm H}, \tag{7} $$

where $\lambda_k$ are the eigenvalues, $\mathrm{diag}(\lambda_1, \ldots)$ denotes the diagonal matrix which includes the eigenvalues, and $U(f)$ is the matrix consisting of the eigenvectors. Then we obtain a full-rank decomposition for the pseudo-inverse of $\sum_i R_{\tau_i}(f)$:

$$ \Bigl[ \sum_i R_{\tau_i}(f) \Bigr]^{+} = L(f)\,L(f)^{\mathrm H}, \tag{8} $$

$$ L(f) = U(f)\,\mathrm{diag}\!\left( 1/\sqrt{\lambda_1},\, 1/\sqrt{\lambda_2},\, \ldots \right). \tag{9} $$

It can be proved [4] that if the covariance of the sources $s(f,t)$ in $\tau_i$ is negligible, every $L(f)^{\mathrm H} R_{\tau_i}(f) L(f)$ for any $i$ shares the same eigenvectors, and this is given via the SVD form as

$$ L(f)^{\mathrm H} R_{\tau_i}(f) L(f) = T(f)\,\mathrm{diag}(\sigma_1(\tau_i), \sigma_2(\tau_i), \ldots)\,T(f)^{\mathrm H}, \tag{10} $$

where $\sigma_k(\tau_i)$ are the eigenvalues for a specific time block $\tau_i$, and $T(f)$ denotes the matrix consisting of the shared eigenvectors, which are independent of the time-block index $i$. Therefore, for any $i$, the simultaneous diagonalization of $R_{\tau_i}(f)$ can be achieved as follows:

$$ T(f)^{\mathrm H} L(f)^{\mathrm H} R_{\tau_i}(f) L(f) T(f) = \mathrm{diag}(\sigma_1(\tau_i), \sigma_2(\tau_i), \ldots), \tag{11} $$

and this means that the optimal separation filter matrix in the second-order sense is given by

$$ W_{\mathrm{SO}}(f) = \left( L(f)\,T(f) \right)^{\mathrm H}. \tag{12} $$

Note that, for the calculation of $T(f)$ in (10), it is sufficient to apply a single SVD to an arbitrary single time block $\tau_i$ because of the eigenvector-sharing property. So the total computation of closed-form SO-ICA is quite small, almost the same as that of one iteration in HO-ICA.

3.3. Second stage: kurtosis-based activation function estimation

After closed-form SO-ICA, the roughly separated sources are obtained. Therefore, we can estimate the PDFs of the sources via the following generalized Gaussian distribution (GGD) [11] modeling and its kurtosis. This result of PDF estimation will be utilized for HO-ICA in the next stage (third stage). Hereafter $y(t)$ means the real part of the separated signal.

GGD is a flexible family of PDF models with some variable parameters, and GGD can represent various types of well-known
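The spectral-subtraction rule (5) of BSSA can be sketched per time-frequency bin as below. This is a minimal illustration, not the authors' implementation: the function name `bssa_output` is hypothetical, and the paper does not state values for the over-subtraction parameter γ or the flooring parameter δ, so both must be supplied by the caller.

```python
import math


def bssa_output(y_ds, z, gamma, delta):
    """Amplitude of the BSSA output for one time-frequency bin, following (5).

    y_ds  : complex output of the primary (delay-and-sum) path.
    z     : complex noise estimate from the ICA-based reference path.
    gamma : over-subtraction parameter (value not given in the paper).
    delta : flooring parameter (value not given in the paper).
    """
    residual_power = abs(y_ds) ** 2 - gamma * abs(z) ** 2
    if residual_power >= 0.0:
        # Enough target power remains: subtract the estimated noise power.
        return math.sqrt(residual_power)
    # Over-subtraction would go negative: fall back to a floored amplitude.
    return delta * abs(y_ds)
```

For a whole frame, the function would simply be mapped over the bins, e.g. `[bssa_output(y, n, gamma, delta) for y, n in zip(y_ds_bins, z_bins)]`, where `y_ds_bins` and `z_bins` are hypothetical lists of per-bin values. Note that (5) defines an amplitude only; how phase is handled is not described in this section.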

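The closed-form SO-ICA steps (6) to (12) can be sketched for the two-microphone case as follows. This is an illustrative sketch under stated assumptions: it is restricted to 2 x 2 matrices so that the decomposition has a closed form in the standard library, it uses a Hermitian eigendecomposition where the paper says SVD (the two coincide for Hermitian positive semidefinite matrices), and all helper names (`matmul`, `herm`, `eigh2`, `corr`, `closed_form_so_ica`) are hypothetical.

```python
import math


def matmul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def herm(A):
    """Hermitian (conjugate) transpose."""
    n = len(A)
    return [[A[j][i].conjugate() for j in range(n)] for i in range(n)]


def eigh2(M):
    """Eigenvalues and unit eigenvectors (as columns) of a 2x2 Hermitian matrix."""
    a, d, b = M[0][0].real, M[1][1].real, M[0][1]
    if abs(b) < 1e-15:                      # already diagonal
        return [a, d], [[1.0, 0.0], [0.0, 1.0]]
    mid = (a + d) / 2.0
    rad = math.sqrt(((a - d) / 2.0) ** 2 + abs(b) ** 2)
    vals = [mid + rad, mid - rad]
    cols = []
    for lam in vals:
        v = [b, lam - a]                    # solves (M - lam I) v = 0
        norm = math.sqrt(abs(v[0]) ** 2 + abs(v[1]) ** 2)
        cols.append([v[0] / norm, v[1] / norm])
    U = [[cols[0][0], cols[1][0]], [cols[0][1], cols[1][1]]]
    return vals, U


def corr(frames):
    """Time-averaged correlation matrix <x x^H> over one block, as in (6)."""
    n = len(frames)
    return [[sum(x[i] * x[j].conjugate() for x in frames) / n
             for j in range(2)] for i in range(2)]


def closed_form_so_ica(blocks):
    """W_SO for one frequency bin from a list of time blocks of 2-vectors."""
    Rs = [corr(block) for block in blocks]
    R_sum = [[sum(R[i][j] for R in Rs) for j in range(2)] for i in range(2)]
    lam, U = eigh2(R_sum)                                   # (7)
    # (9): L = U diag(1/sqrt(lambda_k)); assumes R_sum has full rank.
    L = [[U[i][k] / math.sqrt(lam[k]) for k in range(2)] for i in range(2)]
    # (10): a single decomposition of one block is sufficient.
    _, T = eigh2(matmul(herm(L), matmul(Rs[0], L)))
    return herm(matmul(L, T))                               # (12)
```

When the sources are uncorrelated within each block and their power ratio differs between blocks, the product of the returned matrix and the mixing matrix is a scaled permutation, which is the sense in which the sources are "roughly separated" before the later stages.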
(3) PDFs, e.g., Gaussian and Laplacian distributions. The definition of GGD is as follows:

$$ p_{\mathrm{GGD}}(z; \alpha, \beta) = \frac{\beta}{2 \alpha \Gamma(1/\beta)} \exp\!\left( - \left( \frac{|z - \bar{z}|}{\alpha} \right)^{\beta} \right), \tag{13} $$

where $\bar{z}$ is the mean of variable $z$, $\Gamma(x) = \int_0^{\infty} t^{x-1} \exp(-t)\,dt$ is the gamma function, $\alpha$ is the scale parameter, and $\beta$ is the shape parameter of GGD. Figure 2 shows examples of different PDFs in GGD; note that $\beta = 2$ corresponds to the Gaussian PDF and that $\beta = 1$ corresponds to the Laplacian PDF.

Fig. 2. GGD with typical scale parameter α and shape parameter β.

If the source's PDF obeys GGD, we can easily derive the appropriate activation function (taking $\bar{z} = 0$) as

$$ \varphi(z; \alpha, \beta) = -\frac{\partial}{\partial z} \log p_{\mathrm{GGD}}(z; \alpha, \beta) = \begin{cases} \dfrac{\beta}{\alpha^{\beta}}\, z^{\beta - 1} & (z \ge 0), \\[2mm] -\dfrac{\beta}{\alpha^{\beta}}\, |z|^{\beta - 1} & (z < 0). \end{cases} \tag{14} $$

Figure 3 depicts examples of activation functions for GGD.

Fig. 3. Activation function for GGD with typical shape parameter β.

The estimation of the shape parameter $\beta$ is the most important issue here because $\beta$ dominantly determines the activation function (see (14)), whereas the scale parameter $\alpha$ becomes negligible through scale normalization. We can introduce kurtosis to estimate $\beta$. In general, the kurtosis of signal $y(t)$ is given by

$$ \mathrm{kurt}(y(t)) = \frac{\langle y^4(t) \rangle_t}{\langle y^2(t) \rangle_t^{2}} - 3. \tag{15} $$

The $n$-th order moment of GGD has the following useful relationship:

$$ \langle z^n \rangle = \alpha^n\, \frac{\Gamma\!\left( \frac{n+1}{\beta} \right)}{\Gamma\!\left( \frac{1}{\beta} \right)}. \tag{16} $$

From (15) and (16) we have

$$ \mathrm{kurt}(y(t)) = \frac{\Gamma\!\left( \frac{5}{\beta} \right) \Gamma\!\left( \frac{1}{\beta} \right)}{\Gamma\!\left( \frac{3}{\beta} \right)^{2}} - 3. \tag{17} $$

This is a monotonically decreasing function of $\beta$. Thus, we can estimate the shape parameter $\beta$ by measuring the kurtosis and using the inverse relationship of (17) in a table-lookup manner.

In summary, we measure the kurtosis of each of the signals separated by closed-form SO-ICA, and then we can adaptively determine the corresponding activation function by (14) for each sound source. The required computation is only one kurtosis calculation just after SO-ICA, and consequently our method is computationally efficient in comparison with ML-based activation function estimation. A possible drawback of the proposed method is PDF estimation error due to poor separation accuracy in SO-ICA. Thus, PDF estimation performance is highly related to how easy the source separation is, e.g., to the reverberant conditions.

3.4. Third stage: nonclosed-form HO-ICA

The separation filter matrix $W_{\mathrm{SO}}(f)$ obtained by SO-ICA often provides insufficient source-separation performance. To polish up the separation filter matrix and gain further performance, we propose to apply nonclosed-form HO-ICA after closed-form SO-ICA, employing the source-PDF-adapted activation function. This strategy regards the separation filter matrix $W_{\mathrm{SO}}(f)$ as an initial value for HO-ICA's iterative learning given by (3). In general, HO-ICA suffers from the poor and slow convergence of nonlinear optimization. In the proposed method, however, the preceding closed-form SO-ICA can give a better initial state for HO-ICA, and the preceding PDF estimation enables HO-ICA to use a more appropriate activation function. This combination mitigates the drawback of poor convergence.

4. EXPERIMENTAL EVALUATION FOR ALGORITHM

To evaluate the efficacy of the proposed method, we carried out noise reduction experiments in a real reverberant room where two omnidirectional microphones are set. The reverberation time (RT) in this room is 200 ms. The target speech signal arrives from a fixed direction and is spoken by two male and two female speakers. As for the noise, a diffuse noise recorded in an actual railway station is emitted from 36 surrounding loudspeakers. The noise reduction rate (NRR) [8], defined as the output signal-to-noise ratio (SNR) in dB minus the input SNR in dB, is used as the objective indication of separation performance.

Figure 4 shows the convergence curves of NRR for the noise estimation part in BSSA (batch processing for 3-second data). We compare simple ICA (the activation function is fixed to tanh(·)), ICA with ML-based activation function estimation, and the proposed method. The horizontal axis represents the total computational cost, which is almost equal to the number of iterations in the HO-ICA part multiplied by the number of frequency bins, including the additional computations to estimate the activation function. From the results, we can see that the proposed method has fast and high convergence performance.

Figure 5 depicts the resultant NRR of speech extraction in the BSSA output, where this BSSA is implemented to run in real time [10]. To simulate a realistic spoken dialogue system, we made a test input consisting of noise-only periods (0-35 s and 55-100 s) and noise-speech mixing parts (35-55 s, where the speech DOA is -20°, and 100-120 s, where the speech DOA is 20°). We confirm that the proposed method remains highly effective in real-time operation.

5. HANDS-FREE ROBOT SPOKEN DIALOGUE SYSTEM

Recently we developed a hands-free robot spoken dialogue system using the proposed real-time BSSA, which is mainly used for railway station guidance in a noisy environment (see Fig. 6). To evaluate the system, a speech recognition test was conducted
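The second stage (kurtosis measurement, table lookup of β via (17), and the GGD activation (14)) can be illustrated with the short sketch below. The function names, the β grid (0.2 to 4.0 in steps of 0.01), and the nearest-entry lookup are assumptions chosen for this example; the paper only states that the inverse of (17) is evaluated "in a table-lookup manner".

```python
import math


def ggd_kurtosis(beta):
    """Kurtosis of a GGD with shape parameter beta, following (17)."""
    g = math.gamma
    return g(5.0 / beta) * g(1.0 / beta) / g(3.0 / beta) ** 2 - 3.0


def sample_kurtosis(y):
    """Kurtosis (15) of a real-valued, zero-mean sequence y(t)."""
    n = len(y)
    m2 = sum(v * v for v in y) / n
    m4 = sum(v ** 4 for v in y) / n
    return m4 / (m2 * m2) - 3.0


def build_lookup(beta_min=0.2, beta_max=4.0, steps=381):
    """Table of (kurtosis, beta) pairs; the grid is an illustrative choice."""
    betas = [beta_min + i * (beta_max - beta_min) / (steps - 1)
             for i in range(steps)]
    return [(ggd_kurtosis(b), b) for b in betas]


def estimate_beta(kurt, table):
    """Nearest-entry inverse of (17); kurtosis outside the table range
    is clipped to the closest tabulated beta."""
    return min(table, key=lambda row: abs(row[0] - kurt))[1]


def ggd_activation(z, beta, alpha=1.0):
    """Activation function (14) for real z, assuming zero mean.

    For beta < 1 the expression diverges at z = 0, so callers would need
    to guard that case.
    """
    magnitude = beta / alpha ** beta * abs(z) ** (beta - 1.0)
    return magnitude if z >= 0 else -magnitude
```

A typical use would be `beta = estimate_beta(sample_kurtosis(y), build_lookup())` on the real part of each SO-ICA output, followed by `ggd_activation(value, beta)` inside the HO-ICA iteration. As a sanity check of (17), β = 1 (Laplacian) gives a kurtosis of 3 and β = 2 (Gaussian) gives 0.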

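The noise reduction rate used in the evaluation (output SNR in dB minus input SNR in dB) can be computed as in the sketch below. It assumes that the target and noise components are available separately at both the input and the output, which is typical when they are recorded or processed separately; the paper does not describe its measurement procedure, and the function names are hypothetical.

```python
import math


def power(signal):
    """Total power of a real or complex sample sequence."""
    return sum(abs(v) ** 2 for v in signal)


def snr_db(target, noise):
    """Signal-to-noise ratio in dB between separate target and noise parts."""
    return 10.0 * math.log10(power(target) / power(noise))


def nrr_db(target_in, noise_in, target_out, noise_out):
    """Noise reduction rate: output SNR [dB] minus input SNR [dB]."""
    return snr_db(target_out, noise_out) - snr_db(target_in, noise_in)
```

For example, if processing leaves the target unchanged and scales the noise amplitude by 0.1, the NRR is about 20 dB.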
(4) in a reverberant room where RT is more than 400 ms. The target speech is uttered in front of a microphone array at a distance of 1.5 m. We use 5 speakers (250 words) as the target utterances. As for noise, two noises were added simultaneously. The first noise is a diffuse noise recorded in an actual railway station and emitted from 8 surrounding loudspeakers (it simulates railway-station noise). The second noise is an interference speech located at 50 degrees to the right of the microphone array, at a distance of 2.0 m. An eight-element array with an interelement spacing of 2 cm is used.

Figure 7 gives a comparative assessment example from the viewpoint of preprocessing microphone array methods, i.e., the conventional DS, ICA, or the proposed BSSA. The results reveal that both the word correct and the word accuracy of the proposed BSSA are obviously superior to those of the conventional DS and ICA, and our proposed system notably sustains a recognition accuracy of more than 80%. The demonstration movie of the robot dialogue system is available at the following URL, where readers can confirm the fluent conversation. Demo video: http://spalab.naist.jp/database/Demo/rtbssa/

Fig. 4. Noise estimation performance where speech direction is (a) -40° and (b) 30°. (Horizontal axis: computational cost.)

Fig. 5. Real-time implementation results, where fixed activation function is used in ICA and BSSA.

Fig. 6. Appearance of robot spoken dialogue system.

Fig. 7. Comparison of preprocessing methods (DS, ICA, BSSA) in (a) word correct, and (b) word accuracy.

6. CONCLUSION

In this paper, first, we proposed a new efficient BSS algorithm combining closed-form SO-ICA and source-PDF-adaptive HO-ICA, where the activation function in HO-ICA is optimized by using information from closed-form ICA. This enables us to improve the separation accuracy while saving computational cost. Secondly, we demonstrated our recently developed hands-free robot spoken dialogue system, and showed that the proposed system can work under adverse conditions such as railway station environments.

7. REFERENCES

[1] K. Nakadai, et al., "Applying scattering theory to robot audition system: robust sound source localization and extraction," Proc. IROS-2003, pp. 1147-1152, 2003.
[2] R. Prasad, et al., "Robots that can hear, understand and talk," Advanced Robotics, vol. 18, pp. 533-564, 2004.
[3] P. Comon, "Independent component analysis, a new concept?," Signal Processing, vol. 36, pp. 287-314, 1994.
[4] A. Tanaka, et al., "Theoretical foundations of second-order-statistics-based blind source separation for non-stationary sources," Proc. ICASSP, pp. III-600-III-603, 2006.
[5] P. Smaragdis, "Blind separation of convolved mixtures in the frequency domain," Neurocomputing, vol. 22, pp. 21-34, 1998.
[6] S. Ikeda and N. Murata, "A method of blind source separation based on temporal structure of signals," Proc. ICONIP, pp. 737-742, 1998.
[7] S. Haykin (ed.), Unsupervised Adaptive Filtering, Volume 1: Blind Source Separation, John Wiley & Sons, 2000.
[8] H. Saruwatari, et al., "Blind source separation based on a fast-convergence algorithm combining ICA and beamforming," IEEE Trans. Audio, Speech & Language Process., vol. 14, pp. 666-678, 2006.
[9] Y. Takahashi, et al., "Blind spatial subtraction array with independent component analysis for hands-free speech recognition," Proc. IWAENC, 2006.
[10] Y. Takahashi, et al., "Real-time implementation of blind spatial subtraction array for hands-free robot spoken dialogue system," Proc. IROS, pp. 1687-1692, 2008.
[11] G. Box, et al., Bayesian Inference in Statistical Analysis, Addison Wesley, Reading, Massachusetts, 1973.
