Double-Talk Free Spoken Dialogue Interface Combining Sound Field Control with Semi-Blind Source Separation
points $C_k(\omega)$ ($k = 1, \dots, K+2$) are described by the $(K+2) \times M$ matrix $G(\omega)$ whose entries are the room transfer functions $g_{km}(\omega)$. To reproduce the input signals $r(\omega)$ at the control points $C_k(\omega)$, we design an $M \times (K+2)$ inverse filter matrix $H(\omega)$, composed of $h_{mk}(\omega)$ ($m = 1, \dots, M$, $k = 1, \dots, K+2$), by calculating the Moore-Penrose generalized inverse matrix of $G(\omega)$. Then, we truncate the matrix $H(\omega)$ into $H'(\omega)$, an $M \times 2$ filter matrix composed of the filter components $h_{mk'}(\omega)$ ($m = 1, \dots, M$, $k' = K+1, K+2$) of $H(\omega)$. With this filter matrix, the following equation holds:

$$d(\omega) = G(\omega) H'(\omega) r(\omega) = [\underbrace{0, \dots, 0}_{K}, r_R(\omega), r_L(\omega)]^T. \tag{2}$$

Therefore, on the one hand, the response sound signals are reproduced strictly at the user's ears ($[d_{K+1}(\omega), d_{K+2}(\omega)] = [r_R(\omega), r_L(\omega)]$). On the other hand, silent zones are realized at the microphone elements ($d_k(\omega) = 0$ for $k = 1, \dots, K$), so the response sound is prevented from being observed at the microphone elements. Then, delay-and-sum array signal processing is applied to the observed signals.

Since the MOMNI method uses an inverse filter of the room transfer functions, three-dimensional sound field reproduction can be presented. To make full use of this property, we generate the response sound signals $(r_R(\omega), r_L(\omega))$ by multiplying a monaural source of the response sound signal $r_{src}(\omega)$ by the room transfer functions $g_{pri}(\omega) = [g_{priR}(\omega), g_{priL}(\omega)]^T$ between a primary sound source and both of the user's ears, as

$$[r_R(\omega), r_L(\omega)]^T = g_{pri}(\omega) r_{src}(\omega). \tag{3}$$

This mechanism can present the source position of an agent of the dialogue system with high precision.

Fig. 2. Configuration of the simple connection of BSS with the MOMNI method.

2.2. Response Sound Elimination Error when Changing Room Transfer Functions

The MOMNI method can make its control robust against fluctuation of the room transfer functions. Assume that the number of loudspeakers $M$ is sufficiently larger than the number of control points, and that the condition number of the inverse filter matrix approaches 1.
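The inverse-filter design around (2) can be checked numerically at a single frequency bin. The following is a minimal sketch, not the authors' implementation: the room transfer functions are random complex gains standing in for measured ones, the perturbation `dG` is a hypothetical fluctuation, and all variable names are illustrative. The values $M = 8$ and $K = 2$ are chosen only as an example.

```python
import numpy as np

rng = np.random.default_rng(0)
M, K = 8, 2  # loudspeakers and microphone elements (example values)

# Hypothetical room transfer functions at one frequency bin:
# rows = control points (K microphones, then right and left ear), columns = loudspeakers.
G = rng.standard_normal((K + 2, M)) + 1j * rng.standard_normal((K + 2, M))

H = np.linalg.pinv(G)   # M x (K+2) Moore-Penrose generalized inverse of G
H_trunc = H[:, K:]      # M x 2 truncated filter H': columns for the two ear control points

r = np.array([0.7 - 0.2j, -0.3 + 0.5j])  # arbitrary test values for [r_R, r_L]
d = G @ H_trunc @ r                       # signals at the K + 2 control points, as in (2)

# Hypothetical small fluctuation of the room transfer functions.
dG = 0.05 * (rng.standard_normal(G.shape) + 1j * rng.standard_normal(G.shape))
leak = ((G + dG) @ H_trunc @ r)[:K]       # response sound now leaking into the microphones

print("microphone signals before fluctuation:", np.abs(d[:K]))
print("ear signals before fluctuation:", d[K:])
print("microphone leakage after fluctuation:", np.abs(leak))
```

Before the fluctuation the microphone rows of `d` vanish and the ear rows equal `r`; after it, the leakage equals the perturbed rows applied to `H_trunc @ r`, which is the component that any later processing stage has to remove.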
Under these assumptions, it is proved that the elimination error after a fluctuation of the room transfer functions is proportional to $1/\sqrt{MK}$ [4]. Therefore, the robustness of the MOMNI method against fluctuation of the room transfer functions is improved by increasing the number of loudspeakers and microphone elements.

3. INTRODUCING INDEPENDENT COMPONENT ANALYSIS TO MOMNI METHOD

In this section we propose an algorithm that applies ICA after the sound field control of the MOMNI method. The conventional MOMNI method adopts delay-and-sum array signal processing with fixed filter coefficients. If some adaptive array signal processing is applied, the MOMNI method can gain robustness not only against fluctuation of the room transfer functions but also against environmental noise or another talker. Though most adaptive array signal processing methods require information on single-talk durations, BSS based on independent component analysis can learn its filter coefficients from the observed signals alone. In this paper we assume that there is no additional noise and discuss only the elimination of the response sound mixed into the observed signal by fluctuation of the room transfer functions. However, in case there is some additional noise, the proposed method can separate the user's speech from the additional noise by increasing the number of microphone elements and the size of the filter matrix of ICA.

3.1. Simple Connection of BSS with MOMNI Method

The simplest idea is just to connect BSS with the MOMNI method as shown in Fig. 2. We define an $M$-dimensional vector $g_k(\omega)$ ($k = 1, \dots, K$) composed of the room transfer functions $g_{km}(\omega)$ ($m = 1, \dots, M$) between the $k$-th microphone element and all the $M$ loudspeakers before fluctuation. Then we define the room transfer function after fluctuation, $g'_k(\omega)$, as

$$g'_k(\omega) = g_k(\omega) + \Delta g_k(\omega), \tag{4}$$

where $\Delta g_k(\omega)$ is the difference between $g_k(\omega)$ and $g'_k(\omega)$. If the input signals are given by (3), then $g_k(\omega) H'(\omega) = 0$ and the observed signal $X_k(\omega)$ at the $k$-th microphone element is given by

$$X_k(\omega) = g'_k(\omega) H'(\omega) g_{pri}(\omega) r_{src}(\omega) + S_k(\omega) = \Delta g_k(\omega) H'(\omega) g_{pri}(\omega) r_{src}(\omega) + S_k(\omega), \tag{5}$$

where $S_k(\omega)$ is the component of the user's utterance observed at the $k$-th microphone element.

Equation (5) shows that the number of independent signals included in $X_k(\omega)$ is two, so separation can be achieved by using two observed signals. Therefore this method uses two microphone elements ($K = 2$) and inputs the observed signals of these microphone elements to frequency-domain ICA (FD-ICA). We define the $2 \times 2$ separation filter matrix $W(\omega)$ as

$$y(\omega) = W(\omega) x(\omega) = \begin{bmatrix} w_{11}(\omega) & w_{12}(\omega) \\ w_{21}(\omega) & w_{22}(\omega) \end{bmatrix} \begin{bmatrix} X_1(\omega) \\ X_2(\omega) \end{bmatrix}, \tag{6}$$

where the two-dimensional column vector $y(\omega) = [Y_1(\omega), Y_2(\omega)]^T$ describes the output signals. FD-ICA updates its filter $W(\omega)$ to make its output signals statistically independent. The update of the filter coefficients is given by

$$W^{++}(\omega) = W(\omega) + \eta \{ I - \langle \phi(y(\omega, t)) y^H(\omega, t) \rangle_t \} W(\omega), \tag{7}$$

where $W^{++}(\omega)$ is the updated filter, $y(\omega, t)$ is $y(\omega)$ observed at time $t$, $\langle \cdot \rangle_t$ is a time average operator, $\eta$ is a step-size parameter, and $\phi(\cdot)$ is an activation function such as the polar function [5] given by

$$\phi(y(\omega)) = \begin{bmatrix} \tanh(|Y_1(\omega)|) \exp(j \arg(Y_1(\omega))) \\ \tanh(|Y_2(\omega)|) \exp(j \arg(Y_2(\omega))) \end{bmatrix}. \tag{8}$$

Since the gain at each frequency is arbitrary in FD-ICA, its output signals are distorted. To compensate for this, projection back [6] is applied. In this case, the output signals $p(\omega) = [P_1(\omega), P_2(\omega)]^T$ processed by projection back can be written as

$$p(\omega) = \mathrm{diag}\left( W^{-1}(\omega) \begin{bmatrix} Y_1(\omega) & 0 \\ 0 & Y_2(\omega) \end{bmatrix} \right), \tag{9}$$

where $\mathrm{diag}(\cdot)$ is an operator that makes a vector composed of the diagonal components of its argument.

In the learning of FD-ICA, a null beamformer with some reasonable directivity pattern is often used as an initial filter. In addition, since the filter coefficients of FD-ICA are learned separately at each frequency, permutation ambiguity occurs. To align the permutation, a directivity pattern of the separation filter is utilized [7]. However, as shown in (5), the observed response sound is multiplied not by a room transfer function but by a difference of room transfer functions, so it is difficult to find reliable directivity patterns. Therefore, we cannot expect this method to perform as well as ordinary BSS.

3.2. Proposed Method: Semi-Blind Source Separation with Observed Signal of a Microphone and Direct Input of Response Sound

Since the response sound signal $r_{src}(\omega)$ is known to the system, we can use this signal as an input signal of ICA. Therefore, in the proposed method, we use only one microphone element as shown in Fig. 3 and learn the separation filter of (6) in which $x(\omega) = [X_1(\omega), r_{src}(\omega)]^T$ is substituted. Then, if we try to make the output signal $Y_2(\omega)$ include only the component of $r_{src}(\omega)$, that condition can be satisfied by setting $w_{21}(\omega) = 0$, because

$$Y_2(\omega) = w_{21}(\omega) X_1(\omega) + w_{22}(\omega) r_{src}(\omega) = w_{22}(\omega) r_{src}(\omega). \tag{10}$$

Therefore, by setting $w_{21}(\omega) = 0$ as an initial value, the learning can be started from the state where one of the signals is already separated. Since the separation of one signal is finished, this separation problem is now neither blind nor unsupervised. We call it semi-blind source separation. Although the update of (7) changes the value of $w_{21}(\omega)$, the semi-blind condition can be maintained by substituting $w_{21}(\omega) = 0$ in every iteration. Under this constraint that $w_{21}(\omega)$ is zero, $Y_1(\omega)$ is updated to be statistically independent of $Y_2(\omega) = w_{22}(\omega) r_{src}(\omega)$, and the independence is satisfied if and only if

$$Y_1(\omega) = C(\omega) S_1(\omega), \tag{11}$$

where $C(\omega)$ is an arbitrary value. Since $Y_1(\omega)$ can be written as

$$Y_1(\omega) = w_{11}(\omega) X_1(\omega) + w_{12}(\omega) r_{src}(\omega) = \left( w_{11}(\omega) \Delta g_k(\omega) H'(\omega) g_{pri}(\omega) + w_{12}(\omega) \right) r_{src}(\omega) + w_{11}(\omega) S_1(\omega), \tag{12}$$

the condition (11) yields

$$\left( w_{11}(\omega) \Delta g_k(\omega) H'(\omega) g_{pri}(\omega) + w_{12}(\omega) \right) r_{src}(\omega) = 0, \qquad \frac{w_{12}(\omega)}{w_{11}(\omega)} = -\Delta g_k(\omega) H'(\omega) g_{pri}(\omega). \tag{13}$$

Therefore, the separation filter is optimum only when $w_{12}(\omega) / w_{11}(\omega)$ identifies the negative of the transfer function from the input of the inverse filter to the microphone element. In fact, the output signal $p(\omega)$ of the projection back in (9) is given by

$$p(\omega) = \begin{bmatrix} X_1(\omega) + \dfrac{w_{12}(\omega)}{w_{11}(\omega)} r_{src}(\omega) \\ r_{src}(\omega) \end{bmatrix}. \tag{14}$$

Substituting (5) and (13) into (14) shows that its first element reduces to $S_1(\omega)$.

On the one hand, BSS aims to make an inverse filter of the transfer system and requires a filter longer than the transfer functions. To obtain good performance with a long filter, FD-ICA requires long input signals. On the other hand, the proposed semi-blind source separation requires only a filter of the same length as that of the transfer system. In addition, since increasing the number of microphone elements in the MOMNI method lowers the stability of the sound field control, using one microphone element fewer is beneficial to the MOMNI method.

4. SIMULATION

In this section, we present two experiments in which the proposed method is compared with the conventional methods, i.e., an acoustic echo canceller (AEC) and the MOMNI method, and with the simple connection of BSS to the MOMNI method discussed in Sect. 3.1. To validate the robustness of the proposed method against fluctuation of the room transfer functions, we perform a response sound elimination experiment in which changes in the transfer functions are simulated. Then we evaluate the performance of each method in a speech recognition experiment to verify the applicability of the proposed method to a spoken dialogue system.

4.1. Experimental Conditions

Figure 4 shows the arrangement of the apparatuses. We placed a dummy head with an average human head and upper body at the user's position. We designed the filters used in the MOMNI method and the proposed method with the room transfer functions before fluctuation. We gave the AEC the room transfer functions before fluctuation as its filter coefficients, assuming that its adaptation had been performed accurately without errors before the fluctuation of the transfer functions. After the fluctuation, however, the adaptation could not be performed due to double-talk. We evaluated the performances with the average over 12 kinds of impulse responses caused by movements of a mannequin. The interelement spacing was 30 cm with the conventional MOMNI method, and 6 cm with the simple connection of BSS. The sampling frequency was 16 kHz. In the learning of ICA, we used the first 5 seconds of the input signals. The length of the separation filters is 2048 taps.

Fig. 4. Layout of acoustic environment room.
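The semi-blind learning of Sect. 3.2 can be illustrated at a single frequency bin. The following is a minimal sketch under stated assumptions, not the authors' implementation: the sources are synthetic circular complex signals, the scalar `a` is a hypothetical stand-in for the leakage term $\Delta g_k(\omega) H'(\omega) g_{pri}(\omega)$, and the step size, iteration count, and function names are illustrative. The update is the natural-gradient step in the form of (7) with the polar activation of (8), and $w_{21}$ is reset to zero after every iteration.

```python
import numpy as np

def polar_tanh(y):
    """Polar activation as in (8): tanh(|y|) * exp(j * arg(y)), elementwise."""
    return np.tanh(np.abs(y)) * np.exp(1j * np.angle(y))

def semi_blind_ica(x1, r_src, eta=0.1, n_iter=500):
    """Learn a 2x2 separation filter for x = [X1, r_src]^T with w21 fixed at 0."""
    x = np.vstack([x1, r_src])             # 2 x T observations over time frames
    n_frames = x.shape[1]
    W = np.eye(2, dtype=complex)           # initial filter already satisfies w21 = 0
    for _ in range(n_iter):
        y = W @ x
        R = polar_tanh(y) @ y.conj().T / n_frames   # time average <phi(y) y^H>_t
        W = W + eta * (np.eye(2) - R) @ W           # natural-gradient step in the form of (7)
        W[1, 0] = 0.0                               # re-impose the semi-blind constraint
    return W

def projection_back(W, x):
    """Projection back as in (9): scale each output by the diagonal of W^{-1}."""
    return np.diag(np.linalg.inv(W))[:, None] * (W @ x)

# Synthetic single-bin data (assumed, for illustration only).
rng = np.random.default_rng(0)
n = 4000
s1 = rng.exponential(size=n) * np.exp(2j * np.pi * rng.random(n))     # user's speech
r_src = rng.exponential(size=n) * np.exp(2j * np.pi * rng.random(n))  # response sound
a = 0.6 * np.exp(0.8j)        # hypothetical leakage gain after fluctuation
x1 = a * r_src + s1           # observed microphone signal, as in (5)

W = semi_blind_ica(x1, r_src)
p = projection_back(W, np.vstack([x1, r_src]))
a_hat = -W[0, 1] / W[0, 0]    # estimate of the leakage gain implied by (13)

print("true leakage gain:", a, "estimated:", a_hat)
print("residual response sound power:", np.mean(np.abs(p[0] - s1) ** 2))
```

In this sketch the first projected output approaches the user's speech as `a_hat` approaches `a`, while the second output stays equal to the known response sound, which mirrors (13) and (14). A full system would run such an update independently in every frequency bin.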
4.2. Evaluation of Response Sound Elimination

Figure 5 shows the signal-to-noise ratios of the observed signal ($\mathrm{SNR}_{obs}$) and of the final output signal ($\mathrm{SNR}_{out}$) of the system. These SNRs are simply the power ratios of the user's speech to the response sound. Therefore, distortion of the spectrum does not influence these scores. When two microphone elements are used, we evaluated their average.

Fig. 5. Comparison of $\mathrm{SNR}_{obs}$ and $\mathrm{SNR}_{out}$ of (a) acoustic echo canceller, (b) MOMNI with 1 microphone element, (c) MOMNI with delay-and-sum with 2 microphone elements, (d) simple connection of ICA and MOMNI with two microphone elements and (e) proposed method.

Regarding $\mathrm{SNR}_{obs}$, the result with one microphone element shows higher performance than that with two microphone elements. However, by the effect of delay-and-sum array signal processing, $\mathrm{SNR}_{out}$ with two elements is recovered to the same level as with one element. This reveals that the condition of eight loudspeakers and two microphones is a hard condition for stable control of the MOMNI method, and that its performance does not agree with the law described in Sect. 2.2 that the error is proportional to $1/\sqrt{MK}$. In the simple combination of BSS and the MOMNI method, BSS cannot improve the SNR over that of its input because of its poor initial filter and the difficulty of solving the permutation. However, the proposed method improves $\mathrm{SNR}_{out}$ considerably. This shows the efficacy of semi-blind source separation.

4.3. Speech Recognition Experiment

The effect of the response sound elimination is evaluated using a large vocabulary continuous speech recognition task. To evaluate the speech recognition performance, we adopt word accuracy (WA) as an evaluation score [8]. Table 1 lists the experimental conditions for the speech recognition.

Table 1. Experimental conditions for speech recognition.
  Task:            Newspaper dictation from JNAS [8]
  Feature vector:  12 MFCCs, 12 ΔMFCCs, Δpower
  Language model:  Newspaper dictation with 20,000 words
  Phoneme model:   Phonetic Tied Mixture (PTM) [8]
  Decoder:         Julius ver. 3.4.2 standard [8]
  User's speech:   200 sentences (23 males and 23 females)
  Response sound:  female utterance

Figure 6 shows the WAs for all the combinations. All the scores in the graph are almost proportional to the SNRs, except for the simple connection of BSS and the MOMNI method. Because of the permutation problem discussed in Sect. 3.1, the simple connection has large distortion and its performance is worse than that of the MOMNI method. The proposed method is not much affected by permutation and shows the highest performance.

Fig. 6. Comparison of WAs of (a) acoustic echo canceller, (b) MOMNI with 1 microphone element, (c) MOMNI with delay-and-sum with 2 microphone elements, (d) simple connection of ICA and MOMNI with two microphone elements and (e) proposed method.

5. CONCLUSION

We proposed a semi-blind source separation algorithm and applied it to the spoken dialogue interface using sound field control. In the experiments, the robustness of the response sound elimination and the performance of speech recognition improved with the proposed method. These findings confirm the efficacy of the proposed method.

6. REFERENCES

[1] E. Hänsler, "Acoustic echo and noise control: where do we come from - where do we go?," in Proc. 7th International Workshop on Acoustic Echo and Noise Control, pp. 1-4, September 2001.

[2] S. Makino and S. Shimauchi, "Stereophonic acoustic echo cancellation - an overview and recent solutions," in Proc. The 1999 IEEE Workshop on Acoustic Echo and Noise Control, pp. 12-19, September 1999.

[3] W. Herbordt, J. Ying, H. Buchner, and W. Kellermann, "A real-time acoustic human-machine front-end for multimedia applications integrating robust adaptive beamforming and stereophonic acoustic echo cancellation," in Proc. 7th International Conf. on Spoken Language Processing, vol. 2, pp. 773-776, September 2002.

[4] Y. Hinamoto, K. Mino, H. Saruwatari, and K. Shikano, "Interface for barge-in free spoken dialogue system based on sound field control and microphone array," in Proc. 2003 IEEE International Conf. on Acoustics, Speech, and Signal Processing, vol. 5, pp. 505-508, April 2003.

[5] H. Sawada, R. Mukai, S. Araki, and S. Makino, "Polar coordinate based nonlinear function for frequency domain blind source separation," IEICE Trans. Fundamentals, vol. E86-A, no. 3, pp. 590-596, March 2003.

[6] N. Murata and S. Ikeda, "An On-line Algorithm for Blind Source Separation on Speech Signals," in Proc. 1998 International Symposium on Nonlinear Theory and its Applications, vol. 3, pp. 923-926, September 1998.

[7] S. Kurita, H. Saruwatari, S. Kajita, K. Takeda, and F. Itakura, "Evaluation of blind signal separation method using directivity pattern under reverberant conditions," in Proc. 2000 IEEE International Conf. on Acoustics, Speech, and Signal Processing, vol. 5, pp. 3140-3143, June 2000.

[8] A. Lee, T. Kawahara, and K. Shikano, "Julius - an open source real-time large vocabulary recognition engine," in Proc. 7th European Conf. on Speech Communication and Technology, vol. 3, pp. 1691-1694, September 2001.