Barge-in Free Spoken Dialogue Interface Based on Response Sound Cancellation Using Sound Field Control and Microphone Array


Shigeki Miyabe, Hiroshi Saruwatari, and Kiyohiro Shikano
Graduate School of Information Science, Nara Institute of Science and Technology
8916-5 Takayama-cho, Ikoma-shi, Nara, 630-0192, JAPAN (Email: [email protected])

1. INTRODUCTION

This paper describes a new small-scale interface for a barge-in free spoken dialogue system that combines multichannel sound field control and a microphone array. To implement a hands-free spoken dialogue system, it is indispensable to prevent the recognition performance from being degraded by the mixture of the user's speech and the system's response sound from the loudspeaker. To eliminate the response sound, an acoustic echo canceller is commonly used [1]. Many types of echo cancellers have been proposed, e.g., one integrated with a beamformer [2]. However, double-talk detection, which is difficult to realize precisely, is necessary to implement any acoustic echo canceller.

To address this problem of the acoustic echo canceller, one of the authors has proposed the Multiple-Output and Multiple-No-Input (MOMNI) method [3], which combines sound field control and a microphone array. To make the control stable, however, the MOMNI method requires many loudspeakers. To solve this problem of the MOMNI method, we introduce a new method that realizes silence at the microphone positions stably with fewer loudspeakers. The feasibility of the proposed algorithm is shown by experiment.

2. PROPOSED METHOD

The MOMNI method prevents the response sound from being input into the microphones by means of sound field control. Silent signals are reproduced at the microphone elements while the response sound signals are reproduced at the user's ears. To realize such a sound field, the MOMNI method designs an inverse filter of the room transfer functions [4].
However, since the MOMNI method must set control points at the user's ears as well as at the microphone elements, it requires many loudspeakers to control many points stably with the inverse filter. To decrease the number of loudspeakers, we propose a new filter design method that realizes a sound field where the response sound is canceled out at the control points. Since no control points other than the microphone elements are set, this method can control the sound field with fewer loudspeakers.

Figure 1 shows the configuration of the proposed method. The number of loudspeakers, M, and the number of microphone elements, K, must satisfy the condition M > K. The room transfer functions between each loudspeaker and each microphone are denoted by the K x M matrix G(ω). The monaural response sound signal is radiated from the loudspeakers after being processed by filters whose coefficients are given by the M-dimensional column vector B(ω). Then the observed signals corresponding to the response sound X(ω) are represented by the K-dimensional vector Y(ω) as follows:

$$Y(\omega) = G(\omega) B(\omega) X(\omega). \tag{1}$$

Therefore the following condition must be satisfied when the response sound is canceled out at the positions of the microphones:

$$G(\omega) B(\omega) = 0_K \quad \text{subject to} \quad \|B(\omega)\| = 1, \tag{2}$$

where 0_K is a K-dimensional column zero vector, and the norm of B(ω) is constrained to avoid trivial filter coefficients that output no signal. Equation (2) shows that B(ω) is orthogonal to all rows of G(ω). The set of such vectors is called the nullspace. Singular value decomposition can provide the vectors that span the nullspace of G(ω) in the form of eigenvectors corresponding to zero singular values. In brief, Eq. (2) can be satisfied with any B(ω) designed by arbitrarily combining these eigenvectors with an appropriate normalization. The number of eigenvectors is M - R(ω), where R(ω) is the rank of G(ω). At least one eigenvector exists because of the inequality M - R(ω) ≥ M - K > 0.
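The nullspace construction described above can be sketched with NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function name `nullspace_basis`, the tolerance `tol`, the sizes K = 2 and M = 5, and the random stand-in for the transfer matrix G(ω) at one frequency bin are all assumptions made for the example.

```python
import numpy as np

def nullspace_basis(G, tol=1e-10):
    """Return an M x (M - R) matrix whose orthonormal columns span the nullspace of G."""
    # Full SVD: G = U @ diag(s) @ Vh, where Vh has shape (M, M).
    _, s, Vh = np.linalg.svd(G, full_matrices=True)
    # Numerical rank R: singular values above a relative tolerance (assumed threshold).
    rank = int(np.sum(s > tol * s.max()))
    # Rows of Vh beyond the rank correspond to zero singular values;
    # their conjugate transposes are the nullspace vectors.
    return Vh[rank:].conj().T

rng = np.random.default_rng(0)
K, M = 2, 5  # hypothetical sizes satisfying M > K
# Random complex matrix standing in for measured room transfer functions.
G = rng.standard_normal((K, M)) + 1j * rng.standard_normal((K, M))

V_null = nullspace_basis(G)

# Any combination of the nullspace vectors, normalized to unit norm, satisfies Eq. (2).
c = rng.standard_normal(V_null.shape[1]) + 1j * rng.standard_normal(V_null.shape[1])
B = V_null @ c
B /= np.linalg.norm(B)
residual = np.linalg.norm(G @ B)  # numerically zero: no response sound at the microphones
```

With a full-rank random G, the basis has M - K = 3 columns, consistent with the inequality M - R(ω) ≥ M - K > 0.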
Although any combination of the nullspace eigenvectors satisfies the condition, filters selected randomly from the nullspace eigenvectors in the individual frequency bins are likely to distort the output because of the circular convolution effect. To design low-distortion filters, we select the vector from the nullspace that is nearest to the M-dimensional vector L = [1, ..., 1]^T. The algorithm first solves a least squares problem, and then normalizes the solution in order to satisfy the condition on the norm. The resultant solution is given by

$$B(\omega) = V_{\mathrm{null}}(\omega) V_{\mathrm{null}}^{H}(\omega) L \cdot \left( L^{H} V_{\mathrm{null}}(\omega) V_{\mathrm{null}}^{H}(\omega) L \right)^{-0.5}, \tag{3}$$

where V_null(ω) is an M x (M - R(ω)) matrix whose columns are the eigenvectors of G(ω) corresponding to zero singular values.

The input signals of the microphones are applied to delay-and-sum array signal processing to enhance the user's speech and suppress the response sound.

HSCMA, March 17-18, 2005, Rutgers University, Piscataway, New Jersey, USA.
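The filter design described in this section (least-squares fit of L within the nullspace, followed by normalization) might be sketched per frequency bin as below. This is an illustrative sketch under stated assumptions, not the authors' code: `design_filter`, the tolerance `tol`, the array sizes, and the random transfer matrices are hypothetical, and the nullspace basis is assumed to be orthonormal, as returned by an SVD.

```python
import numpy as np

def design_filter(G, tol=1e-10):
    """Unit-norm nullspace vector of G nearest (least squares) to L = [1, ..., 1]^T."""
    M = G.shape[1]
    _, s, Vh = np.linalg.svd(G, full_matrices=True)
    rank = int(np.sum(s > tol * s.max()))      # numerical rank R (assumed threshold)
    V_null = Vh[rank:].conj().T                # M x (M - R), orthonormal columns
    L = np.ones(M, dtype=complex)
    # With orthonormal columns, the least-squares solution is the projection V V^H L.
    proj = V_null @ (V_null.conj().T @ L)
    # L^H V V^H L equals the squared norm of the projection (real, non-negative).
    # It would be zero only if L were orthogonal to the nullspace; not handled here.
    scale = np.real(np.vdot(L, proj))
    return proj / np.sqrt(scale)

rng = np.random.default_rng(1)
F, K, M = 4, 2, 5  # hypothetical: 4 frequency bins, 2 microphones, 5 loudspeakers
G_bins = rng.standard_normal((F, K, M)) + 1j * rng.standard_normal((F, K, M))

# One filter vector B(omega) per frequency bin.
B_bins = np.stack([design_filter(G) for G in G_bins])

# Largest residual response at the microphones over all bins (numerically zero).
leak = max(np.linalg.norm(G @ B) for G, B in zip(G_bins, B_bins))
```

Pulling every bin toward the same target vector L is what keeps the selection consistent across frequency, in contrast to picking an arbitrary nullspace vector in each bin.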

Figure 1: Configuration of proposed method.

Figure 2: Experimental results for different loudspeaker and microphone conditions. Panels: (a) BRR with 5 loudspeakers, (b) BRR with 8 loudspeakers, (c) WA with 5 loudspeakers, (d) WA with 8 loudspeakers. Horizontal axis: number of microphone elements. Legend: acoustic echo canceller, MOMNI method, proposed method.

3. EXPERIMENTAL RESULTS

In this experiment, we evaluate the robustness of the proposed method against fluctuation of the transfer functions, compared with the acoustic echo canceller and the MOMNI method. We premise that the fluctuation of the transfer functions is caused by changes in the interference, i.e., a life-size mannequin. We measured 13 kinds of impulse responses: 12 patterns are states where the interference is placed in the room, and the remaining pattern is the state where the interference does not exist.

First, to evaluate the performance of response sound elimination, we calculate the barge-in reduction rate (BRR) as

$$\mathrm{BRR} = 10 \log_{10} \left( \frac{\sum_{\omega} |Y_{\mathrm{ear}}(\omega)|^{2}}{\sum_{\omega} |Y_{\mathrm{err}}(\omega)|^{2}} \right) \ [\mathrm{dB}], \tag{4}$$

where Y_ear(ω) is the response sound observed at the user's left ear, and Y_err(ω) is the error of response sound elimination.

The MOMNI method and the proposed method design their filters using the transfer functions before the fluctuations. We assume that the acoustic echo canceller can estimate the filter coefficients precisely, without error, under the ideal condition. We evaluate the performance of each method in the fluctuated environment. Secondly, the efficacy of the elimination of the response sound is evaluated with a large vocabulary continuous speech recognition task.
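As a rough illustration of the barge-in reduction rate defined above, the following sketch assumes BRR is the ratio, in decibels, of the summed squared magnitudes of the response sound at the ear to those of the elimination error. The function name `barge_in_reduction_rate` and the toy spectra are hypothetical and are not data from the experiment.

```python
import numpy as np

def barge_in_reduction_rate(Y_ear, Y_err):
    """BRR in dB: power of the response sound at the ear over power of the residual error."""
    power_ear = np.sum(np.abs(Y_ear) ** 2)  # summed over frequency bins
    power_err = np.sum(np.abs(Y_err) ** 2)
    return 10.0 * np.log10(power_ear / power_err)

# Toy spectra (assumed values): the residual has one tenth of the ear amplitude.
Y_ear = np.ones(8, dtype=complex)
Y_err = 0.1 * np.ones(8, dtype=complex)
brr_db = barge_in_reduction_rate(Y_ear, Y_err)  # amplitude ratio of 10, i.e. about 20 dB
```

A larger BRR means that less of the response sound remains at the microphones relative to what the user hears.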
To evaluate the speech recognition performance, we adopt the Word Accuracy (WA). Figure 2 illustrates the BRR and WA results with 5 or 8 loudspeakers. With 5 loudspeakers, the proposed method shows higher performance in both BRR and WA (see Fig. 2(a) and (c)). With 8 loudspeakers, the MOMNI method becomes robust, and there is no obvious improvement from the proposed method (see Fig. 2(b) and (d)). Thus the proposed method is highly beneficial for applications with a small number of loudspeakers.

4. CONCLUSIONS

We proposed a small-size barge-in free interface using sound cancellation. The experimental results show that the robustness of sound elimination is improved when the number of loudspeakers is relatively small. From these findings, the effectiveness of the proposed method is ascertained.

Acknowledgement

This work was partly supported by the CREST program "Advanced Media Technology for Everyday Living" of JST in Japan.

References

[1] B. H. Juang and F. K. Soong, "Hands-free telecommunications," Proc. International Workshop on Hands-Free Speech Communication 2001, pp.5-10, 2001.
[2] W. Herbordt, J. Ying, H. Buchner, and W. Kellermann, "A real-time acoustic human-machine front-end for multimedia applications integrating robust adaptive beamforming and stereophonic acoustic echo cancellation," Proc. ICSLP 2002, vol.2, pp.733-776, 2002.
[3] Y. Hinamoto, K. Mino, H. Saruwatari, and K. Shikano, "Interface for barge-in free spoken dialogue system based on sound field control and microphone array," Proc. ICASSP 2003, vol.V, pp.505-508, 2003.
[4] Y. Tatekura, H. Saruwatari, and K. Shikano, "An iterative inverse filter design method for the multichannel sound field reproduction system," IEICE Trans. Fundamentals, vol.E84-A, no.4, pp.991-998, 2001.
