Musical-Noise-Free Blind Speech Extraction Using ICA-Based Noise Estimation and Iterative Spectral Subtraction


The 11th International Conference on Information Sciences, Signal Processing and their Applications: Main Tracks

†Ryoichi Miyazaki, †Hiroshi Saruwatari, †Kiyohiro Shikano, ‡Kazunobu Kondo

†Nara Institute of Science and Technology, Nara, 630-0192 Japan
‡Yamaha Corporate Research & Development Center, Shizuoka, 438-0192 Japan

ABSTRACT

In this paper, we propose a new iterative signal extraction method using a microphone array that can be applied to nonstationary noise. In our previous study, it was found that optimized iterative spectral subtraction (SS) results in speech enhancement with almost no musical noise generation, but this method is valid only for stationary noise. The proposed method consists of iterative blind dynamic noise estimation by independent component analysis (ICA) and musical-noise-free speech extraction by modified iterative SS, where multiple iterative SS is applied to each channel while maintaining the multi-channel property reused for ICA. Also, related to the proposed method, we discuss the justification of applying ICA to such signals nonlinearly distorted by SS. From objective and subjective evaluations, we reveal that the proposed method outperforms the conventional method.

Index Terms: Speech enhancement, higher-order statistics, iterative spectral subtraction, microphone array

This work was supported by the MIC SCOPE, and JST Core Research of Evolution Science and Technology (CREST), Japan.

978-1-4673-0382-8/12/$31.00 ©2012 IEEE

1. INTRODUCTION

In recent studies, many applications of hands-free speech communication systems have been investigated, for which noise reduction is a problem requiring urgent attention. Spectral subtraction (SS) is a commonly used noise reduction method that has high noise reduction performance [1]. However, in this method, artificial distortion, so-called musical noise, arises owing to nonlinear signal processing, leading to a serious deterioration of sound quality.

To achieve high-quality noise reduction with low musical noise, an iterative SS method has been proposed [2, 3, 4]. This method is performed through signal processing in which weak SS processes are iteratively applied to the input signal. Also, some of the authors have reported the very interesting phenomenon that this method with appropriate parameters gives equilibrium behavior in the growth of higher-order statistics with increasing number of iterations [5]. This means that almost no musical noise is generated even with high noise reduction, which is one of the most desirable properties of single-channel nonlinear noise reduction methods.

In conventional iterative SS, however, it is assumed that the input noise signal is stationary, meaning that we can estimate the expectation of noise power spectral density from a time-frequency period of a signal that contains only noise. In contrast, under real-world acoustical environments, e.g., a nonstationary noise field, although it is necessary to dynamically estimate noise, this is very difficult. Therefore, in this paper, we propose a new iterative signal extraction method using a microphone array that can be applied to nonstationary noise. Our proposed method consists of iterative blind dynamic noise estimation by independent component analysis (ICA) [6] and musical-noise-free speech extraction by modified iterative SS, where multiple iterative SS is applied to each channel while maintaining the multi-channel property reused for ICA. Also, related to the proposed method, we discuss the justification of applying ICA to such signals nonlinearly distorted by SS. From objective and subjective evaluations, we reveal that the proposed method outperforms the conventional method.

2. RELATED WORKS

2.1. Conventional non-iterative SS [1]

We apply short-time Fourier analysis to the observed signal, which is a mixture of target speech and noise, to obtain the time-frequency signal. We formulate conventional non-iterative SS [1] in the time-frequency domain as follows:

\[
y(f,\tau) =
\begin{cases}
\sqrt{|x(f,\tau)|^2 - \beta\,E[|N|^2]}\;\exp\bigl(j\arg(x(f,\tau))\bigr) & (\text{if } |x(f,\tau)|^2 > \beta\,E[|N|^2]),\\
\eta\,x(f,\tau) & (\text{otherwise}),
\end{cases}
\tag{1}
\]

where $y(f,\tau)$ is the enhanced target speech signal, $x(f,\tau)$ is the observed signal, $f$ denotes the frequency subband, $\tau$ is the frame index, $\beta$ is the oversubtraction parameter, and $\eta$ is the flooring parameter. Here, $E[|N|^2]$ is the expectation of the random variable $|N|^2$ corresponding to the noise power spectra. In practice, we can approximate $E[|N|^2]$ by averaging the observed noise power spectra $|n(f,\tau)|^2$ in the first $K$-sample frames, where we assume the absence of speech in this period and noise stationarity. However, this often requires high-accuracy voice activity detection.

2.2. Iterative SS [2, 3, 4]

In an attempt to achieve high-quality noise reduction with low musical noise, an improved method based on iterative SS was proposed in previous studies [2, 3, 4]. This method is performed through signal processing, in which the following weak SS processes are recursively applied to the input signal (see Fig. 1).

(I) The average power spectrum of the input noise is estimated.
(II) The estimated noise prototype is then subtracted from the input with the parameters specifically set for weak subtraction, e.g., a large flooring parameter $\eta$ and a small subtraction parameter $\beta$.
(III) We then return to step (I) and substitute the resultant output (partially noise reduced signal) for the input signal.

2.3. Modeling of input signal

In this paper, we assume that the input signal $x$ in the power spectral domain is modeled using the gamma distribution as

\[
P(x) = \frac{x^{\alpha-1}\exp(-x/\theta)}{\Gamma(\alpha)\,\theta^{\alpha}},
\tag{2}
\]

where $x \ge 0$, $\alpha > 0$, and $\theta > 0$. Here, $\alpha$ is the shape parameter, $\theta$ is the scale parameter, and $\Gamma(\alpha)$ is the gamma function, defined as

\[
\Gamma(\alpha) = \int_0^{\infty} t^{\alpha-1}\exp(-t)\,dt.
\]
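The subtraction rule in (1) and the weak iterative loop of Sect. 2.2 can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names, the list-of-frames data layout, and the choice to re-estimate the noise power from the first `k_noise` frames of the current signal in every pass are assumptions made here for clarity.

```python
import cmath
import math


def spectral_subtract(x, noise_power, beta, eta):
    """Apply Eq. (1) to one complex time-frequency bin x.

    noise_power plays the role of E[|N|^2], beta is the oversubtraction
    parameter, and eta is the flooring parameter.
    """
    power = abs(x) ** 2
    if power > beta * noise_power:
        magnitude = math.sqrt(power - beta * noise_power)
        # Keep the phase of the observed signal, as in Eq. (1).
        return cmath.rect(magnitude, cmath.phase(x))
    return eta * x


def iterative_ss(frames, k_noise, beta, eta, n_iter):
    """Weak SS applied n_iter times (steps (I)-(III) of Sect. 2.2).

    frames: list of frames, each a list of complex bins.
    Assumption: the first k_noise frames contain noise only.
    """
    n_bins = len(frames[0])
    for _ in range(n_iter):
        # (I) average noise power per subband over the noise-only frames
        noise_power = [
            sum(abs(frames[t][f]) ** 2 for t in range(k_noise)) / k_noise
            for f in range(n_bins)
        ]
        # (II) weak subtraction; (III) the output becomes the next input
        frames = [
            [spectral_subtract(fr[f], noise_power[f], beta, eta)
             for f in range(n_bins)]
            for fr in frames
        ]
    return frames
```

For weak subtraction one would call `iterative_ss` with a small `beta` and a large `eta`, as step (II) suggests; the specific values are left to the caller.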

Fig. 1. Block diagram of iterative SS.

2.4. Mathematical metric of musical noise generation via higher-order statistics for non-iterative SS [7]

In this study, we apply the kurtosis ratio to a noise-only time-frequency period of the subject signal for the assessment of musical noise [7]. This measure is defined as

\[
\text{kurtosis ratio} = \mathrm{kurt}_{\mathrm{proc}} / \mathrm{kurt}_{\mathrm{org}},
\tag{3}
\]

where $\mathrm{kurt}_{\mathrm{proc}}$ is the kurtosis of the processed signal and $\mathrm{kurt}_{\mathrm{org}}$ is the kurtosis of the observed signal. Kurtosis is defined as

\[
\mathrm{kurt} = \mu_4 / \mu_2^2,
\tag{4}
\]

where $\mu_m$ is the $m$th-order moment, given by

\[
\mu_m = \int_0^{\infty} x^m P(x)\,dx,
\tag{5}
\]

and $P(x)$ is the probability density function (p.d.f.) of a power-spectral-domain signal $x$. A kurtosis ratio of unity corresponds to no musical noise. This measure increases as the amount of generated musical noise increases. The $m$th-order moment after SS, $\mu_m$, is given by [5]

\[
\mu_m = \theta_n^m\,\mathcal{M}(\alpha_n, \beta, \eta, m),
\tag{6}
\]

where $\theta_n$ is the noise scale parameter, $\alpha_n$ is the noise shape parameter, and

\[
\mathcal{M}(\alpha_n, \beta, \eta, m) = \mathcal{S}(\alpha_n, \beta, m) + \eta^{2m}\,\mathcal{F}(\alpha_n, \beta, m),
\tag{7}
\]

\[
\mathcal{S}(\alpha_n, \beta, m) = \sum_{l=0}^{m} (-\beta\alpha_n)^{l}\,
\frac{\Gamma(m+1)\,\Gamma(\alpha_n + m - l,\ \beta\alpha_n)}{\Gamma(\alpha_n)\,\Gamma(l+1)\,\Gamma(m-l+1)},
\tag{8}
\]

\[
\mathcal{F}(\alpha_n, \beta, m) = \frac{\gamma(\alpha_n + m,\ \beta\alpha_n)}{\Gamma(\alpha_n)}.
\tag{9}
\]

Here, $\Gamma(a, b)$ and $\gamma(a, b)$ are the upper and lower incomplete gamma functions, defined as $\Gamma(a, b) = \int_b^{\infty} t^{a-1}\exp(-t)\,dt$ and $\gamma(a, b) = \int_0^{b} t^{a-1}\exp(-t)\,dt$, respectively. From (4), (6), and (7), the kurtosis after SS can be expressed as

\[
\mathrm{kurt} = \mathcal{M}(\alpha_n, \beta, \eta, 4) / \mathcal{M}^2(\alpha_n, \beta, \eta, 2).
\tag{10}
\]

Using (3) and (10), we also express the kurtosis ratio as

\[
\text{kurtosis ratio} =
\frac{\mathcal{M}(\alpha_n, \beta, \eta, 4) / \mathcal{M}^2(\alpha_n, \beta, \eta, 2)}
     {\mathcal{M}(\alpha_n, 0, 0, 4) / \mathcal{M}^2(\alpha_n, 0, 0, 2)}.
\tag{11}
\]

2.5. Musical-noise-free speech enhancement [8]

In [8], we have proposed musical-noise-free noise reduction, where no musical noise is generated even for a high SNR in iterative SS. In that study, first, some of the authors discovered, via mathematical analysis based on (11), the interesting phenomenon that the kurtosis ratio sometimes does not change even after SS [5]. This indicates that the kurtosis ratio can be maintained at unity even after iteratively applying SS to improve the NRR, and thus no musical noise is generated. Following this finding, the authors have derived the optimal parameters satisfying the musical-noise-free condition by finding a fixed-point status in the kurtosis ratio, i.e., by solving $\mathcal{M}(\alpha_n, 0, 0, 4)/\mathcal{M}^2(\alpha_n, 0, 0, 2) = \mathcal{M}(\alpha_n, \beta, \eta, 4)/\mathcal{M}^2(\alpha_n, \beta, \eta, 2)$ [8]. Given the noise shape parameter $\alpha_n$, we can choose combinations of the oversubtraction parameter $\beta$ and the flooring parameter $\eta$ that simultaneously satisfy the musical-noise-free condition using the following equation:

\[
\begin{aligned}
\eta^4 ={}& \bigl\{\mathcal{F}(\alpha_n,\beta,4)(\alpha_n+1)\alpha_n - \mathcal{F}^2(\alpha_n,\beta,2)(\alpha_n+3)(\alpha_n+2)\bigr\}^{-1}\\
&\cdot\Bigl[\,\mathcal{S}(\alpha_n,\beta,2)\,\mathcal{F}(\alpha_n,\beta,2)(\alpha_n+3)(\alpha_n+2)\\
&\quad\pm\Bigl(\bigl\{\mathcal{S}(\alpha_n,\beta,2)\,\mathcal{F}(\alpha_n,\beta,2)(\alpha_n+3)(\alpha_n+2)\bigr\}^2\\
&\qquad-\bigl\{\mathcal{F}(\alpha_n,\beta,4)(\alpha_n+1)\alpha_n - \mathcal{F}^2(\alpha_n,\beta,2)(\alpha_n+3)(\alpha_n+2)\bigr\}\\
&\qquad\ \cdot\bigl\{\mathcal{S}(\alpha_n,\beta,4)(\alpha_n+1)\alpha_n - \mathcal{S}^2(\alpha_n,\beta,2)(\alpha_n+3)(\alpha_n+2)\bigr\}\Bigr)^{1/2}\Bigr].
\end{aligned}
\tag{12}
\]

Also, as a measure of noise reduction performance, the noise reduction rate (NRR), the output signal-to-noise ratio (SNR) minus the input SNR in dB, can be given in terms of a 1st-order moment as [5]

\[
\mathrm{NRR} = 10\log_{10}\{\alpha_n / \mathcal{M}(\alpha_n, \beta, \eta, 1)\}.
\tag{13}
\]

Figure 2 shows an example of the kurtosis ratio in optimized iterative SS, where Gaussian noise is assumed. We can confirm the flat trace of the kurtosis ratio, indicating no musical noise generation.

Fig. 2. Relation between NRR and kurtosis ratio obtained from theoretical analysis for Gaussian noise case. (Horizontal axis: noise reduction rate [dB]; vertical axis: kurtosis ratio; curves: non-iterative SS and iterative SS with $\eta = 0.9$.)

3. PROPOSED METHOD: EXTENSION TO MICROPHONE ARRAY SIGNAL PROCESSING

3.1. Conventional blind spatial subtraction array

In the previous section, we assumed that the input noise signal is stationary, meaning that we can estimate the expectation of a noise signal from a time-frequency period of a signal that contains only noise, i.e., speech absence. However, in actual environments, e.g., a nonstationary noise field, it is necessary to dynamically estimate the noise power spectral density.
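Returning briefly to the moment expressions of Sect. 2.4 and 2.5, the quantities in (7)-(13) can be evaluated numerically. The following is a minimal sketch that uses only the Python standard library. The power-series evaluation of the lower incomplete gamma function, the function names, and the reading of the incomplete gamma arguments as (shape, integration limit) are assumptions of this sketch, not something the paper specifies.

```python
import math


def lower_gamma(a, x):
    """Lower incomplete gamma: integral of t^(a-1) e^(-t) over [0, x], a > 0.

    Evaluated with the standard power series; adequate for moderate x.
    """
    if x <= 0.0:
        return 0.0
    term = 1.0 / a
    total = term
    for k in range(1, 1000):
        term *= x / (a + k)
        total += term
        if term < 1e-17 * total:
            break
    return total * math.exp(-x) * x ** a


def upper_gamma(a, x):
    """Upper incomplete gamma as the complement of lower_gamma."""
    return math.gamma(a) - lower_gamma(a, x)


def S(alpha, beta, m):
    """Eq. (8); math.comb(m, l) equals Gamma(m+1)/(Gamma(l+1)Gamma(m-l+1))."""
    x = beta * alpha
    return sum(math.comb(m, l) * (-x) ** l * upper_gamma(alpha + m - l, x)
               for l in range(m + 1)) / math.gamma(alpha)


def F(alpha, beta, m):
    """Eq. (9)."""
    return lower_gamma(alpha + m, beta * alpha) / math.gamma(alpha)


def M(alpha, beta, eta, m):
    """Eq. (7)."""
    return S(alpha, beta, m) + eta ** (2 * m) * F(alpha, beta, m)


def kurtosis_ratio(alpha, beta, eta):
    """Eq. (11)."""
    after = M(alpha, beta, eta, 4) / M(alpha, beta, eta, 2) ** 2
    before = M(alpha, 0.0, 0.0, 4) / M(alpha, 0.0, 0.0, 2) ** 2
    return after / before


def nrr_db(alpha, beta, eta):
    """Eq. (13)."""
    return 10.0 * math.log10(alpha / M(alpha, beta, eta, 1))


def musical_noise_free_etas(alpha, beta):
    """Positive real solutions eta of Eq. (12) for a given alpha and beta."""
    A = (alpha + 1) * alpha
    B = (alpha + 3) * (alpha + 2)
    s2, s4 = S(alpha, beta, 2), S(alpha, beta, 4)
    f2, f4 = F(alpha, beta, 2), F(alpha, beta, 4)
    a = f4 * A - f2 ** 2 * B
    b = s2 * f2 * B
    c = s4 * A - s2 ** 2 * B
    disc = b * b - a * c
    if a == 0.0 or disc < 0.0:
        return []
    roots = [(b + sign * math.sqrt(disc)) / a for sign in (1.0, -1.0)]
    return [h ** 0.25 for h in roots if h > 0.0]
```

As a usage example, `musical_noise_free_etas(1.0, 2.0)` returns the flooring parameter(s) that keep the kurtosis ratio at unity for exponentially distributed noise power (shape parameter 1) and an oversubtraction parameter of 2, and `nrr_db` then gives the corresponding noise reduction per SS pass. The pair (1.0, 2.0) is an illustrative choice, not a setting taken from the paper.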

To address this need for dynamic noise estimation, we previously proposed blind spatial subtraction array (BSSA) [9], which involves accurate noise estimation by ICA followed by a speech extraction procedure based on SS (see Fig. 3). BSSA improves the noise reduction performance, particularly in the presence of both diffuse and nonstationary noises; thus, almost all the environmental noise can be dealt with. However, BSSA always suffers from musical noise owing to SS. In addition, the output signal of BSSA degenerates to a monaural (not multi-channel) signal, meaning that ICA cannot be reapplied; thus, we cannot iteratively estimate the noise power spectra. Therefore, it is impossible to directly apply iterative SS to the conventional BSSA.

Fig. 3. Block diagram of conventional BSSA [9].

3.2. Iterative blind spatial subtraction array

In this section, we propose a new multi-iterative blind signal extraction method integrating iterative blind noise estimation by ICA and iterative noise reduction by SS. As mentioned previously, the conventional BSSA cannot iteratively and accurately estimate noise by ICA because the conventional BSSA performs a delay and sum (DS) operation before SS. To solve this problem, we propose a new BSSA structure that performs multiple independent SS in each channel before DS; we call this structure channel-wise SS. Using this structure, we can equalize the number of channels of the observed signal to that of the signals after channel-wise SS. Therefore, we can iteratively apply noise estimation by ICA and speech extraction by SS (see Fig. 4). Also, the advantage of the proposed structure is that ICA has the possibility of adaptively estimating the distorted wavefront of a speech signal to some extent even after SS, because ICA is a blind signal identification method that does not require knowledge of the target signal direction. Details of this issue will be discussed in Sect. 3.3. Hereafter, we refer to this proposed BSSA as iterative BSSA.

Fig. 4. Block diagram of proposed iterative BSSA.

We conduct iterative BSSA in the following manner, where the superscript $[i]$ represents the value in the $i$th iteration of SS (initially $i = 0$).

(I) The observed signal vector of the $K$-channel array in the time-frequency domain, $\mathbf{x}^{[0]}(f,\tau)$, is given by

\[
\mathbf{x}^{[0]}(f,\tau) = \mathbf{h}(f)\,s(f,\tau) + \mathbf{n}(f,\tau),
\tag{14}
\]

where $\mathbf{h}(f) = [h_1(f), h_2(f), \ldots, h_K(f)]^{\mathrm{T}}$ is a column vector of the transfer functions from the target signal position to each microphone, $s(f,\tau)$ is the target speech signal, and $\mathbf{n}(f,\tau)$ is a column vector of the additive noise.

(II) Next, we perform signal separation using ICA as [6]

\[
\mathbf{o}^{[i]}(f,\tau) = \mathbf{W}_{\mathrm{ICA}}^{[i]}(f)\,\mathbf{x}^{[i]}(f,\tau),
\tag{15}
\]

\[
\mathbf{W}_{\mathrm{ICA}}^{[i][p+1]}(f) = \mu\Bigl[\mathbf{I} - \bigl\langle \boldsymbol{\varphi}\bigl(\mathbf{o}^{[i]}(f,\tau)\bigr)\bigl(\mathbf{o}^{[i]}(f,\tau)\bigr)^{\mathrm{H}} \bigr\rangle_{\tau}\Bigr]\mathbf{W}_{\mathrm{ICA}}^{[i][p]}(f) + \mathbf{W}_{\mathrm{ICA}}^{[i][p]}(f),
\tag{16}
\]

where $\mathbf{W}_{\mathrm{ICA}}^{[i][p]}(f)$ is a demixing matrix, $\mu$ is the step-size parameter, $[p]$ is used to express the value of the $p$th step in the ICA iterations, $\mathbf{I}$ is the identity matrix, $\langle\cdot\rangle_{\tau}$ denotes a time-averaging operator, and $\boldsymbol{\varphi}(\cdot)$ is an appropriate nonlinear vector function. Then, we construct a noise-only vector,

\[
\mathbf{o}_{\mathrm{noise}}^{[i]}(f,\tau) = \bigl[o_1^{[i]}(f,\tau), \ldots, o_{U-1}^{[i]}(f,\tau),\ 0,\ o_{U+1}^{[i]}(f,\tau), \ldots, o_K^{[i]}(f,\tau)\bigr]^{\mathrm{T}},
\tag{17}
\]

where $U$ is the signal number for speech, and we apply the projection back operation to remove the ambiguity of the amplitude and construct the estimated noise signal, $\mathbf{z}^{[i]}(f,\tau)$, as

\[
\mathbf{z}^{[i]}(f,\tau) = \mathbf{W}_{\mathrm{ICA}}^{[i]}(f)^{-1}\,\mathbf{o}_{\mathrm{noise}}^{[i]}(f,\tau).
\tag{18}
\]

(III) Next, we perform SS independently in each input channel and derive the multiple target-speech-enhanced signals. This procedure can be given by

\[
x_k^{[i+1]}(f,\tau) =
\begin{cases}
\sqrt{|x_k^{[i]}(f,\tau)|^2 - \beta\,|z_k^{[i]}(f,\tau)|^2}\;\exp\bigl(j\arg(x_k^{[i]}(f,\tau))\bigr) & (\text{if } |x_k^{[i]}(f,\tau)|^2 > \beta\,|z_k^{[i]}(f,\tau)|^2),\\
\eta\,x_k^{[i]}(f,\tau) & (\text{otherwise}),
\end{cases}
\tag{19}
\]

where $x_k^{[i+1]}(f,\tau)$ is the target-speech-enhanced signal obtained by SS at a specific channel $k$. Then we return to step (II) with $\mathbf{x}^{[i+1]}(f,\tau)$. When we obtain sufficient noise reduction performance, we go to step (IV).

(IV) Finally, we obtain the resultant target-speech-enhanced signal by applying DS to $\mathbf{x}^{[*]}(f,\tau)$, where $*$ is the number of iterations after which sufficient noise reduction performance is obtained. This procedure can be expressed by

\[
y(f,\tau) = \mathbf{w}_{\mathrm{DS}}^{\mathrm{T}}(f)\,\mathbf{x}^{[*]}(f,\tau),
\tag{20}
\]

\[
\mathbf{w}_{\mathrm{DS}}(f) = \bigl[w_1^{(\mathrm{DS})}(f), \ldots, w_K^{(\mathrm{DS})}(f)\bigr]^{\mathrm{T}},
\tag{21}
\]

\[
w_k^{(\mathrm{DS})}(f) = \frac{1}{K}\exp\bigl(-2\pi j\,(f/N)\,f_s\,d_k \sin\theta_U / c\bigr),
\tag{22}
\]

\[
\theta_U = \sin^{-1}\!\left(\frac{\arg\!\left(\dfrac{[\mathbf{W}_{\mathrm{ICA}}(f)^{-1}]_{kU}}{[\mathbf{W}_{\mathrm{ICA}}(f)^{-1}]_{k'U}}\right)}{2\pi f_s c^{-1}(d_k - d_{k'})}\right),
\tag{23}
\]

where $y(f,\tau)$ is the final output signal of iterative BSSA, $\mathbf{w}_{\mathrm{DS}}$ is the filter coefficient vector of DS, $N$ is the DFT size, $f_s$ is the sampling frequency, $d_k$ is the microphone position, $c$ is the sound velocity, and $\theta_U$ is the estimated direction of arrival of the target speech. Moreover, $[\mathbf{A}]_{lj}$ represents the entry of $\mathbf{A}$ in the $l$th row and $j$th column.

3.3. Accuracy of wavefront estimated by ICA after SS

In this subsection, we discuss the accuracy of the estimated noise signal in each iteration of iterative BSSA. In actual environments, not only point-source noise but also non-point-source (e.g., diffuse) noise often exists. It is known that ICA is proficient in noise estimation rather than speech estimation under such a noise condition [9]. This is because the target speech can be regarded as a point-source signal (thus, the wavefront is static in each subband) and ICA acts as an effective blocking filter of the speech wavefront even in a time-invariant manner, resulting in good noise estimation. However, in iterative BSSA, we should address the inherent question of whether the distorted speech wavefront after nonlinear noise reduction such as SS can be blocked by ICA or not, that is, whether the speech component after channel-wise SS can become a point source again or not.

Hereafter, we quantify the degree of point-source-likeness for SS-applied speech signals. For convenience of discussion, a simple two-channel array model is assumed. First, we define the speech component in each channel after channel-wise SS as

\[
\hat{s}_1(f,\tau) = h_1(f)\,s(f,\tau) + \Delta s_1(f,\tau),
\tag{24}
\]

\[
\hat{s}_2(f,\tau) = h_2(f)\,s(f,\tau) + \Delta s_2(f,\tau),
\tag{25}
\]

where $s(f,\tau)$ is the original point-source speech signal, $\hat{s}_k(f,\tau)$ is the speech component after channel-wise SS at the $k$th channel, and $\Delta s_k(f,\tau)$ is the speech component distorted by channel-wise SS. Also, we assume that $s(f,\tau)$, $\Delta s_1(f,\tau)$, and $\Delta s_2(f,\tau)$ are uncorrelated with each other. Obviously, $\hat{s}_1(f,\tau)$ and $\hat{s}_2(f,\tau)$ can be regarded as being generated by a point source if $\Delta s_1(f,\tau)$ and $\Delta s_2(f,\tau)$ are zero, i.e., a valid static blocking filter can be obtained by ICA as

\[
[\mathbf{W}_{\mathrm{ICA}}(f)]_{11}\,\hat{s}_1(f,\tau) + [\mathbf{W}_{\mathrm{ICA}}(f)]_{12}\,\hat{s}_2(f,\tau)
= \bigl([\mathbf{W}_{\mathrm{ICA}}(f)]_{11}\,h_1(f) + [\mathbf{W}_{\mathrm{ICA}}(f)]_{12}\,h_2(f)\bigr)\,s(f,\tau) = 0,
\tag{26}
\]

where we assume $U = 1$ and, e.g., $[\mathbf{W}_{\mathrm{ICA}}(f)]_{11} = h_2(f)$ and $[\mathbf{W}_{\mathrm{ICA}}(f)]_{12} = -h_1(f)$. However, if $\Delta s_1(f,\tau)$ and $\Delta s_2(f,\tau)$ become nonzero as a result of SS, ICA does not have a valid speech blocking filter with a static (time-invariant) form.

Second, the cosine distance between the speech power spectra $|\hat{s}_1(f,\tau)|^2$ and $|\hat{s}_2(f,\tau)|^2$ is introduced in each frequency subband to indicate the degree of point-source-likeness, as

\[
\mathrm{COS}(f) = \frac{\sum_{\tau} |\hat{s}_1(f,\tau)|^2\,|\hat{s}_2(f,\tau)|^2}
{\sqrt{\sum_{\tau} |\hat{s}_1(f,\tau)|^4}\ \sqrt{\sum_{\tau} |\hat{s}_2(f,\tau)|^4}}.
\tag{27}
\]

From (27), the cosine distance reaches its maximum value of unity if and only if $\Delta s_1(f,\tau) = \Delta s_2(f,\tau) = 0$, regardless of the values of $h_1(f)$ and $h_2(f)$, meaning that the SS-applied speech signals $\hat{s}_1(f,\tau)$ and $\hat{s}_2(f,\tau)$ can be assumed to be produced by the point source. The value of $\mathrm{COS}(f)$ decreases with increasing magnitudes of $\Delta s_1(f,\tau)$ and $\Delta s_2(f,\tau)$ as well as the difference between $h_1(f)$ and $h_2(f)$; this indicates the non-point-source state.

Third, we evaluate the degree of point-source-likeness in each iteration of iterative BSSA by using $\mathrm{COS}(f)$. We statistically estimate the distorted speech component of the enhanced signal in each iteration. Here, we assume that the original speech power spectrum $|s(f,\tau)|^2$ obeys a gamma distribution with a shape parameter of 0.1 (this is a typical value for speech) as

\[
|s(f,\tau)|^2 \sim \frac{x^{-0.9}\exp(-x/\theta_s)}{\Gamma(0.1)\,\theta_s^{0.1}},
\tag{28}
\]

where $\theta_s$ is the speech scale parameter. Regarding the amount of noise to be subtracted, the 1st-order moment of the noise power spectra is equal to $\theta_n\alpha_n$ when the number of iterations, $i$, equals zero. Also, the value of $\alpha_n$ does not change in each iteration when we use the specific parameters $\beta$ and $\eta$ that satisfy the musical-noise-free condition because the kurtosis ratio does not change in each iteration. If we perform SS only once, the rate of noise decrease is given by

\[
\mathcal{M}(\alpha_n, \beta, \eta, 1) / \alpha_n,
\tag{29}
\]

and thus the amount of residual noise after the $i$th iteration is given by

\[
\theta_n\alpha_n\{\mathcal{M}(\alpha_n, \beta, \eta, 1)/\alpha_n\}^{i} = \theta_n\,\mathcal{M}^{i}(\alpha_n, \beta, \eta, 1)\,\alpha_n^{1-i}.
\tag{30}
\]

Next, we assume that speech and noise are disjoint, i.e., there are no overlaps in the time-frequency domain, and that speech distortion is caused by subtracting the average noise from the pure speech component. Thus, the speech component $|\hat{s}_k^{[i+1]}(f,\tau)|^2$ at the $k$th channel after the $i$th iteration is represented by subtracting the amount of residual noise (30) as

\[
|\hat{s}_k^{[i+1]}(f,\tau)|^2 =
\begin{cases}
|\hat{s}_k^{[i]}(f,\tau)|^2 - \beta\,\theta_n\,\mathcal{M}^{i}(\alpha_n,\beta,\eta,1)\,\alpha_n^{1-i} & (\text{if } |\hat{s}_k^{[i]}(f,\tau)|^2 > \beta\,\theta_n\,\mathcal{M}^{i}(\alpha_n,\beta,\eta,1)\,\alpha_n^{1-i}),\\
\eta^2\,|\hat{s}_k^{[i]}(f,\tau)|^2 & (\text{otherwise}).
\end{cases}
\tag{31}
\]

Here, we define the input SNR as the average of both channel SNRs,

\[
\mathrm{ISNR}(f) = \frac{1}{2}\left(\frac{0.1\,|h_1(f)|^2\,\theta_s}{\alpha_n\theta_n} + \frac{0.1\,|h_2(f)|^2\,\theta_s}{\alpha_n\theta_n}\right)
= \frac{0.1\,\theta_s\bigl(|h_1(f)|^2 + |h_2(f)|^2\bigr)}{2\,\alpha_n\theta_n}.
\tag{32}
\]

If we normalize the speech scale parameter $\theta_s$ to unity, from (32), the noise scale parameter $\theta_n$ is given by

\[
\theta_n = \frac{0.1\bigl(|h_1(f)|^2 + |h_2(f)|^2\bigr)}{2\,\alpha_n\,\mathrm{ISNR}(f)},
\tag{33}
\]

and using (33), we can reformulate (31) as

\[
|\hat{s}_k^{[i+1]}(f,\tau)|^2 =
\begin{cases}
|\hat{s}_k^{[i]}(f,\tau)|^2 - \beta\,\dfrac{0.1\bigl(|h_1(f)|^2 + |h_2(f)|^2\bigr)}{2\,\mathrm{ISNR}(f)}\,\mathcal{M}^{i}(\alpha_n,\beta,\eta,1)\,\alpha_n^{-i} & \Bigl(\text{if } |\hat{s}_k^{[i]}(f,\tau)|^2 > \beta\,\dfrac{0.1(|h_1(f)|^2 + |h_2(f)|^2)}{2\,\mathrm{ISNR}(f)}\,\mathcal{M}^{i}(\alpha_n,\beta,\eta,1)\,\alpha_n^{-i}\Bigr),\\
\eta^2\,|\hat{s}_k^{[i]}(f,\tau)|^2 & (\text{otherwise}).
\end{cases}
\tag{34}
\]

Furthermore, we define the transfer function ratio (TFR) as

\[
\mathrm{TFR}(f) = |h_1(f)/h_2(f)|^2,
\tag{35}
\]

and if we normalize $|h_1(f)|^2$ to unity in each frequency subband, $|h_1(f)|^2 + |h_2(f)|^2$ becomes $1 + 1/\mathrm{TFR}(f)$. Finally, we express (34) in terms of the input SNR $\mathrm{ISNR}(f)$ and the transfer function ratio $\mathrm{TFR}(f)$ as

\[
|\hat{s}_k^{[i+1]}(f,\tau)|^2 =
\begin{cases}
|\hat{s}_k^{[i]}(f,\tau)|^2 - \beta\,\dfrac{0.1\bigl(1 + 1/\mathrm{TFR}(f)\bigr)}{2\,\mathrm{ISNR}(f)}\,\mathcal{M}^{i}(\alpha_n,\beta,\eta,1)\,\alpha_n^{-i} & \Bigl(\text{if } |\hat{s}_k^{[i]}(f,\tau)|^2 > \beta\,\dfrac{0.1(1 + 1/\mathrm{TFR}(f))}{2\,\mathrm{ISNR}(f)}\,\mathcal{M}^{i}(\alpha_n,\beta,\eta,1)\,\alpha_n^{-i}\Bigr),\\
\eta^2\,|\hat{s}_k^{[i]}(f,\tau)|^2 & (\text{otherwise}).
\end{cases}
\tag{36}
\]
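The statistical evaluation built from (27), (28), and (36) can be imitated with a small Monte Carlo sketch. Everything below is an illustration under stated assumptions rather than the procedure used in the paper: the function names are hypothetical, the speech power is drawn with `random.gammavariate` (shape 0.1, scale 1), the input SNR is converted from dB to a linear ratio, and `m1` stands for the value of $\mathcal{M}(\alpha_n, \beta, \eta, 1)$, which the caller is assumed to supply.

```python
import math
import random

SPEECH_SHAPE = 0.1  # shape parameter of the speech power model in (28)


def subtraction_amount(i, beta, alpha_n, m1, isnr, tfr):
    """Amount subtracted from the speech power at iteration i, as in (36).

    isnr is a linear power ratio (not dB); m1 is M(alpha_n, beta, eta, 1).
    """
    return (beta * SPEECH_SHAPE * (1.0 + 1.0 / tfr)
            * m1 ** i * alpha_n ** (-i) / (2.0 * isnr))


def ss_on_speech_power(p, amount, eta):
    """One branch decision of (36) for a single power-domain sample."""
    return p - amount if p > amount else eta ** 2 * p


def cosine_distance(p1, p2):
    """Eq. (27) for two sequences of speech power values."""
    num = sum(a * b for a, b in zip(p1, p2))
    den = math.sqrt(sum(a * a for a in p1)) * math.sqrt(sum(b * b for b in p2))
    return num / den


def simulate_cos(tfr, isnr_db, n_iter, beta, eta, alpha_n, m1,
                 n_samples=2000, seed=0):
    """Cosine distance after n_iter channel-wise SS passes (two channels)."""
    rng = random.Random(seed)
    isnr = 10.0 ** (isnr_db / 10.0)
    speech = [rng.gammavariate(SPEECH_SHAPE, 1.0) for _ in range(n_samples)]
    p1 = list(speech)               # |h1|^2 normalized to 1
    p2 = [s / tfr for s in speech]  # |h2|^2 = 1 / TFR
    for i in range(n_iter):
        amount = subtraction_amount(i, beta, alpha_n, m1, isnr, tfr)
        p1 = [ss_on_speech_power(p, amount, eta) for p in p1]
        p2 = [ss_on_speech_power(p, amount, eta) for p in p2]
    return cosine_distance(p1, p2)
```

A call such as `simulate_cos(0.4, 0.0, 3, 8.5, 0.9, 0.2, m1=0.1)` would then estimate the point-source-likeness for a low TFR and a 0 dB input SNR; the value `m1=0.1` is a placeholder, not a figure from the paper. When the TFR equals 1, both channels undergo identical processing and the cosine distance stays at unity.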

As can be seen from (36), the speech component is subjected to greater subtraction and distortion as $\mathrm{ISNR}(f)$ and/or $\mathrm{TFR}(f)$ decrease.

Figure 5 shows the relation between the TFR and the corresponding value of $\mathrm{COS}(f)$ calculated by (27) and (36). In Fig. 5, we plot the average of $\mathrm{COS}(f)$ over the whole frequency subbands. The noise shape parameter $\alpha_n$ is set to 0.2 with the assumption of super-Gaussian noise (this corresponds to the real noises used in Sect. 4), the input SNR is set to 10 dB, 5 dB, or 0 dB, and the noise scale parameter $\theta_n$ is uniquely determined by (33) and the previous parameter settings. The TFR is set from 0.4 to 1.0 ($|h_1(f)|$ is fixed to 1.0). Note that the TFR is highly correlated to the room reverberation and the interelement spacing of the microphone array; we determined the range of the TFR by simulating a typical moderately reverberant room and the array with 2.15 cm interelement spacing used in Sect. 4 (see the example of the TFR in Fig. 6). For the internal parameters used in iterative BSSA in this simulation, $\beta$ and $\eta$ are 8.5 and 0.9, respectively, which satisfy the musical-noise-free condition. In addition, the smallest value on the horizontal axis is 3 dB in Fig. 5 because DS is still performed even when $i = 0$.

Fig. 5. Relation between number of iterations of iterative BSSA and cosine distance. Input SNR is (a) 10 dB, (b) 5 dB, and (c) 0 dB. (Horizontal axis: noise reduction rate [dB], with the iteration numbers $i = 1, \ldots, 7$ marked; vertical axis: cosine distance; one curve per value of $\mathrm{TFR}(f)$ from 0.4 to 1.0.)

From Figs. 5(a) and (b), which correspond to relatively high input SNRs, we can confirm that the degree of point-source-likeness, i.e., $\mathrm{COS}(f)$, is almost maintained when the TFR is close to 1 even if the speech components are distorted by iterative BSSA. Also, it is worth mentioning that the degree of point-source-likeness is still above 0.9 even when the TFR is decreased to 0.4 and $i$ is increased to 6. This means that almost 90% of the speech components can be regarded as a point source and thus can be blocked by ICA. In contrast, from Fig. 5(c), which shows the case of a low input SNR, when the TFR is dropped to 0.4 and $i$ is more than 3, the degree of point-source-likeness is lower than 0.6. Thus, less than 60% of the speech components can be regarded as a point source. However, this is a worst-case scenario; actually $\mathrm{TFR}(f)$ in almost all frequency subbands is more than 0.6 except for in the specific subband with the lowest TFR (see Fig. 6), meaning that almost 70% of the speech components can be regarded as a point source. From the above-mentioned discussion, we have justified the use of ICA for dynamic noise estimation in iterative BSSA, particularly when we limit the maximum number of iterations to, e.g., 3-5.

Fig. 6. Typical example of $\mathrm{TFR}(f)$ ($|h_1(f)/h_2(f)|^2$) in each frequency subband. (Horizontal axis: frequency [Hz], 0 to 8000; vertical axis: TFR.)

4. EVALUATION EXPERIMENTS AND RESULTS

4.1. Experimental conditions

We conducted objective and subjective evaluation experiments to confirm the validity of the proposed method. We used an eight-element microphone array with an interelement spacing of 2.15 cm, and the direction of the target speech was set to be normal to the array. The size of the experimental room was 4.2 x 3.5 x 3.0 m³ and the reverberation time was approximately 200 ms. All the signals used in this experiment were sampled at 16 kHz with 16-bit accuracy. The observed signal consisted of the target speech signal of six speakers (three males and three females) and four types of real diffuse noise (museum noise, railway station noise, traffic noise, and entrance hall noise [10]) emitted from eight surrounding loudspeakers. The input SNR was 0 dB. The FFT size was 1024, and the frame shift length was 256.

4.2. Objective evaluation

First, Fig. 7 shows a typical example of behaviors for the proposed iterative BSSA under the museum noise condition. In this figure, we depict the values of the kurtosis ratio defined by (3) and cepstral distortion [11] at each iteration; these scores indicate the amount of musical noise generated and speech distortion, respectively. The kurtosis ratio trace for the proposed iterative BSSA in Fig. 7(a) is almost flat and close to 1.0, meaning that musical-noise-free noise reduction is achieved. However, as shown in Fig. 7(b), the speech distortion increases as the number of iterations and NRR increase. Thus, we should carefully set the maximum number of iterations.

Next, we compare iterative BSSA with the conventional BSSA under the same NRR condition, where the NRR is set to approximately 8 dB. Figure 8 shows the kurtosis ratio and cepstral distortion obtained from the experiment with real noisy speech data for each noise. From Fig. 8, we can confirm that iterative BSSA outperforms the conventional BSSA in terms of the kurtosis ratio. In particular, the kurtosis ratio for the proposed method is close to 1.0. This means that the proposed iterative method did not generate any musical noise. However, iterative BSSA leads to greater speech distortion. Therefore, a trade-off exists between the amount of musical noise generation and speech distortion in the conventional BSSA and iterative BSSA.
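The two objective scores built on definitions given in this paper, the kurtosis ratio of (3)-(5) and the NRR, can also be estimated directly from data. The sketch below is a hedged illustration with hypothetical function names: it replaces the integral in (5) with a sample average over noise-only power-spectral values, and it assumes that the speech and noise components of the input and output are available separately so that the two SNRs can be formed. Cepstral distortion [11] is not covered here.

```python
import math


def raw_moment(power, m):
    """Sample estimate of the m-th order moment in (5)."""
    return sum(p ** m for p in power) / len(power)


def kurtosis(power):
    """Eq. (4) evaluated on power-spectral-domain samples."""
    return raw_moment(power, 4) / raw_moment(power, 2) ** 2


def empirical_kurtosis_ratio(processed_power, original_power):
    """Eq. (3): both arguments hold noise-only power-spectral samples."""
    return kurtosis(processed_power) / kurtosis(original_power)


def nrr_db(speech_in, noise_in, speech_out, noise_out):
    """NRR: output SNR minus input SNR in dB, from power samples."""
    def mean(values):
        return sum(values) / len(values)

    snr_in = 10.0 * math.log10(mean(speech_in) / mean(noise_in))
    snr_out = 10.0 * math.log10(mean(speech_out) / mean(noise_out))
    return snr_out - snr_in
```

A pure rescaling of the noise power leaves the kurtosis unchanged, so `empirical_kurtosis_ratio` returns 1 in that case; values above 1 indicate the isolated spectral peaks that are heard as musical noise.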

Fig. 7. (a) Relation between noise reduction rate and kurtosis ratio, and (b) relation between noise reduction rate and cepstral distortion for museum noise case. (Curves: conventional BSSA and iterative BSSA, with the iteration numbers marked along the iterative BSSA trace.)

Fig. 8. Results of kurtosis ratio and cepstral distortion for four types of noise. (Bars: conventional BSSA and iterative BSSA for museum, railway station, traffic, and entrance hall noise.)

4.3. Subjective evaluation

Since we found the above-mentioned trade-off, we next conducted a subjective evaluation to settle the performance comparison. In the evaluation, we presented a pair of signals processed by the conventional BSSA and iterative BSSA in random order to nine examinees, who selected which signal they preferred. The result of the experiment is shown in Fig. 9. It is found that iterative BSSA gains a higher preference score than the conventional BSSA, indicating the higher sound quality of the proposed method in terms of human perception. This result is plausible because humans are often more sensitive to musical noise than to speech distortion, as indicated in past studies, e.g., [12].

Fig. 9. Subjective evaluation result. (Preference scores of conventional BSSA and iterative BSSA with 95% confidence intervals.)

5. CONCLUSION

In this paper, we proposed iterative BSSA using a new BSSA structure, which generates almost no musical noise even with increasing noise reduction. From the evaluation experiments, it was shown that there is a trade-off between the amount of musical noise generation and speech distortion in both the conventional BSSA and iterative BSSA. However, in a subjective preference test, iterative BSSA obtained a higher preference score than the conventional BSSA. Thus, iterative BSSA is advantageous over the conventional BSSA in terms of sound quality.

6. REFERENCES

[1] S. F. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Trans. Acoustics, Speech, and Signal Processing, vol.27, no.2, pp.113-120, 1979.
[2] K. Yamashita, S. Ogata, T. Shimamura, "Spectral subtraction iterated with weighting factors," Proc. IEEE Speech Coding Workshop, pp.138-140, 2002.
[3] M. R. Khan, T. Hansen, "Iterative noise power subtraction technique for improved speech quality," Proc. ICECE2008, pp.391-394, 2008.
[4] S. Li, J.-Q. Wang, M. Niu, X.-J. Jing, T. Liu, "Iterative spectral subtraction method for millimeter-wave conducted speech enhancement," Journal of Biomedical Science and Engineering, vol.2010, no.3, pp.187-192, 2010.
[5] T. Inoue, H. Saruwatari, Y. Takahashi, K. Shikano, K. Kondo, "Theoretical analysis of iterative weak spectral subtraction via higher-order statistics," Proc. MLSP2010, pp.220-225, 2010.
[6] P. Comon, "Independent component analysis, a new concept?," Signal Processing, vol.36, pp.287-314, 1994.
[7] Y. Uemura, Y. Takahashi, H. Saruwatari, K. Shikano, K. Kondo, "Automatic optimization scheme of spectral subtraction based on musical noise assessment via higher-order statistics," Proc. IWAENC2008, 2008.
[8] R. Miyazaki, H. Saruwatari, K. Shikano, K. Kondo, "Musical-noise-free speech enhancement: theory and evaluation," Proc. ICASSP2012 (in printing).
[9] Y. Takahashi, T. Takatani, K. Osako, H. Saruwatari, K. Shikano, "Blind spatial subtraction array for speech enhancement in noisy environment," IEEE Trans. Audio, Speech, and Lang. Process., vol.17, no.4, pp.650-664, 2009.
[10] H. Saruwatari, Y. Ishikawa, Y. Takahashi, T. Inoue, K. Shikano, K. Kondo, "Musical noise controllable algorithm of channelwise spectral subtraction and adaptive beamforming based on higher order statistics," IEEE Trans. Audio, Speech, and Lang. Process., vol.19, no.6, pp.1457-1466, 2011.
[11] L. Rabiner and B. Juang, Fundamentals of Speech Recognition. Upper Saddle River, NJ: Prentice-Hall, 1993.
[12] Y. Uemura, Y. Takahashi, H. Saruwatari, K. Shikano, K. Kondo, "Musical noise generation analysis for noise reduction methods based on spectral subtraction and MMSE STSA estimation," Proc. ICASSP, pp.4433-4436, 2009.
