Speech Kurtosis Estimation From Observed Noisy Signal Based on Generalized Gaussian Distribution Prior and Additivity of Cumulants


†Ryo Wakisaka, †Hiroshi Saruwatari, †Kiyohiro Shikano, and ‡Tomoya Takatani

†Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma, Nara, 630-0192 Japan
‡Toyota Motor Corporation, 543 Kirigabora Nishihirose-cho, Toyota, Aichi, 470-0309 Japan

ABSTRACT

In this paper, we propose a new method for stable estimation of the kurtosis of a speech power spectrum. Speech kurtosis can be used for the prediction of speech recognition accuracy, as reported in recent studies. However, the conventional estimation method is very unstable owing to the high sensitivity of higher-order statistics. To overcome this problem, we introduce the generalized Gaussian distribution prior in order to avoid the calculation of higher-order statistics, and construct a kurtosis table that directly represents the relationship among the kurtosis of speech, noise, and their mixture in the power spectrum domain. Speech kurtosis can be estimated stably from observable data by looking up values in the table. An experimental evaluation confirms the efficacy of the proposed method.

Index Terms: Speech kurtosis estimation, generalized Gaussian distribution, kurtosis table, moment-cumulant transformation

This work was partly supported by JST Core Research of Evolutional Science and Technology (CREST), Japan. 978-1-4673-0046-9/12/$26.00 ©2012 IEEE. ICASSP 2012.

1. INTRODUCTION

In recent years, many applications of speech communication systems have been developed, resulting in real-world human interfaces. In such applications, the essential requirement is robustness against environmental noise. Therefore, many nonlinear noise reduction methods, such as spectral subtraction and MMSE short-time spectral amplitude estimation, have been actively studied [1]. Several metrics have been proposed as measures of evaluation for these methods, namely, the noise reduction rate (or SNR) [2], cepstral distortion [3], and kurtosis ratio [4], which correspond to the amounts of noise reduction, speech distortion, and musical noise generation, respectively.

Since speech distortion affects speech recognition accuracy, a measure of speech distortion is particularly indispensable as an indicator for optimizing speech recognition systems. The calculation of cepstral distortion, which is a commonly used measure of speech distortion, requires a reference (clean) speech signal. However, in actual situations, the speech component is always overlapped with noise, and we cannot obtain a clean speech signal. Consequently, the optimal parameters in the noise reduction method cannot be estimated. To overcome this problem, as an unsupervised measure of speech distortion estimated in a reference-free manner, the kurtosis of the speech power spectrum has been proposed by the authors, which is effective for optimizing parameters in the noise reduction method and predicting speech recognition accuracy [5, 6].

The main problem in our previous method [5] is the low robustness in the estimation of higher-order statistics. In this method, it is necessary to calculate up to eighth-order statistics in the observed signal waveform domain. Since sixth- and eighth-order statistics are very sensitive to outliers, we cannot estimate them stably from observable finite samples, causing considerable degradation of the estimated kurtosis of the speech power spectrum. To solve this problem, in this paper we propose a new method using a statistical prior on the waveform signals, where the waveform signals of speech and noise obey the generalized Gaussian distribution [7]. Using the prior, the additivity of cumulants, and a moment-cumulant transformation, we can construct a kurtosis table, which represents the direct relationship among the kurtosis of speech, noise, and their mixture in the power spectrum domain. Then, the kurtosis of the speech power spectrum is estimated from observable signals without any references by looking up values in the table. We conduct an evaluation experiment and confirm that the accuracy of speech kurtosis estimation is markedly improved by the proposed method even if the snapshot data length is only 1 s.

2. PREVIOUS WORKS

2.1. Problem and strategy

In this section, we describe the conventional method of speech kurtosis estimation for evaluating pure distortion that arises only in the speech component. We consider an acoustic mixing model, where the observed signal consists of a target speech signal and an additive noise signal. Hereafter, the observed signal in the time-frequency domain, \(x(f, \tau)\), is given by

\[ x(f, \tau) = s(f, \tau) + n(f, \tau), \tag{1} \]

where \(f\) is the frequency bin number, \(\tau\) is the time-frame index number, \(s(f, \tau)\) is the target speech signal component, and \(n(f, \tau)\) is the additive noise signal. Since the speech component is always contaminated with noise at every time-frequency grid, it is difficult to estimate the speech kurtosis via theoretical analysis. Therefore, we inversely calculate the kurtosis of the speech power spectrum in a data-driven manner, utilizing two observable statistics of the noisy speech signal \(x(f, \tau)\) and noise signal \(n(f, \tau)\) (we assume that the noise statistics can be measured in a speech-absent time period by voice activity detection or BSS-based noise estimation [5]). Note that the proposed speech kurtosis estimation is an unsupervised method because it requires no reference (clean) speech signals, unlike cepstral distortion.

To cope with the mathematical problem that the mixing of speech and noise is additive but generally their higher-order moments are not additive, we introduce the cumulant, which retains the additivity for additive variables. Meanwhile, in the transformation from a waveform to its power spectrum, the exponentiation operation is conducted. The cumulant has no straightforward relationship under this operation, so in this case we use the moment instead of the cumulant. Thus, we previously proposed the use of a moment-cumulant transformation [5].

2.2. Moment-cumulant transformation

In this section, we give some formulas regarding the moment-cumulant transformation. They explicitly represent the relations between the moment and cumulant in each order, which are useful for estimating the kurtosis of a speech power spectrum. The mth-order moment \(\mu_m(y)\) can be written as

\[ \mu_m(y) = \sum_{\pi(m)} \prod_{B \in \pi(m)} \kappa_{|B|}(y), \tag{2} \]

where \(\pi(m)\) runs through the list of all partitions of a set of size \(m\), \(B \in \pi(m)\) means that \(B\) is one of the blocks into which the set is partitioned, and \(|B|\) is the size of set \(B\). In the same manner, the mth-order cumulant \(\kappa_m(y)\) is given by

\[ \kappa_m(y) = \sum_{\pi(m)} (-1)^{|\pi(m)|-1} \left(|\pi(m)|-1\right)! \prod_{B \in \pi(m)} \mu_{|B|}(y). \tag{3} \]

2.3. Estimation of speech kurtosis from observations [5]

Hereafter, to deal with time-frequency-domain signals, we define complex-valued variables of the observed (noisy speech) signal, the original speech signal, and the noise signal as \((x_R + i x_I)\), \((s_R + i s_I)\), and \((n_R + i n_I)\), respectively, where \(x_R = s_R + n_R\) and \(x_I = s_I + n_I\) hold. Only the statistics of \((x_R + i x_I)\) and \((n_R + i n_I)\) are observable, but that of \((s_R + i s_I)\) is a hidden value to be estimated. First, we measure the following mth-order moments from the data:

\[ \mu_m(x_R) = \mu_m(x_I) = E[x_R^m], \tag{4} \]

\[ \mu_m(n_R) = \mu_m(n_I) = E[n_R^m], \tag{5} \]

where we assume that \(s_R\) and \(s_I\) are i.i.d., and this also holds for the noise and observed signals. In [5], the kurtosis of the speech power spectrum is estimated from the following equation using (4), (5), and the additivity of cumulants:

\[ \mathrm{kurt}_{\mathrm{speech}} = \frac{\mu_4(s_R^2 + s_I^2)}{\mu_2^2(s_R^2 + s_I^2)} = \frac{\mathcal{N}(\mu_m(x_R), \mu_m(n_R))}{\mathcal{D}(\mu_m(x_R), \mu_m(n_R))}, \tag{7} \]

where

\[ \begin{aligned} \mathcal{N}(\mu_m(x_R), \mu_m(n_R)) ={}& \mu_8(x_R) - \mu_8(n_R) \\ &+ [4\mu_2(x_R) - 32\mu_2(n_R)]\,\mu_6(x_R) + [-32\mu_2(x_R) + 60\mu_2(n_R)]\,\mu_6(n_R) \\ &+ [-76\mu_4(n_R) - 96\mu_2(x_R)\mu_2(n_R) + 516\mu_2^2(n_R)]\,\mu_4(x_R) \\ &+ [-60\mu_2^2(x_R) + 1056\mu_2(x_R)\mu_2(n_R) - 1416\mu_2^2(n_R)]\,\mu_4(n_R) \\ &+ 3\mu_4^2(x_R) + 73\mu_4^2(n_R) + 468\mu_2^2(x_R)\mu_2^2(n_R) \\ &- 3456\mu_2(x_R)\mu_2^3(n_R) + 2988\mu_2^4(n_R), \end{aligned} \tag{8} \]

\[ \mathcal{D}(\mu_m(x_R), \mu_m(n_R)) = 2\left(\mu_4(x_R) - \mu_4(n_R) + \mu_2^2(x_R) - 8\mu_2(x_R)\mu_2(n_R) + 7\mu_2^2(n_R)\right)^2. \tag{9} \]

3. PROPOSED METHOD

3.1. Problem of conventional method and motivation

The conventional method can estimate the kurtosis of a speech power spectrum without a clean speech signal to some extent. However, the estimate obtained with the conventional method is often very unstable because of the instability in estimating very high order (sixth- and eighth-order) statistics from finite samples. To avoid this, we propose to estimate speech kurtosis directly in the power spectrum domain instead of estimating the kurtosis of the speech power spectrum using sixth- and eighth-order statistics obtained from waveform signals in the time-frequency domain. More specifically, we calculate the kurtosis of the speech power spectrum by looking up values in a kurtosis table, which represents the direct relation among the kurtosis of speech, noise, and observed (noisy speech) signals in the power spectrum domain.

To construct the kurtosis table, we have to determine the mathematical relationship among the kurtosis of signals. However, there exist infinite patterns of signals that have an equivalent kurtosis value. Therefore, it is quite difficult to uniquely determine each signal. To avoid this problem, we apply a statistical assumption to the waveform signals of speech and noise. In the following section, we describe the statistical assumption in detail.

3.2. Parametric model

In the proposed method, we introduce the generalized Gaussian distribution for modeling the waveform signals of speech and noise. The probability density function (p.d.f.) of the generalized Gaussian distribution is defined as

\[ p(y) = \frac{\beta}{2\alpha\Gamma(1/\beta)}\, e^{-(|y|/\alpha)^{\beta}}, \tag{10} \]

where \(\alpha\) is a scale parameter, \(\beta\) is a shape parameter, and \(\Gamma(\cdot)\) denotes the gamma function. Next, the mth-order moment of the generalized Gaussian distribution is given, for even \(m\) (odd-order moments vanish by symmetry), by

\[ \mu_m(y) = \int_{-\infty}^{\infty} y^m p(y)\, dy = \alpha^m \frac{\Gamma((m+1)/\beta)}{\Gamma(1/\beta)}. \tag{11} \]

The kurtosis table is constructed using (10) and the moment-cumulant transformation in the next subsection.

3.3. Speech kurtosis estimation based on generalized Gaussian distribution prior

First, the mth-order moments of the waveform (time-frequency grid) signals of speech and noise are calculated as

\[ \mu_m(s_R) = \alpha_s^m \frac{\Gamma((m+1)/\beta_s)}{\Gamma(1/\beta_s)}, \qquad \mu_m(n_R) = \alpha_n^m \frac{\Gamma((m+1)/\beta_n)}{\Gamma(1/\beta_n)}, \tag{12} \]

where \(\alpha_s\) and \(\alpha_n\) are the scale parameters in the distributions for speech and noise signals, and \(\beta_s\) and \(\beta_n\) are the shape parameters in these distributions, respectively. Next, the moment of the square of \(s_R\) is given by

\[ \mu_m(s_R^2) = \mu_{2m}(s_R) = \alpha_s^{2m} \frac{\Gamma((2m+1)/\beta_s)}{\Gamma(1/\beta_s)}. \tag{13} \]

Then we can calculate the cumulant of the power spectrum \(s_R^2 + s_I^2\) as

\[ \kappa_m(s_R^2 + s_I^2) = 2\kappa_m(s_R^2) = 2\sum_{\pi(m)} (-1)^{|\pi(m)|-1} \left(|\pi(m)|-1\right)! \prod_{B \in \pi(m)} \mu_{|B|}(s_R^2), \tag{14} \]

and the mth-order moment of the power spectrum is given by

\[ \mu_m(s_R^2 + s_I^2) = \sum_{\pi(m)} \prod_{B \in \pi(m)} \kappa_{|B|}(s_R^2 + s_I^2). \tag{15} \]
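The chain from the generalized Gaussian moment through the moment-cumulant transformations (2), (3) and the relations (13)-(15) can be checked numerically. The following is a minimal sketch, not the authors' implementation: all function names (`ggd_moment`, `set_partitions`, `cumulant_from_moments`, `moment_from_cumulants`, `power_spectrum_kurtosis`) are hypothetical, the partitions are enumerated by brute force, and the kurtosis is taken as the ratio of raw moments \(\mu_4/\mu_2^2\) of the power spectrum.

```python
import math


def ggd_moment(m, alpha, beta):
    """m-th raw moment of a zero-mean generalized Gaussian variable."""
    if m % 2 == 1:
        return 0.0  # odd moments vanish by symmetry
    return alpha ** m * math.gamma((m + 1) / beta) / math.gamma(1 / beta)


def set_partitions(items):
    """Yield every partition of the list `items` as a list of blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in set_partitions(rest):
        # Put `first` into each existing block in turn ...
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        # ... or into a new block of its own.
        yield [[first]] + part


def cumulant_from_moments(m, mu):
    """m-th cumulant from raw moments mu[1..m], following Eq. (3)."""
    total = 0.0
    for part in set_partitions(list(range(m))):
        k = len(part)
        term = (-1) ** (k - 1) * math.factorial(k - 1)
        for block in part:
            term *= mu[len(block)]
        total += term
    return total


def moment_from_cumulants(m, kappa):
    """m-th raw moment from cumulants kappa[1..m], following Eq. (2)."""
    total = 0.0
    for part in set_partitions(list(range(m))):
        term = 1.0
        for block in part:
            term *= kappa[len(block)]
        total += term
    return total


def power_spectrum_kurtosis(beta, alpha=1.0):
    """Kurtosis mu_4 / mu_2^2 of s_R^2 + s_I^2 for i.i.d. GGD parts."""
    # Moments of the squared waveform: mu_k(s_R^2) = mu_{2k}(s_R).
    mu_sq = {k: ggd_moment(2 * k, alpha, beta) for k in range(1, 5)}
    # Additivity of cumulants: the sum of two i.i.d. terms doubles each cumulant.
    kappa_ps = {k: 2.0 * cumulant_from_moments(k, mu_sq) for k in range(1, 5)}
    mu2 = moment_from_cumulants(2, kappa_ps)
    mu4 = moment_from_cumulants(4, kappa_ps)
    return mu4 / mu2 ** 2
```

For a Gaussian waveform (\(\beta = 2\)) the power spectrum is exponentially distributed, so the function should return 6 up to rounding; smaller \(\beta\) gives heavier tails and a larger value. The result does not depend on `alpha`, which is why the kurtosis of each source alone is a function of the shape parameter only.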

Finally, using (11) and (13)-(15), the kurtosis of the speech power spectrum is derived as a function of the shape parameter \(\beta_s\),

\[ \mathrm{kurt}_{\mathrm{speech}} = \frac{\mu_4(s_R^2 + s_I^2)}{\mu_2(s_R^2 + s_I^2)^2} = \frac{\mathcal{N}_s(\beta_s)}{\mathcal{D}_s(\beta_s)}, \tag{16} \]

where

\[ \mathcal{N}_s(\beta_s) = \Gamma(9/\beta_s)\Gamma(1/\beta_s)^3 + 4\Gamma(7/\beta_s)\Gamma(3/\beta_s)\Gamma(1/\beta_s)^2 + 3\Gamma(5/\beta_s)^2\Gamma(1/\beta_s)^2, \tag{17} \]

\[ \mathcal{D}_s(\beta_s) = 2\Gamma(5/\beta_s)^2\Gamma(1/\beta_s)^2 + 4\Gamma(5/\beta_s)\Gamma(3/\beta_s)^2\Gamma(1/\beta_s) + 2\Gamma(3/\beta_s)^4. \tag{18} \]

Next, in the same manner, the kurtosis of the noise power spectrum is defined as

\[ \mathrm{kurt}_{\mathrm{noise}} = \frac{\mu_4(n_R^2 + n_I^2)}{\mu_2(n_R^2 + n_I^2)^2} = \frac{\mathcal{N}_n(\beta_n)}{\mathcal{D}_n(\beta_n)}, \tag{19} \]

where

\[ \mathcal{N}_n(\beta_n) = \Gamma(9/\beta_n)\Gamma(1/\beta_n)^3 + 4\Gamma(7/\beta_n)\Gamma(3/\beta_n)\Gamma(1/\beta_n)^2 + 3\Gamma(5/\beta_n)^2\Gamma(1/\beta_n)^2, \tag{20} \]

\[ \mathcal{D}_n(\beta_n) = 2\Gamma(5/\beta_n)^2\Gamma(1/\beta_n)^2 + 4\Gamma(5/\beta_n)\Gamma(3/\beta_n)^2\Gamma(1/\beta_n) + 2\Gamma(3/\beta_n)^4. \tag{21} \]

Next, we calculate the kurtosis of the observed (speech-noise mixture) signal. Generally, the cumulant has additivity for additive independent variables, i.e., \(\kappa_m(a + b) = \kappa_m(a) + \kappa_m(b)\). Using this relation and (3), we can estimate the cumulant of the observed signal as

\[ \begin{aligned} \kappa_m(x_R) &= \kappa_m(s_R) + \kappa_m(n_R) \\ &= \sum_{\pi(m)} (-1)^{|\pi(m)|-1} \left(|\pi(m)|-1\right)! \prod_{B \in \pi(m)} \mu_{|B|}(s_R) + \sum_{\pi(m)} (-1)^{|\pi(m)|-1} \left(|\pi(m)|-1\right)! \prod_{B \in \pi(m)} \mu_{|B|}(n_R). \end{aligned} \tag{22} \]

The moment of the square of \(x_R\) is given by

\[ \mu_m(x_R^2) = \mu_{2m}(x_R) = \sum_{\pi(2m)} \prod_{B \in \pi(2m)} \kappa_{|B|}(x_R). \tag{23} \]

Then, we can estimate the kurtosis of the observed signal in the power spectrum domain in a similar way to (14)-(16). The kurtosis of the observed signal in the power spectrum domain is calculated as

\[ \mathrm{kurt}_{\mathrm{observed}} = \frac{\mu_4(x_R^2 + x_I^2)}{\mu_2(x_R^2 + x_I^2)^2} = \frac{\mathcal{N}_x(\beta_s, \alpha_s, \beta_n, \alpha_n)}{\mathcal{D}_x(\beta_s, \alpha_s, \beta_n, \alpha_n)}, \tag{24} \]

where, with numerator and denominator both scaled by the common factor \(\Gamma(1/\beta_s)^4\Gamma(1/\beta_n)^4\),

\[ \begin{aligned} \mathcal{N}_x ={}& \alpha_s^8 \left\{\Gamma(9/\beta_s)\Gamma(1/\beta_s)^3 + 4\Gamma(7/\beta_s)\Gamma(3/\beta_s)\Gamma(1/\beta_s)^2 + 3\Gamma(5/\beta_s)^2\Gamma(1/\beta_s)^2\right\} \Gamma(1/\beta_n)^4 \\ &+ \alpha_s^6\alpha_n^2 \left\{32\Gamma(7/\beta_s)\Gamma(1/\beta_s) + 96\Gamma(5/\beta_s)\Gamma(3/\beta_s)\right\} \Gamma(3/\beta_n)\Gamma(1/\beta_s)^2\Gamma(1/\beta_n)^3 \\ &+ \alpha_s^4\alpha_n^4 \left\{76\Gamma(5/\beta_s)\Gamma(5/\beta_n)\Gamma(1/\beta_s)\Gamma(1/\beta_n) + 60\Gamma(5/\beta_s)\Gamma(3/\beta_n)^2\Gamma(1/\beta_s) \right. \\ &\qquad\qquad \left. {}+ 60\Gamma(3/\beta_s)^2\Gamma(5/\beta_n)\Gamma(1/\beta_n) + 108\Gamma(3/\beta_s)^2\Gamma(3/\beta_n)^2\right\} \Gamma(1/\beta_s)^2\Gamma(1/\beta_n)^2 \\ &+ \alpha_s^2\alpha_n^6 \left\{32\Gamma(7/\beta_n)\Gamma(1/\beta_n) + 96\Gamma(5/\beta_n)\Gamma(3/\beta_n)\right\} \Gamma(3/\beta_s)\Gamma(1/\beta_s)^3\Gamma(1/\beta_n)^2 \\ &+ \alpha_n^8 \left\{\Gamma(9/\beta_n)\Gamma(1/\beta_n)^3 + 4\Gamma(7/\beta_n)\Gamma(3/\beta_n)\Gamma(1/\beta_n)^2 + 3\Gamma(5/\beta_n)^2\Gamma(1/\beta_n)^2\right\} \Gamma(1/\beta_s)^4, \end{aligned} \tag{25} \]

\[ \begin{aligned} \mathcal{D}_x = 2\Big[ &\alpha_s^4 \left\{\Gamma(5/\beta_s)\Gamma(1/\beta_s) + \Gamma(3/\beta_s)^2\right\} \Gamma(1/\beta_n)^2 + 8\alpha_s^2\alpha_n^2\, \Gamma(3/\beta_s)\Gamma(3/\beta_n)\Gamma(1/\beta_s)\Gamma(1/\beta_n) \\ &+ \alpha_n^4 \left\{\Gamma(5/\beta_n)\Gamma(1/\beta_n) + \Gamma(3/\beta_n)^2\right\} \Gamma(1/\beta_s)^2 \Big]^2. \end{aligned} \tag{26} \]

From the above-mentioned results, we can confirm the following. (a) The speech kurtosis, \(\mathrm{kurt}_{\mathrm{speech}}\), given by (16), is a function of \(\beta_s\) only, and the noise kurtosis, \(\mathrm{kurt}_{\mathrm{noise}}\), given by (19), is a function of \(\beta_n\) only. Thus, they are independent variables. (b) The kurtosis of the observed signal, \(\mathrm{kurt}_{\mathrm{observed}}\), is a complicated function of \(\beta_s\), \(\beta_n\), \(\alpha_s\), and \(\alpha_n\). Thus, given \(\alpha_s\) and \(\alpha_n\), we can plot the value of \(\mathrm{kurt}_{\mathrm{observed}}\) on the two-dimensional independent axes of \(\mathrm{kurt}_{\mathrm{speech}}\) (of \(\beta_s\)) and \(\mathrm{kurt}_{\mathrm{noise}}\) (of \(\beta_n\)), yielding the desired kurtosis table. The scale parameter is defined using the variance and shape parameter of each signal as

\[ \alpha = \sqrt{\frac{\Gamma(1/\beta)}{\Gamma(3/\beta)}}\; \sigma, \tag{27} \]

where \(\sigma^2\) is the variance of the signal. We determine the scale parameter from the input SNR, which expresses the variance ratio of the speech to noise signals. The variance of noise, \(\sigma_n^2\), is measured in a noise-only time period, and that of speech, \(\sigma_s^2\), can be estimated as \(\sigma_x^2 - \sigma_n^2\). Thus, we can construct the kurtosis table at each input SNR by varying the shape and scale parameters. Then, the kurtosis of the speech power spectrum can be estimated from the kurtosis of the power spectra of the noise and observed signals by looking up values in the table. An example of such a kurtosis table is shown in Fig. 1. In this figure, the variance ratio of speech to noise signals is fixed to unity. Thus, this table should be used when the input SNR between the speech and noise signals is 0 dB.

Fig. 1. Kurtosis table in power spectrum domain when input SNR is 0 dB. [Figure not reproduced; recoverable axis labels: speech kurtosis, observed signal kurtosis.]

4. EXPERIMENTS

4.1. Experimental setup

To confirm the effectiveness of the proposed method, we conducted an experiment on kurtosis estimation of the speech power spectrum. In this experiment, the conventional method and the proposed method based on the generalized Gaussian distribution prior were compared. We used 200 utterances (100 males and 100 females from the Japanese newspaper dictation database) as the target speech signals and four types of noise signals, namely white Gaussian noise, railway station noise, museum noise, and babble noise. Furthermore, we used three different types of data length: (a) the full length of each utterance of 2-15 s (whole sentence), (b) the first half of each utterance (half of sentence), and (c) the first second of each utterance (one second). The test data were obtained by combining the target speech signals and noise signals. All signals used in this experiment were sampled at 16 kHz. The input SNR of the test data was set to 0, -5, or -10 dB.

The speech kurtosis in the power spectrum domain of the test data was estimated using the conventional and proposed methods. In the proposed method, we constructed the kurtosis table for each SNR. In these kurtosis tables, \(\mathrm{kurt}_{\mathrm{speech}}\) and \(\mathrm{kurt}_{\mathrm{noise}}\) were varied from 0 to 3500 in steps of three. We calculated the normalized error of the estimates in the conventional and proposed methods and compared the accuracy of speech kurtosis estimation. The normalized error is defined as

\[ e_{\mathrm{norm}} = \frac{|\mathrm{kurt}_{\mathrm{oracle}} - \mathrm{kurt}_{\mathrm{speech}}|}{\mathrm{kurt}_{\mathrm{oracle}}}, \]

where \(\mathrm{kurt}_{\mathrm{oracle}}\) is the power spectral kurtosis of the clean speech signal and \(\mathrm{kurt}_{\mathrm{speech}}\) is the estimate of the speech power spectral kurtosis.

4.2. Experimental results

Figure 2 shows examples of speech kurtosis estimation results for 10 utterances using the conventional and proposed methods when the input SNR is -10 dB. In Fig. 2, although the kurtosis of the clean speech signal has quite a high value, the kurtosis of the observed (noisy speech) signal is lower than that of the clean speech signal. Estimates close to original values can be obtained by using the conventional and proposed methods. However, sometimes the estimates of speech kurtosis have a large error in the conventional method; in contrast, the proposed method's results are very stable and accurate.

Fig. 2. Examples of speech kurtosis estimates for (i) white Gaussian noise, (ii) railway station noise, (iii) museum noise, and (iv) babble noise. Whole sentences are used in estimation. [Figure not reproduced; legend: clean speech signal, observed signal, conventional method, proposed method.]

Table 1 shows the average normalized error of estimates of speech kurtosis in the power spectrum domain using the conventional and proposed methods. In Table 1, estimates using the conventional method for the museum noise and babble noise have quite a large error, especially in the case of 1 s data length. However, these errors are reduced by using the proposed method. Also, in estimates for the white Gaussian noise and railway station noise, errors are lower for the proposed method. From these results, we can confirm that the accuracy of speech kurtosis estimation is markedly improved using the proposed method.

Table 1. Average normalized error of estimates of speech kurtosis in power spectrum domain using conventional and proposed methods for (i) white Gaussian noise, (ii) railway station noise, (iii) museum noise, and (iv) babble noise. [Table values not reproduced; rows are grouped by data length (whole sentence, half of sentence, one second) and method.]

5. CONCLUSION

In this paper, we proposed a new method for the stable estimation of speech kurtosis in the power spectrum domain based on the generalized Gaussian distribution prior in order to avoid the calculation of higher-order statistics. Using this prior and the additivity of cumulants, we can construct a kurtosis table that directly represents the relationship among the kurtosis of speech, noise, and their mixture in the power spectrum domain, and speech kurtosis can be estimated stably from the observable data. An experimental evaluation confirmed the efficacy of the proposed method.

6. REFERENCES

[1] P. C. Loizou, Speech Enhancement: Theory and Practice, CRC Press, Taylor & Francis Group, FL, 2007.
[2] H. Saruwatari et al., "Blind source separation combining independent component analysis and beamforming," EURASIP Journal on Applied Signal Processing, vol.2003, no.11, pp.1134-1146, 2003.
[3] L. Rabiner et al., Fundamentals of Speech Recognition. Upper Saddle River, NJ: Prentice Hall PTR, 1993.
[4] Y. Uemura et al., "Automatic optimization scheme of spectral subtraction based on musical noise assessment via higher-order statistics," Proc. IWAENC2008, 2008.
[5] R. Miyazaki et al., "[...] spatial subtraction array and its application to speech recognition performance prediction," Proc. HSCMA, pp.19-24, 2011.
[6] R. Wakisaka et al., "Blind speech prior estimation for generalized minimum mean-square error short-time spectral amplitude estimator," Proc. INTERSPEECH, pp.361-364, 2011.
[7] R. Prasad et al., "Probability distribution of time-series of speech spectral components," IEICE Trans. Fundamentals, vol.E87-A, no.3, pp.584-597, 2004.
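As a supplementary illustration of the table lookup in Section 3.3, the sketch below builds a small kurtosis table for one input SNR and reads the speech kurtosis back from a noise kurtosis and an observed kurtosis. It is a sketch under stated assumptions rather than the procedure used in the experiments: the table is sampled on a grid of shape parameters instead of kurtosis axes in steps of three, the lookup is a nearest-neighbour search in log-kurtosis, the noise variance is normalized to 1, and every function name is hypothetical. The mixture moments come from the binomial expansion for independent symmetric variables, which is equivalent to going through the cumulants.

```python
import math


def ggd_even_moments(beta, sigma2):
    """Moments of order 2, 4, 6, 8 of a zero-mean GGD with variance sigma2."""
    g1 = math.gamma(1 / beta)
    # Scale parameter obtained from the variance and the shape parameter.
    alpha = math.sqrt(sigma2 * g1 / math.gamma(3 / beta))
    return {m: alpha ** m * math.gamma((m + 1) / beta) / g1 for m in (2, 4, 6, 8)}


def mixture_moments(ms, mn):
    """Even moments of x_R = s_R + n_R for independent, symmetric s_R, n_R."""
    ms = {0: 1.0, **ms}
    mn = {0: 1.0, **mn}
    # Odd moments are zero, so only even k contribute to the binomial sum.
    return {m: sum(math.comb(m, k) * ms[k] * mn[m - k] for k in range(0, m + 1, 2))
            for m in (2, 4, 6, 8)}


def power_kurtosis(mu):
    """Kurtosis mu_4 / mu_2^2 of y_R^2 + y_I^2 from moments of y_R (i.i.d. parts)."""
    num = mu[8] + 4 * mu[6] * mu[2] + 3 * mu[4] ** 2
    den = 2 * (mu[4] + mu[2] ** 2) ** 2
    return num / den


def build_table(snr_db, betas):
    """Rows of (kurt_speech, kurt_noise, kurt_observed) for one input SNR."""
    sigma2_n = 1.0                       # assumption: noise variance normalized to 1
    sigma2_s = 10 ** (snr_db / 10)       # speech variance implied by the SNR
    rows = []
    for beta_s in betas:
        ms = ggd_even_moments(beta_s, sigma2_s)
        for beta_n in betas:
            mn = ggd_even_moments(beta_n, sigma2_n)
            rows.append((power_kurtosis(ms),
                         power_kurtosis(mn),
                         power_kurtosis(mixture_moments(ms, mn))))
    return rows


def lookup_speech_kurtosis(table, kurt_noise, kurt_observed):
    """Return kurt_speech of the row closest to the two measured values."""
    def dist(row):
        return (math.log(row[1] / kurt_noise) ** 2
                + math.log(row[2] / kurt_observed) ** 2)
    return min(table, key=dist)[0]
```

A table for 0 dB could be built with `build_table(0.0, [0.2 + 0.05 * i for i in range(37)])`; in practice the two lookup inputs would be the kurtosis values measured from the noise-only and noisy power spectra, which only requires fourth-order statistics in the power spectrum domain.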
