Bandwidth Extension of Cellular Phone Speech Based on Maximum Likelihood Estimation with GMM
The GMM is trained with the EM algorithm using the joint vectors in a training set. This training method robustly estimates model parameters compared with the least squares estimation [ ].

2.2. MMSE-based conversion

The conventional method performs the conversion based on MMSE as follows:

$$\hat{y}_t = E[y_t \mid x_t] = \int P(y_t \mid x_t, \theta)\, y_t \, dy_t = \sum_{m=1}^{M} P(m \mid x_t, \theta)\, E_{m,t}^{(y)}, \tag{3}$$

where

$$P(m \mid x_t, \theta) = \frac{\omega_m \, \mathcal{N}\big(x_t;\, \mu_m^{(x)}, \Sigma_m^{(xx)}\big)}{\sum_{j=1}^{M} \omega_j \, \mathcal{N}\big(x_t;\, \mu_j^{(x)}, \Sigma_j^{(xx)}\big)}, \tag{4}$$

$$E_{m,t}^{(y)} = \mu_m^{(y)} + \Sigma_m^{(yx)} \, \Sigma_m^{(xx)^{-1}} \big(x_t - \mu_m^{(x)}\big). \tag{5}$$

2.3. Problems

Although MMSE has reasonably high conversion accuracy, there still remain some problems to be solved: 1) inappropriate spectral movements are caused by ignoring the correlation between frames, and 2) the converted spectra are excessively smoothed by the statistical modeling.

3. MLE-based bandwidth extension

In order to solve the two main problems of the conventional method, we proposed the bandwidth extension based on MLE considering dynamic features and the GV. Inter-frame correlation is considered for realizing the converted parameter trajectory with appropriate dynamic characteristics. The over-smoothing effects are alleviated by considering the GV, which captures one of the characteristics of the trajectory.

3.1. MLE considering dynamic features (ML) [ ]

We use a $2D_x$-dimensional narrowband speech feature vector $X_t = [x_t^\top, \Delta x_t^\top]^\top$ and a $2D_y$-dimensional wideband speech feature vector $Y_t = [y_t^\top, \Delta y_t^\top]^\top$, consisting of static and dynamic features at frame $t$. Their time sequences are written as $X = [X_1^\top, X_2^\top, \ldots, X_T^\top]^\top$ and $Y = [Y_1^\top, Y_2^\top, \ldots, Y_T^\top]^\top$, respectively. A time sequence of the converted static feature vectors $\hat{y} = [\hat{y}_1^\top, \hat{y}_2^\top, \ldots, \hat{y}_T^\top]^\top$ is determined as follows:

$$\hat{y} = \arg\max_{y} P(Y \mid X, \theta) \quad \text{subject to } Y = W y, \tag{6}$$

where $W$ is a conversion matrix that extends a sequence of the static feature vectors $y$ into that of the static and dynamic feature vectors. In order to effectively reduce the computational cost, we approximate the likelihood as follows,

$$P(Y \mid X, \theta) \simeq P(m \mid X, \theta)\, P(Y \mid X, m, \theta), \tag{7}$$

where $m$ is a mixture component sequence $[m_1, m_2, \ldots, m_T]$. After determining the sub-optimum mixture sequence, written as

$$\hat{m} = \arg\max_{m} P(m \mid X, \theta), \tag{8}$$

we determine $\hat{y}$ that maximizes the approximated likelihood function as follows:

$$\hat{y} = \arg\max_{y} P(\hat{m} \mid X, \theta)\, P(Y \mid X, \hat{m}, \theta) = \big(W^\top D_{\hat{m}}^{(Y)^{-1}} W\big)^{-1} W^\top D_{\hat{m}}^{(Y)^{-1}} E_{\hat{m}}^{(Y)}, \tag{9}$$

where

$$E_{\hat{m}}^{(Y)} = \big[E_{\hat{m}_1,1}^{(Y)\top}, E_{\hat{m}_2,2}^{(Y)\top}, \ldots, E_{\hat{m}_T,T}^{(Y)\top}\big]^\top, \tag{10}$$

$$D_{\hat{m}}^{(Y)^{-1}} = \mathrm{diag}\big[D_{\hat{m}_1}^{(Y)^{-1}}, D_{\hat{m}_2}^{(Y)^{-1}}, \ldots, D_{\hat{m}_T}^{(Y)^{-1}}\big]. \tag{11}$$

Here $E_{m,t}^{(Y)}$ and $D_m^{(Y)}$ are the conditional mean vector and the conditional covariance matrix of $Y_t$ given $X_t$ for the $m$-th mixture component; $E_{m,t}^{(Y)}$ has the form of (5) with $x_t$ and $y_t$ replaced by $X_t$ and $Y_t$. A GMM parameter set $\theta$ is estimated in advance with training data in the same manner as the conventional method.

3.2. MLE considering GV (MLGV) [ ]

The GV of the static feature vectors in each utterance is written as

$$v(y) = [v(1), v(2), \ldots, v(D)]^\top, \tag{12}$$

$$v(d) = \frac{1}{T} \sum_{t=1}^{T} \Big( y_t(d) - \frac{1}{T} \sum_{\tau=1}^{T} y_\tau(d) \Big)^2, \tag{13}$$

where $y_t(d)$ is the $d$-th component of the wideband static feature vector at frame $t$.

The following likelihood function, consisting of two probability densities for a sequence of the wideband feature vectors and for the GV of the wideband static feature vectors, is maximized,

$$L = \alpha \log P(Y \mid X, \hat{m}, \theta) + \log P(v(y) \mid \theta_v), \quad \text{subject to } Y = W y, \tag{14}$$

where $P(v(y) \mid \theta_v)$ is modeled by the normal distribution. A set of model parameters $\theta_v$ consists of the mean vector $\mu^{(v)}$ and the covariance matrix $\Sigma^{(vv)}$ for the GV vector $v(y)$, which is also estimated in advance with training data. The constant $\alpha$ is a likelihood weight. In this paper, it is set to [ ]. Note that the GV likelihood works as a penalty term for the over-smoothing.

In order to maximize the likelihood $L$ with respect to $y$, we employ a steepest descent algorithm using the first derivative,

$$\frac{\partial L}{\partial y} = \alpha \big( -W^\top D_{\hat{m}}^{(Y)^{-1}} W y + W^\top D_{\hat{m}}^{(Y)^{-1}} E_{\hat{m}}^{(Y)} \big) + \big[ v_1'^\top, v_2'^\top, \ldots, v_t'^\top, \ldots, v_T'^\top \big]^\top, \tag{15}$$

$$v_t' = \big[ v_t'(1), v_t'(2), \ldots, v_t'(d), \ldots, v_t'(D) \big]^\top, \tag{16}$$

$$v_t'(d) = -\frac{2}{T}\, p_v^{(d)\top} \big( v(y) - \mu^{(v)} \big) \big( y_t(d) - \bar{y}(d) \big), \tag{17}$$

where $\bar{y}(d) = \frac{1}{T} \sum_{\tau=1}^{T} y_\tau(d)$. The vector $p_v^{(d)}$ is the $d$-th column vector of $P_v = \Sigma^{(vv)^{-1}}$.
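As an illustration of the GV term, the following minimal sketch computes the global variance of a static feature trajectory and the derivative of the GV log-likelihood with respect to every trajectory element, following the definition of the GV and its first derivative given in Section 3.2. It is not the authors' implementation. The function names (`global_variance`, `gv_log_likelihood`, `gv_gradient`), the T x D array layout, and the assumption that the precision matrix `P_v` is symmetric are all choices made for this example.

```python
import numpy as np


def global_variance(y):
    """Per-dimension variance over the T frames of y (array of shape T x D)."""
    return np.mean((y - y.mean(axis=0)) ** 2, axis=0)


def gv_log_likelihood(y, mu_v, P_v):
    """log P(v(y) | theta_v) up to an additive constant.

    Assumes a normal distribution on the GV with mean mu_v and
    precision matrix P_v (inverse of the GV covariance matrix).
    """
    diff = global_variance(y) - mu_v
    return -0.5 * diff @ P_v @ diff


def gv_gradient(y, mu_v, P_v):
    """Derivative of the GV log-likelihood w.r.t. each y_t(d), shape T x D.

    Assumes P_v is symmetric, as a precision matrix should be.
    """
    T = y.shape[0]
    diff = global_variance(y) - mu_v       # v(y) - mu^(v), shape (D,)
    weights = P_v.T @ diff                 # entry d: (d-th column of P_v)^T diff
    return -(2.0 / T) * weights * (y - y.mean(axis=0))


if __name__ == "__main__":
    # Hypothetical toy data, used only to exercise the functions.
    rng = np.random.default_rng(0)
    y = rng.normal(size=(20, 3))
    grad = gv_gradient(y, mu_v=np.ones(3), P_v=np.eye(3))
    print(grad.shape)
```

In a steepest-descent style update, this gradient would be added to the weighted gradient of the dynamic-feature likelihood term before stepping; the step size and stopping rule are not specified here and would have to be chosen separately.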
Figure 1: Bandwidth extension system. "mcep" denotes the mel-cepstrum and "ap" denotes the aperiodic component.

4. Details of bandwidth extension process

Figure 1 shows the flow of the bandwidth extension.

Step 1: Extracting F0, mel-cepstrum and aperiodic component [ ] as speech features from the narrowband speech signal.

Step 2: Converting the mel-cepstrum and aperiodic component of the narrowband speech into those of the wideband speech.

Step 3: Generating STRAIGHT mixed excitation [ ] using the extracted F0 and the converted aperiodic component, and then synthesizing the estimated wideband speech with the MLSA filter [ ] based on the converted mel-cepstrum.

Step 4: The estimated wideband speech is separated into a low-band signal and a high-band signal with a low-pass filter (LPF) and a high-pass filter (HPF).

Step 5: The input narrowband speech is converted into a low-band signal with up-sampling.

Step 6: Power of the estimated high-band signal is adjusted so that power of the estimated low-band signal is equal to that of the input low-band signal.

Step 7: Reconstructing the wideband speech by adding the resulting high-band signal and the low-band signal.

5. Experimental evaluations

In order to demonstrate the effectiveness of the proposed method, we conducted a subjective evaluation.

5.1. Experimental conditions

We used 16 kHz sampled natural speech of 4 Japanese speakers (2 males, 2 females) as the wideband speech. The 3.4 kHz narrowband speech was prepared by down-sampling the wideband speech and then passing it through EVRC (Enhanced Variable Rate Codec) [ ]. The training data was 50 sentences from subset A of ATR's phonetically balanced sentence database. The evaluation data was 50 sentences from subset B of the same database. For narrowband, we used the 16-dimensional mel-cepstral coefficients from the mel-cepstral analysis [ ]. For wideband, we used the 24-dimensional mel-cepstral coefficients from the STRAIGHT analysis [ ]. We used the averaged aperiodic components [ ] on three frequency bands (0 to 1, 1 to 2, 2 to 4 kHz) for narrowband and those on five frequency bands (0 to 1, 1 to 2, 2 to 4, 4 to 6, 6 to 8 kHz) for wideband. The frame shift was 5 ms. The number of mixture components of the GMM for mel-cepstral conversion was set to 64. The number of mixture components of the GMM for aperiodic component conversion was set to 4. Speaker-dependent models were evaluated.

We conducted an opinion test on speech quality. The opinion score was set to a 5-point scale (5: excellent, 4: good, 3: fair, 2: poor, 1: bad). The evaluated speech samples consisted of EVRC, MMSE, ML, MLGV, and wideband natural speech (Natural). The listeners were eight Japanese adult men and women.

5.2. Experimental results

Figure 2 shows the result of the opinion test. There is no significant difference between EVRC and MMSE. On the other hand, the proposed method ML is significantly better than both EVRC and MMSE. The reconstructed wideband speech with the highest speech quality was obtained by considering the GV as well in the proposed method. An example of spectral sequences of the narrowband speech, the reconstructed speech and the natural wideband speech is shown in Figure 3. The proposed method estimates spectral envelopes considering inter-frame correlation while alleviating the over-smoothing effects.

Figure 2: Result of subjective evaluation (EVRC, MMSE, ML, MLGV, and Natural, with 95% confidence intervals).
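Steps 6 and 7 of the process in Section 4 (power adjustment of the estimated high band, then summation with the up-sampled input) can be sketched as follows. This is only an illustration under stated assumptions: the paper does not say whether the power is matched per frame or per utterance, so a single gain over the whole signal is assumed here, and the function names `adjust_high_band` and `reconstruct_wideband` as well as the small constant `eps` are hypothetical.

```python
import numpy as np


def adjust_high_band(est_high, est_low, input_low, eps=1e-12):
    """Scale the estimated high band by the gain that matches low-band powers.

    The gain is chosen so that the estimated low-band signal, scaled by it,
    would have the same mean power as the input low-band signal.
    Assumption: one gain for the whole signal (not per frame).
    """
    power_input = np.mean(input_low ** 2)
    power_est = np.mean(est_low ** 2)
    gain = np.sqrt((power_input + eps) / (power_est + eps))
    return gain * est_high


def reconstruct_wideband(input_low, est_high, est_low):
    """Add the power-adjusted high band to the up-sampled input low band."""
    return input_low + adjust_high_band(est_high, est_low, input_low)


if __name__ == "__main__":
    # Hypothetical 10 ms toy signals at 16 kHz, used only to run the sketch.
    t = np.arange(160) / 16000.0
    input_low = np.sin(2 * np.pi * 440 * t)
    est_low = 2.0 * input_low
    est_high = 0.1 * np.sin(2 * np.pi * 5000 * t)
    wideband = reconstruct_wideband(input_low, est_high, est_low)
    print(wideband.shape)
```

All three signals are assumed to be sampled at the wideband rate and to have the same length; the filtering of Step 4 and the up-sampling of Step 5 are outside this sketch.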
Figure 3: An example of spectral sequences of narrowband speech (a), converted speech by the ML using the GV (b), and spectra of natural speech (c), for a sentence fragment, "/ jy u: j i ts U sh 1 t e i k u r

6. Conclusions

We proposed bandwidth extension based on maximum likelihood estimation (MLE) with a Gaussian mixture model (GMM) considering dynamic features and the global variance (GV). The result of the subjective evaluation demonstrated that the speech quality of the narrowband speech is significantly improved by the proposed method. We plan to deal with a speaker-independent model, online processing and noise robustness.

7. Acknowledgments

This research was supported in part by the e-Society project and KDDI collaborative research.

References

[1] B. Bessette, R. Salami, R. Lefebvre, M. Jelinek, J. Rotola-Pukkila, J. Vainio, H. Mikkola and K. Jarvinen, "The adaptive multirate wideband speech codec (AMR-WB)," IEEE Trans. Speech and Audio Processing, Vol. 10, No. 8, pp. 620-636, 2002.

[2] Y. Yoshida and M. Abe, "An Algorithm to Reconstruct Wideband Speech From Narrowband Speech Based on Codebook Mapping," Proc. ICSLP94, pp. 1591-1593, 1994.

[3] K.Y. Park and H.S. Kim, "Narrowband to wideband conversion of speech using GMM based transformation," Proc. ICSLP, pp. 1847-1850, Istanbul, June 2000.

[4] Y. Stylianou, O. Cappé, and E. Moulines, "Continuous probabilistic transform for voice conversion," IEEE Trans. Speech and Audio Processing, Vol. 6, No. 2, pp. 131-142, 1998.

[5] M.L. Seltzer, A. Acero, and J. Droppo, "Robust Bandwidth Extension of Noise-corrupted Narrowband Speech," Proc. ICSLP, pp. 1509-1512, 2005.

[6] S. Yao and C.F. Chan, "Block-based Bandwidth Extension of Narrowband Speech Signal by using CDHMM," Proc. ICASSP, pp. 1793-1796, 2005.

[7] S.Y. Yao and C.F. Chan, "Speech bandwidth enhancement using state space speech dynamics," Proc. ICASSP2006, pp. 1489-1492, 2006.

[8] Y. Agiomyrgiannakis and Y. Stylianou, "Combined Estimation/coding of Highband Spectral Envelopes for Speech Spectrum Expansion," Proc. ICASSP2004, pp. 469-472, 2004.

[9] T. Toda, A.W. Black, and K. Tokuda, "Voice Conversion Based on Maximum-Likelihood Estimation of Spectral Parameter Trajectory," IEEE Trans. ASLP, Vol. 15, No. 8, pp. 2222-2234, 2007.

[10] A. Kain and M.W. Macon, "Spectral voice conversion for text-to-speech synthesis," Proc. ICASSP, Seattle, U.S.A., pp. 285-288, May 2004.

[11] H. Kawahara, J. Estill and O. Fujimura, "Aperiodicity extraction and control using mixed mode excitation and group delay manipulation for a high quality speech analysis, modification and synthesis system STRAIGHT," Proc. MAVEBA, Sep. 13-15, Firenze, Italy, 2001.

[12] H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigné, "Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds," Speech Commun., Vol. 27, No. 3-4, pp. 187-207, 1999.

[13] T. Fukada, K. Tokuda, T. Kobayashi, and S. Imai, "An adaptive algorithm for mel-cepstral analysis of speech," Proc. ICASSP, Vol. 1, pp. 137-140, San Francisco, USA, Mar. 1992.

[14] T.V. Ramabadran, J.P. Ashley and M.J. McLaughlin, "Background Noise Suppression for Speech Enhancement and Coding," IEEE Workshop on Speech Coding for Telecommunications, Pocono Manor, PA, pp. 43-44, 1997.