
Bandwidth Extension of Cellular Phone Speech Based on Maximum Likelihood Estimation with GMM




2008 International Workshop on Nonlinear Circuits and Signal Processing, NCSP'08, Gold Coast, Australia, March 6-8, 2008.

Wataru Fujitsuru†, Hidehiko Sekimoto†, Tomoki Toda†, Hiroshi Saruwatari† and Kiyohiro Shikano†
†Graduate School of Information Science, Nara Institute of Science and Technology, Nara, 631-0101 Japan
E-mail: †{wataru-f, hidehi-s, tomoki, sawatari, shikano}@is.naist.jp

Abstract

Bandwidth extension is a useful technique for reconstructing wideband speech from only narrowband speech. As a typical conventional method, a bandwidth extension algorithm based on minimum mean square error (MMSE) with a Gaussian mixture model (GMM) has been proposed [ ]. Although the MMSE-based method has reasonably high conversion accuracy, there still remain some problems to be solved: 1) inappropriate spectral movements are caused by ignoring a correlation between frames, and 2) the converted spectra are excessively smoothed by the statistical modeling. In order to address these problems, we propose a bandwidth extension algorithm based on maximum likelihood estimation (MLE) considering dynamic features and the global variance (GV) with a GMM. A result of a subjective test demonstrates that the proposed algorithm outperforms the conventional MMSE-based algorithm.

1. Introduction

The use of cellular phones enables us to easily communicate with each other through speech. In general, cellular phone speech is limited to a narrowband signal up to 3.4 kHz. Although narrowband speech is capable of intelligible communication, its sound quality is not high enough. Specifically, considerable quality degradation is observable at several phonemes such as fricatives and plosives, which have important energy distribution beyond 3.4 kHz. In order to realize higher-quality speech communication, a wideband speech codec has been developed [ ]. Essentially, it needs more information than the narrowband speech codec. Realizing wideband speech communication without increasing the amount of transmitted information would clearly be more convenient.

Bandwidth extension is a useful technique for reconstructing wideband speech from only narrowband speech. Several statistical approaches to bandwidth extension based on a spectral mapping have been studied. A codebook mapping method [ ] is an approach based on hard clustering and discrete mapping. A reconstructed feature vector at each frame is determined by quantizing the narrowband speech feature vector to the nearest centroid vector of the narrowband codebook and substituting it with the corresponding wideband centroid vector of the mapping codebook. One of the more sophisticated approaches, allowing soft clustering and continuous mapping, is a probabilistic bandwidth extension method based on a Gaussian mixture model (GMM) [ ]. The basic mapping algorithm was originally proposed for voice conversion [ ]. Most of the conventional GMM-based methods perform the mapping based on the MMSE criterion [ , ]. It has been reported that the mapping performance is further improved by modeling dynamic characteristics of a spectral sequence [ , ]. Moreover, an approach combining the mapping and coding processes has been studied [ ].

Recently, the performance of voice conversion with a GMM was significantly improved by maximum likelihood estimation (MLE) considering dynamic features and the global variance (GV) [ ]. The same approach is expected to improve the performance of bandwidth extension as well. This paper proposes bandwidth extension based on MLE considering dynamic features and the GV. The effectiveness of considering dynamic features and the GV is demonstrated by a subjective evaluation.

This paper is organized as follows. Section 2 describes the conventional bandwidth extension algorithm based on a GMM. Section 3 describes the proposed bandwidth extension based on MLE. Section 4 describes the flow of the bandwidth extension. Section 5 describes an experimental evaluation. Finally, we summarize this paper in Section 6.

2. MMSE-based bandwidth extension [ ]

2.1. Training

Let $x_t$ and $y_t$ be the $D_x$-dimensional narrowband and $D_y$-dimensional wideband feature vectors at frame $t$, respectively. The joint probability density of the narrowband and wideband feature vectors is modeled by a GMM as follows:

$$P(z_t \mid \theta) = \sum_{m=1}^{M} \omega_m \, \mathcal{N}\!\left(z_t;\, \mu_m^{(z)}, \Sigma_m^{(z)}\right), \tag{1}$$

where $z_t$ is a joint vector $[x_t^{\top}, y_t^{\top}]^{\top}$. The notation $\top$ denotes transposition of the vector. The number of mixture components is $M$. The weight of the $m$-th mixture component is $\omega_m$. The normal distribution with mean vector $\mu$ and covariance matrix $\Sigma$ is denoted as $\mathcal{N}(\cdot\,; \mu, \Sigma)$. The parameter set of the GMM is $\theta$, which consists of the weights, mean vectors and covariance matrices of the individual mixture components. The mean vector $\mu_m^{(z)}$ and the covariance matrix $\Sigma_m^{(z)}$ of the $m$-th mixture component are written as

$$\mu_m^{(z)} = \begin{bmatrix} \mu_m^{(x)} \\ \mu_m^{(y)} \end{bmatrix}, \qquad \Sigma_m^{(z)} = \begin{bmatrix} \Sigma_m^{(xx)} & \Sigma_m^{(xy)} \\ \Sigma_m^{(yx)} & \Sigma_m^{(yy)} \end{bmatrix}, \tag{2}$$

where $\mu_m^{(x)}$ and $\mu_m^{(y)}$ are the mean vectors of the $m$-th mixture component for the narrowband and the wideband, respectively, and $\Sigma_m^{(xx)}$ and $\Sigma_m^{(yy)}$ are the covariance matrices of the $m$-th mixture component for the narrowband and the wideband, respectively. The matrix $\Sigma_m^{(xy)}$ (with transpose $\Sigma_m^{(yx)}$) is the cross-covariance matrix of the $m$-th mixture component between the narrowband and the wideband.
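The block structure of Eq. (2) can be illustrated with a minimal sketch. It only shows how a joint vector is formed and how the joint mean and covariance of one mixture component split into narrowband and wideband blocks. The function names `make_joint_vector` and `split_joint_params` are hypothetical and not taken from the paper, and plain Python lists stand in for vectors and matrices.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def make_joint_vector(x_t: Vector, y_t: Vector) -> Vector:
    """Stack a narrowband vector x_t and a wideband vector y_t into z_t."""
    return list(x_t) + list(y_t)


def split_joint_params(
    mu_z: Vector, sigma_z: Matrix, dim_x: int
) -> Tuple[Vector, Vector, Matrix, Matrix, Matrix, Matrix]:
    """Split the joint mean and covariance of one component as in Eq. (2).

    dim_x is the narrowband dimensionality D_x; the remaining entries
    belong to the wideband part. Returns
    (mu_x, mu_y, sigma_xx, sigma_xy, sigma_yx, sigma_yy).
    """
    mu_x = mu_z[:dim_x]
    mu_y = mu_z[dim_x:]
    sigma_xx = [row[:dim_x] for row in sigma_z[:dim_x]]  # upper-left block
    sigma_xy = [row[dim_x:] for row in sigma_z[:dim_x]]  # upper-right block
    sigma_yx = [row[:dim_x] for row in sigma_z[dim_x:]]  # lower-left block
    sigma_yy = [row[dim_x:] for row in sigma_z[dim_x:]]  # lower-right block
    return mu_x, mu_y, sigma_xx, sigma_xy, sigma_yx, sigma_yy
```

In a real system the joint parameters would come from EM training on the joint vectors; here they are simply passed in, so the sketch says nothing about how the paper's models were estimated.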

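The GMM is trained with the EM algorithm using the joint vectors in a training set. This training method estimates the model parameters more robustly than least squares estimation [ ].

2.2. MMSE-based conversion

The conventional method performs the conversion based on MMSE as follows:

$$\hat{y}_t = E[y_t \mid x_t] = \int P(y_t \mid x_t, \theta)\, y_t \, dy_t = \sum_{m=1}^{M} P(m \mid x_t, \theta)\, E_{m,t}, \tag{3}$$

where

$$P(m \mid x_t, \theta) = \frac{\omega_m \, \mathcal{N}\!\left(x_t;\, \mu_m^{(x)}, \Sigma_m^{(xx)}\right)}{\sum_{j=1}^{M} \omega_j \, \mathcal{N}\!\left(x_t;\, \mu_j^{(x)}, \Sigma_j^{(xx)}\right)}, \tag{4}$$

$$E_{m,t} = \mu_m^{(y)} + \Sigma_m^{(yx)} \left(\Sigma_m^{(xx)}\right)^{-1} \left(x_t - \mu_m^{(x)}\right). \tag{5}$$

2.3. Problems

Although MMSE has reasonably high conversion accuracy, there still remain some problems to be solved: 1) inappropriate spectral movements are caused by ignoring a correlation between frames, and 2) the converted spectra are excessively smoothed by the statistical modeling.

3. MLE-based bandwidth extension

In order to solve the two main problems of the conventional method, we propose bandwidth extension based on MLE considering dynamic features and the GV. Inter-frame correlation is considered in order to obtain a converted parameter trajectory with appropriate dynamic characteristics. The oversmoothing effects are alleviated by considering the GV, which captures one of the characteristics of the trajectory.

3.1. MLE considering dynamic features (ML) [ ]

We use the $2D_x$-dimensional narrowband speech feature vector $X_t = [x_t^{\top}, \Delta x_t^{\top}]^{\top}$ and the $2D_y$-dimensional wideband speech feature vector $Y_t = [y_t^{\top}, \Delta y_t^{\top}]^{\top}$, consisting of static and dynamic features at frame $t$. Their time sequences are written as $X = [X_1^{\top}, X_2^{\top}, \ldots, X_T^{\top}]^{\top}$ and $Y = [Y_1^{\top}, Y_2^{\top}, \ldots, Y_T^{\top}]^{\top}$, respectively. A time sequence of the converted static feature vectors $\hat{y} = [\hat{y}_1^{\top}, \hat{y}_2^{\top}, \ldots, \hat{y}_T^{\top}]^{\top}$ is determined as follows:

$$\hat{y} = \arg\max_{y} P(Y \mid X, \theta) \quad \text{subject to } Y = Wy, \tag{6}$$

where $W$ is a conversion matrix that extends a sequence of the static feature vectors $y$ into that of the static and dynamic feature vectors. In order to effectively reduce the computational cost, we approximate the likelihood as follows:

$$P(Y \mid X, \theta) \simeq P(m \mid X, \theta)\, P(Y \mid X, m, \theta), \tag{7}$$

where $m$ is a mixture component sequence $[m_1, m_2, \ldots, m_T]$. After determining the sub-optimum mixture sequence, written as

$$\hat{m} = \arg\max_{m} P(m \mid X, \theta), \tag{8}$$

we determine $\hat{y}$ that maximizes the approximated likelihood function as follows:

$$\begin{aligned} \hat{y} &= \arg\max_{y} P(\hat{m} \mid X, \theta)\, P(Y \mid X, \hat{m}, \theta) \quad \text{subject to } Y = Wy \\ &= \left(W^{\top} D_{\hat{m}}^{(Y)^{-1}} W\right)^{-1} W^{\top} D_{\hat{m}}^{(Y)^{-1}} E_{\hat{m}}^{(Y)}, \end{aligned} \tag{9}$$

where

$$E_{\hat{m}}^{(Y)} = \left[E_{\hat{m}_1,1}^{(Y)\top}, E_{\hat{m}_2,2}^{(Y)\top}, \ldots, E_{\hat{m}_T,T}^{(Y)\top}\right]^{\top}, \tag{10}$$

$$D_{\hat{m}}^{(Y)^{-1}} = \mathrm{diag}\left[D_{\hat{m}_1}^{(Y)^{-1}}, D_{\hat{m}_2}^{(Y)^{-1}}, \ldots, D_{\hat{m}_T}^{(Y)^{-1}}\right]. \tag{11}$$

Here $E_{m,t}^{(Y)}$ and $D_m^{(Y)}$ are the conditional mean vector and the conditional covariance matrix of $Y_t$ given $X_t$ for the $m$-th mixture component, obtained in the same way as Eq. (5) but with the static and dynamic feature vectors. The GMM parameter set $\theta$ is estimated in advance with training data in the same manner as in the conventional method.

3.2. MLE considering GV (MLGV) [ ]

The GV of the static feature vectors in each utterance is written as

$$v(y) = [v(1), v(2), \ldots, v(D)]^{\top}, \tag{12}$$

$$v(d) = \frac{1}{T} \sum_{t=1}^{T} \left( y_t(d) - \frac{1}{T} \sum_{\tau=1}^{T} y_{\tau}(d) \right)^{2}, \tag{13}$$

where $y_t(d)$ is the $d$-th component of the wideband static feature vector at frame $t$. The following likelihood function, consisting of two probability densities, one for a sequence of the wideband feature vectors and one for the GV of the wideband static feature vectors, is maximized:

$$L = \alpha \log P(Y \mid X, \hat{m}, \theta) + \log P(v(y) \mid \theta_v) \quad \text{subject to } Y = Wy, \tag{14}$$

where $P(v(y) \mid \theta_v)$ is modeled by the normal distribution. The set of model parameters $\theta_v$ consists of the mean vector $\mu^{(v)}$ and the covariance matrix $\Sigma^{(vv)}$ for the GV vector $v(y)$, which is also estimated in advance with training data. The constant $\alpha$ is a likelihood weight. In this paper, it is set to [value illegible in the source]. Note that the GV likelihood works as a penalty term for the oversmoothing.

In order to maximize the likelihood $L$ with respect to $y$, we employ a steepest descent algorithm using the first derivative,

$$\frac{\partial L}{\partial y} = \alpha \left( -W^{\top} D_{\hat{m}}^{(Y)^{-1}} W y + W^{\top} D_{\hat{m}}^{(Y)^{-1}} E_{\hat{m}}^{(Y)} \right) + \left[ v_1'^{\top}, v_2'^{\top}, \ldots, v_T'^{\top} \right]^{\top}, \tag{15}$$

$$v_t' = \left[ v_t'(1), v_t'(2), \ldots, v_t'(D) \right]^{\top}, \tag{16}$$

$$v_t'(d) = -\frac{2}{T}\, p_v^{(d)\top} \left( v(y) - \mu^{(v)} \right) \left( y_t(d) - \bar{y}(d) \right), \tag{17}$$

where $\bar{y}(d) = \frac{1}{T} \sum_{\tau=1}^{T} y_{\tau}(d)$. The vector $p_v^{(d)}$ is the $d$-th column vector of $P_v = \Sigma^{(vv)^{-1}}$.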
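Two ingredients of Section 3, the matrix W that appends dynamic features to a static sequence and the global variance of Eqs. (12) and (13), can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation. The delta window Δy_t = 0.5 (y_{t+1} - y_{t-1}) with zero padding at the utterance edges is an assumption, since the paper does not state its window. The matrix is built for one-dimensional static features only, and the function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def build_delta_matrix(num_frames: int) -> Matrix:
    """Return the 2T x T matrix W for one-dimensional static features.

    Row 2t copies the static value y_t; row 2t+1 computes the delta.
    Assumed window: delta_y_t = 0.5 * (y_{t+1} - y_{t-1}), where frames
    outside the utterance are treated as zero.
    """
    w = [[0.0] * num_frames for _ in range(2 * num_frames)]
    for t in range(num_frames):
        w[2 * t][t] = 1.0                   # static row
        if t - 1 >= 0:
            w[2 * t + 1][t - 1] = -0.5      # delta row, previous frame
        if t + 1 < num_frames:
            w[2 * t + 1][t + 1] = 0.5       # delta row, next frame
    return w


def apply_matrix(w: Matrix, y: List[float]) -> List[float]:
    """Compute Y = W y for a static sequence y."""
    return [sum(a * b for a, b in zip(row, y)) for row in w]


def global_variance(frames: List[List[float]]) -> List[float]:
    """Per-dimension variance over all frames of one utterance, Eq. (13)."""
    num_frames = len(frames)
    dim = len(frames[0])
    gv = []
    for d in range(dim):
        mean_d = sum(f[d] for f in frames) / num_frames
        gv.append(sum((f[d] - mean_d) ** 2 for f in frames) / num_frames)
    return gv


if __name__ == "__main__":
    static = [1.0, 2.0, 4.0]
    print(apply_matrix(build_delta_matrix(len(static)), static))
    print(global_variance([[1.0, 0.0], [3.0, 0.0]]))
```

For D-dimensional features the same pattern applies with each scalar entry of W replaced by a D x D identity block scaled by the window coefficient. Note that the GV divides by T rather than T - 1, following Eq. (13).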

4. Details of bandwidth extension process

Figure 1 shows the flow of the bandwidth extension.

Step 1 Extracting F0, mel-cepstrum and aperiodic component [ ] as speech features from the narrowband speech signal.

Step 2 Converting the mel-cepstrum and aperiodic component of the narrowband speech into those of the wideband speech.

Step 3 Generating STRAIGHT mixed excitation [ ] using the extracted F0 and the converted aperiodic component, and then synthesizing the estimated wideband speech with the MLSA filter [ ] based on the converted mel-cepstrum.

Step 4 The estimated wideband speech is separated into a low-band signal and a high-band signal with a low-pass filter (LPF) and a high-pass filter (HPF).

Step 5 The input narrowband speech is converted into a low-band signal with up-sampling.

Step 6 Power of the estimated high-band signal is adjusted so that power of the estimated low-band signal is equal to that of the input low-band signal.

Step 7 Reconstructing the wideband speech by adding the resulting high-band signal and the low-band signal.

Figure 1: Bandwidth extension system. "mcep" denotes the mel-cepstrum and "ap" denotes the aperiodic component.

5. Experimental evaluations

In order to demonstrate the effectiveness of the proposed method, we conducted a subjective evaluation.

5.1. Experimental conditions

We used 16 kHz sampled natural speech of 4 Japanese speakers (2 males, 2 females) as the wideband speech. The 3.4 kHz narrowband speech was prepared by down-sampling the wideband speech and then passing it through EVRC (Enhanced Variable Rate Codec) [ ]. The training data was 50 sentences from subset A of ATR's phonetically balanced sentence database. The evaluation data was 50 sentences from subset B of the same database. For narrowband, we used the 16-dimensional mel-cepstral coefficients from the mel-cepstral analysis [ ]. For wideband, we used the 24-dimensional mel-cepstral coefficients from the STRAIGHT analysis [ ]. We used the averaged aperiodic components [ ] on three frequency bands (0 to 1, 1 to 2, 2 to 4 kHz) for narrowband and those on five frequency bands (0 to 1, 1 to 2, 2 to 4, 4 to 6, 6 to 8 kHz) for wideband. The frame shift was 5 ms. The number of mixture components of the GMM for mel-cepstral conversion was set to 64. The number of mixture components of the GMM for aperiodic component conversion was set to 4. Speaker-dependent models were evaluated.

We conducted an opinion test on speech quality. The opinion score was set to a 5-point scale (5: excellent, 4: good, 3: fair, 2: poor, 1: bad). The evaluated speech samples consisted of EVRC, MMSE, ML, MLGV, and wideband natural speech (Natural). The listeners were eight Japanese adult men and women.

5.2. Experimental results

Figure 2 shows the result of the opinion test. There is no significant difference between EVRC and MMSE. On the other hand, the proposed method ML is significantly better than both EVRC and MMSE. The reconstructed wideband speech with the highest speech quality was obtained by also considering the GV in the proposed method (MLGV). An example of spectral sequences of the narrowband speech, the reconstructed speech and the natural wideband speech is shown in Figure 3. The proposed method estimates spectral envelopes considering inter-frame correlation while alleviating the oversmoothing effects.

Figure 2: Result of subjective evaluation (opinion scores with 95% confidence intervals for EVRC, MMSE, ML, MLGV and Natural).

Figure 3: An example of spectral sequences of (a) narrowband speech, (b) converted speech by the ML using the GV, and (c) natural speech, for a sentence fragment "/ jy u: j i ts u sh i t e i k u r".

6. Conclusions

We proposed bandwidth extension based on maximum likelihood estimation (MLE) with a Gaussian mixture model (GMM) considering dynamic features and the global variance (GV). A result of the subjective evaluation demonstrated that the speech quality of the narrowband speech is significantly improved by the proposed method. We plan to deal with a speaker-independent model, online processing and noise robustness.

7. Acknowledgments

This research was supported in part by the e-Society project and KDDI collaborative research.

References

[1] B. Bessette, R. Salami, R. Lefebvre, M. Jelinek, J. Rotola-Pukkila, J. Vainio, H. Mikkola and K. Jarvinen, "The adaptive multirate wideband speech codec (AMR-WB)," IEEE Trans., Vol. 10, No. 8, pp. 620-636, 2002.

[2] Y. Yoshida and M. Abe, "An Algorithm to Reconstruct Wideband Speech From Narrowband Speech Based on Codebook Mapping," Proc. ICSLP94, pp. 1591-1593, 1994.

[3] K.Y. Park and H.S. Kim, "Narrowband to wideband conversion of speech using GMM based transformation," Proc. ICSLP, pp. 1847-1850, Istanbul, June 2000.

[4] Y. Stylianou, O. Cappé and E. Moulines, "Continuous probabilistic transform for voice conversion," IEEE Trans. Speech and Audio Processing, Vol. 6, No. 2, pp. 131-142, 1998.

[5] M.L. Seltzer, A. Acero and J. Droppo, "Robust Bandwidth Extension of Noise-corrupted Narrowband Speech," Proc. ICSLP, pp. 1509-1512, 2005.

[6] S. Yao and C.F. Chan, "Block-based Bandwidth Extension of Narrowband Speech Signal by using CDHMM," Proc. ICASSP, pp. 1793-1796, 2005.

[7] S.Y. Yao and C.F. Chan, "Speech bandwidth enhancement using state space speech dynamics," Proc. ICASSP2006, pp. 1489-1492, 2006.

[8] Y. Agiomyrgiannakis and Y. Stylianou, "Combined Estimation/coding of Highband Spectral Envelopes for Speech Spectrum Expansion," Proc. ICASSP2004, pp. 469-472, 2004.

[9] T. Toda, A.W. Black and K. Tokuda, "Voice Conversion Based on Maximum-Likelihood Estimation of Spectral Parameter Trajectory," IEEE Trans. ASLP, Vol. 15, No. 8, pp. 2222-2234, 2007.

[10] A. Kain and M.W. Macon, "Spectral voice conversion for text-to-speech synthesis," Proc. ICASSP, Seattle, U.S.A., pp. 285-288, May 2004.

[11] H. Kawahara, J. Estill and O. Fujimura, "Aperiodicity extraction and control using mixed mode excitation and group delay manipulation for a high quality speech analysis, modification and synthesis system STRAIGHT," Proc. MAVEBA, Sep. 13-15, Firenze, Italy, 2001.

[12] H. Kawahara, I. Masuda-Katsuse and A. de Cheveigné, "Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds," Speech Commun., Vol. 27, No. 3-4, pp. 187-207, 1999.

[13] T. Fukada, K. Tokuda, T. Kobayashi and S. Imai, "An adaptive algorithm for mel-cepstral analysis of speech," Proc. ICASSP, Vol. 1, pp. 137-140, San Francisco, USA, Mar. 1992.

[14] T.V. Ramabadran, J.P. Ashley and M.J. McLaughlin, "Background Noise Suppression for Speech Enhancement and Coding," IEEE Workshop on Speech Coding and Tel., Pocono Manor, PA, pp. 43-44, 1997.
