Bandwidth Extension of Cellular Phone Speech Based on Maximum Likelihood Estimation with GMM


2008 International Workshop on Nonlinear Circuits and Signal Processing (NCSP'08), Gold Coast, Australia, March 6-8, 2008.

Wataru Fujitsuru†, Hidehiko Sekimoto†, Tomoki Toda†, Hiroshi Saruwatari† and Kiyohiro Shikano†
†Graduate School of Information Science, Nara Institute of Science and Technology, Nara 631-0101 Japan
E-mail: {wataru-f, hidehi-s, tomoki, sawatari, shikano}@is.naist.jp

Abstract

Bandwidth extension is a useful technique for reconstructing wideband speech from only narrowband speech. As a typical conventional method, a bandwidth extension algorithm based on minimum mean square error (MMSE) with a Gaussian mixture model (GMM) has been proposed [ ]. Although the MMSE-based method has reasonably high conversion accuracy, two problems remain: 1) inappropriate spectral movements are caused by ignoring the correlation between frames, and 2) the converted spectra are excessively smoothed by the statistical modeling. To address these problems, we propose a bandwidth extension algorithm based on maximum likelihood estimation (MLE) considering dynamic features and the global variance (GV) with a GMM. The result of a subjective test demonstrates that the proposed algorithm outperforms the conventional MMSE-based algorithm.

1. Introduction

The use of cellular phones enables us to easily communicate with each other through speech. In general, cellular phone speech is limited to a narrowband signal up to 3.4 kHz. Although narrowband speech is capable of intelligible communication, its sound quality is not high enough. Specifically, considerable quality degradation is observable at several phonemes, such as fricatives and plosives, which have important energy distribution beyond 3.4 kHz. In order to realize higher-quality speech communication, a wideband speech codec has been developed [ ]. Essentially, it needs more information than the narrowband speech codec. There is no doubt that realizing wideband speech communication without increasing the amount of information is more convenient.

Bandwidth extension is a useful technique for reconstructing wideband speech from only narrowband speech. Several statistical approaches to bandwidth extension based on a spectral mapping have been studied. A codebook mapping method [ ] is an approach based on hard clustering and discrete mapping. A reconstructed feature vector at each frame is determined by quantizing the narrowband speech feature vector to the nearest centroid vector of the narrowband codebook and substituting it with the corresponding wideband centroid vector of the mapping codebook. One of the more sophisticated approaches, allowing soft clustering and continuous mapping, is a probabilistic bandwidth extension method based on a Gaussian mixture model (GMM) [ ]. The basic mapping algorithm was originally proposed for voice conversion [ ]. Most of the conventional GMM-based methods perform the mapping based on the MMSE criterion [ , ]. It has been reported that the mapping performance is further improved by modeling dynamic characteristics of a spectral sequence [ , ]. Moreover, an approach of combining mapping and coding processes has been studied [ ].

Recently, the performance of voice conversion with a GMM was significantly improved by maximum likelihood estimation (MLE) considering dynamic features and the global variance (GV) [ ]. The same approach is expected to improve the performance of bandwidth extension. This paper proposes bandwidth extension based on MLE considering dynamic features and the GV. The effectiveness of considering dynamic features and the GV is demonstrated by a subjective evaluation.

This paper is organized as follows. Section 2 describes the conventional bandwidth extension algorithm based on a GMM. Section 3 describes the proposed bandwidth extension based on MLE. Section 4 describes the flow of the bandwidth extension. Section 5 describes an experimental evaluation. Finally, we summarize this paper in Section 6.

2. MMSE-based bandwidth extension [ ]

2.1. Training

Let $x_t$ and $y_t$ be the $D_x$-dimensional narrowband and $D_y$-dimensional wideband feature vectors at frame $t$, respectively. The joint probability density of the narrowband and wideband feature vectors is modeled by a GMM as follows:

$$P(z_t \mid \theta) = \sum_{m=1}^{M} \omega_m \, \mathcal{N}\left(z_t; \mu_m^{(z)}, \Sigma_m^{(z)}\right), \tag{1}$$

where $z_t$ is the joint vector $[x_t^\top, y_t^\top]^\top$. The notation $\top$ denotes transposition of the vector. The number of mixture components is $M$. The weight of the $m$-th mixture component is $\omega_m$. The normal distribution with mean $\mu$ and covariance $\Sigma$ is denoted as $\mathcal{N}(\cdot; \mu, \Sigma)$. The parameter set of the GMM is $\theta$, which consists of the weights, mean vectors and covariance matrices of the individual mixture components. The mean vector $\mu_m^{(z)}$ and the covariance matrix $\Sigma_m^{(z)}$ of the $m$-th mixture component are written as

$$\mu_m^{(z)} = \begin{bmatrix} \mu_m^{(x)} \\ \mu_m^{(y)} \end{bmatrix}, \qquad \Sigma_m^{(z)} = \begin{bmatrix} \Sigma_m^{(xx)} & \Sigma_m^{(xy)} \\ \Sigma_m^{(yx)} & \Sigma_m^{(yy)} \end{bmatrix}, \tag{2}$$

where $\mu_m^{(x)}$ and $\mu_m^{(y)}$ are the mean vectors of the $m$-th mixture component for the narrowband and for the wideband, respectively, and $\Sigma_m^{(xx)}$ and $\Sigma_m^{(yy)}$ are the covariance matrices of the $m$-th mixture component for the narrowband and for the wideband, respectively. The matrix $\Sigma_m^{(yx)} = \Sigma_m^{(xy)\top}$ is the cross-covariance matrix of the $m$-th mixture component between the narrowband and the wideband.
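As a concrete reading of the joint density in Eq. (1), the following minimal sketch evaluates a joint GMM for the simplest case of scalar features ($D_x = D_y = 1$), so that each component has a 2x2 covariance with the block layout of Eq. (2). The function names and the toy parameter values are illustrative assumptions and do not come from the paper; a real system would use the mel-cepstral dimensions given in the experimental conditions and parameters trained with the EM algorithm.

```python
import math


def gauss2(z, mu, cov):
    """Density of a 2-D normal distribution N(z; mu, cov) with a full 2x2 covariance."""
    (sxx, sxy), (syx, syy) = cov
    det = sxx * syy - sxy * syx
    dx, dy = z[0] - mu[0], z[1] - mu[1]
    # Quadratic form (z - mu)^T cov^{-1} (z - mu), using the closed-form 2x2 inverse.
    quad = (syy * dx * dx - (sxy + syx) * dx * dy + sxx * dy * dy) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))


def joint_gmm_density(x, y, weights, means, covs):
    """Eq. (1): P(z | theta) = sum_m w_m N(z; mu_m, Sigma_m), with joint vector z = [x, y]."""
    z = (x, y)
    return sum(w * gauss2(z, mu, cov) for w, mu, cov in zip(weights, means, covs))


# Toy parameters (hypothetical, not taken from the paper): two mixture components.
weights = [0.6, 0.4]
means = [(0.0, 0.0), (2.0, 3.0)]
# Each covariance is [[S_xx, S_xy], [S_yx, S_yy]], mirroring the blocks of Eq. (2).
covs = [((1.0, 0.5), (0.5, 1.0)), ((1.0, -0.3), (-0.3, 2.0))]

print(joint_gmm_density(0.5, 0.2, weights, means, covs))
```

The off-diagonal entries play the role of the cross-covariance blocks: they are what later allows the wideband part to be predicted from the narrowband part.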

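The GMM is trained with the EM algorithm using the joint vectors in a training set. This training method estimates the model parameters more robustly than least squares estimation [ ].

2.2. MMSE-based conversion

The conventional method performs the conversion based on MMSE as follows:

$$\hat{y}_t = E[y_t \mid x_t] = \int P(y_t \mid x_t, \theta)\, y_t \, dy_t = \sum_{m=1}^{M} P(m \mid x_t, \theta)\, E_{m,t}^{(y)}, \tag{3}$$

where

$$P(m \mid x_t, \theta) = \frac{\omega_m \, \mathcal{N}\left(x_t; \mu_m^{(x)}, \Sigma_m^{(xx)}\right)}{\sum_{j=1}^{M} \omega_j \, \mathcal{N}\left(x_t; \mu_j^{(x)}, \Sigma_j^{(xx)}\right)}, \tag{4}$$

$$E_{m,t}^{(y)} = \mu_m^{(y)} + \Sigma_m^{(yx)} \left(\Sigma_m^{(xx)}\right)^{-1} \left(x_t - \mu_m^{(x)}\right). \tag{5}$$

2.3. Problems

Although MMSE has reasonably high conversion accuracy, two problems remain: 1) inappropriate spectral movements are caused by ignoring the correlation between frames, and 2) the converted spectra are excessively smoothed by the statistical modeling.

3. MLE-based bandwidth extension

In order to solve the two main problems of the conventional method, we propose bandwidth extension based on MLE considering dynamic features and the GV. Inter-frame correlation is considered in order to obtain a converted parameter trajectory with appropriate dynamic characteristics. The over-smoothing effects are alleviated by considering the GV, which captures one of the characteristics of the trajectory.

3.1. MLE considering dynamic features (ML) [ ]

We use the $2D_x$-dimensional narrowband speech feature vector $X_t = [x_t^\top, \Delta x_t^\top]^\top$ and the $2D_y$-dimensional wideband speech feature vector $Y_t = [y_t^\top, \Delta y_t^\top]^\top$, consisting of static and dynamic features at frame $t$. Their time sequences are written as $X = [X_1^\top, X_2^\top, \ldots, X_T^\top]^\top$ and $Y = [Y_1^\top, Y_2^\top, \ldots, Y_T^\top]^\top$, respectively. A time sequence of the converted static feature vectors $\hat{y} = [\hat{y}_1^\top, \hat{y}_2^\top, \ldots, \hat{y}_T^\top]^\top$ is determined as follows:

$$\hat{y} = \arg\max_{y} P(Y \mid X, \Theta) \quad \text{subject to } Y = Wy, \tag{6}$$

where $W$ is a conversion matrix that extends a sequence of the static feature vectors $y$ into that of the static and dynamic feature vectors. In order to effectively reduce the computational cost, we approximate the likelihood as follows:

$$P(Y \mid X, \Theta) \simeq P(m \mid X, \Theta)\, P(Y \mid X, m, \Theta), \tag{7}$$

where $m$ is a mixture component sequence $[m_1, m_2, \ldots, m_T]$. After determining the sub-optimum mixture sequence, written as

$$\hat{m} = \arg\max_{m} P(m \mid X, \Theta), \tag{8}$$

we determine $\hat{y}$ that maximizes the approximated likelihood function as follows:

$$\begin{aligned} \hat{y} &= \arg\max_{y} P(\hat{m} \mid X, \Theta)\, P(Y \mid X, \hat{m}, \Theta) \quad \text{subject to } Y = Wy \\ &= \left(W^\top {D_{\hat{m}}^{(Y)}}^{-1} W\right)^{-1} W^\top {D_{\hat{m}}^{(Y)}}^{-1} E_{\hat{m}}^{(Y)}, \end{aligned} \tag{9}$$

where

$$E_{\hat{m}}^{(Y)} = \left[E_{\hat{m}_1,1}^{(Y)\top}, \ldots, E_{\hat{m}_t,t}^{(Y)\top}, \ldots, E_{\hat{m}_T,T}^{(Y)\top}\right]^\top, \tag{10}$$

$${D_{\hat{m}}^{(Y)}}^{-1} = \operatorname{diag}\left[{D_{\hat{m}_1}^{(Y)}}^{-1}, \ldots, {D_{\hat{m}_t}^{(Y)}}^{-1}, \ldots, {D_{\hat{m}_T}^{(Y)}}^{-1}\right]. \tag{11}$$

Here $E_{m,t}^{(Y)}$ is the conditional mean of Eq. (5) computed with the static and dynamic features, and $D_m^{(Y)}$ denotes the corresponding conditional covariance matrix of $Y_t$ given $X_t$ for the $m$-th mixture component. The GMM parameter set $\Theta$ is estimated in advance with training data in the same manner as in the conventional method.

3.2. MLE considering GV (MLGV) [ ]

The GV of the static feature vectors in each utterance is written as

$$v(y) = [v(1), v(2), \ldots, v(D)]^\top, \tag{12}$$

$$v(d) = \frac{1}{T} \sum_{t=1}^{T} \left( y_t(d) - \frac{1}{T} \sum_{\tau=1}^{T} y_\tau(d) \right)^2, \tag{13}$$

where $y_t(d)$ is the $d$-th component of the wideband static feature vector at frame $t$.

The following likelihood function, consisting of two probability densities, one for the sequence of the wideband feature vectors and one for the GV of the wideband static feature vectors, is maximized:

$$L = \alpha \log P(Y \mid X, \hat{m}, \Theta) + \log P(v(y) \mid \Theta_v) \quad \text{subject to } Y = Wy, \tag{14}$$

where $P(v(y) \mid \Theta_v)$ is modeled by the normal distribution. The set of model parameters $\Theta_v$ consists of the mean vector $\mu^{(v)}$ and the covariance matrix $\Sigma^{(vv)}$ for the GV vector $v(y)$, which is also estimated in advance with training data. The constant $\alpha$ is a likelihood weight. In this paper, it is set to [...]. Note that the GV likelihood works as a penalty term for the over-smoothing.

In order to maximize the likelihood $L$ with respect to $y$, we employ a steepest descent algorithm using the first derivative,

$$\frac{\partial L}{\partial y} = \alpha \left( -W^\top {D_{\hat{m}}^{(Y)}}^{-1} W y + W^\top {D_{\hat{m}}^{(Y)}}^{-1} E_{\hat{m}}^{(Y)} \right) + \left[ v_1'^\top, v_2'^\top, \ldots, v_T'^\top \right]^\top, \tag{15}$$

$$v_t' = \left[ v_t'(1), v_t'(2), \ldots, v_t'(D) \right]^\top, \tag{16}$$

$$v_t'(d) = -\frac{2}{T}\, p_v^{(d)\top} \left( v(y) - \mu^{(v)} \right) \left( y_t(d) - \frac{1}{T} \sum_{\tau=1}^{T} y_\tau(d) \right), \tag{17}$$

where the vector $p_v^{(d)}$ is the $d$-th column vector of $P_v = {\Sigma^{(vv)}}^{-1}$.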
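The closed-form solution in Eq. (9) and the GV in Eq. (13) can be illustrated with a small sketch for a one-dimensional static feature. Several points are assumptions made only for this illustration and are not stated in the paper: the dynamic feature is taken as $\Delta y_t = 0.5\,(y_{t+1} - y_{t-1})$ with frame indices clamped at the utterance edges, the conditional covariances are diagonal so that $D^{-1}$ reduces to a list of precisions, and the helper names `build_w`, `solve`, `ml_trajectory` and `global_variance` are hypothetical.

```python
def build_w(num_frames):
    """W (2T x T): maps static y to the interleaved sequence [y_1, dy_1, ..., y_T, dy_T]."""
    rows = []
    for t in range(num_frames):
        static = [0.0] * num_frames
        static[t] = 1.0
        delta = [0.0] * num_frames
        prev_t = max(t - 1, 0)                # clamp at the edges (assumption)
        next_t = min(t + 1, num_frames - 1)
        delta[next_t] += 0.5
        delta[prev_t] -= 0.5
        rows.append(static)
        rows.append(delta)
    return rows


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def ml_trajectory(targets, precisions):
    """Eq. (9): solve (W^T D^-1 W) y = W^T D^-1 E.

    targets:    E as a flat list [E_static_1, E_delta_1, ..., E_static_T, E_delta_T]
    precisions: diagonal of D^-1 in the same layout (diagonal covariance assumption)
    """
    num_frames = len(targets) // 2
    w = build_w(num_frames)
    rows = range(2 * num_frames)
    a = [[sum(w[k][i] * precisions[k] * w[k][j] for k in rows)
          for j in range(num_frames)] for i in range(num_frames)]
    b = [sum(w[k][i] * precisions[k] * targets[k] for k in rows)
         for i in range(num_frames)]
    return solve(a, b)


def global_variance(y):
    """Eq. (13) for a one-dimensional static feature sequence."""
    mean = sum(y) / len(y)
    return sum((v - mean) ** 2 for v in y) / len(y)


# Frame-wise static targets jump from 0 to 1; every delta target is 0.
targets = [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0]
trajectory = ml_trajectory(targets, [1.0] * 8)
print(trajectory)
print(global_variance([0.0, 0.0, 1.0, 1.0]), global_variance(trajectory))
```

In this toy case the dynamic-feature constraint pulls the trajectory toward its mean, so its GV is smaller than that of the frame-wise static targets. This shrinkage is the effect that the GV term in Eq. (14) penalizes. The sketch covers only the ML solution; the gradient-based MLGV update of Eqs. (15) to (17) is not implemented here.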

4. Details of bandwidth extension process

Figure 1 shows the flow of the bandwidth extension.

Step 1 Extract F0, mel-cepstrum and aperiodic component [ ] as speech features from the narrowband speech signal.
Step 2 Convert the mel-cepstrum and aperiodic component of the narrowband speech into those of the wideband speech.
Step 3 Generate the STRAIGHT mixed excitation [ ] using the extracted F0 and the converted aperiodic component, and then synthesize the estimated wideband speech with the MLSA filter [ ] based on the converted mel-cepstrum.
Step 4 Separate the estimated wideband speech into a low-band signal and a high-band signal with a low-pass filter (LPF) and a high-pass filter (HPF).
Step 5 Convert the input narrowband speech into a low-band signal by up-sampling.
Step 6 Adjust the power of the estimated high-band signal so that the power of the estimated low-band signal is equal to that of the input low-band signal.
Step 7 Reconstruct the wideband speech by adding the resulting high-band signal and the low-band signal.

[Figure 1: Bandwidth extension system. "mcep" denotes the mel-cepstrum and "ap" denotes the aperiodic component. Legible block labels: (1) Feature extraction, (2) Feature conversion, (5) Up sampling.]

5. Experimental evaluations

In order to demonstrate the effectiveness of the proposed method, we conducted a subjective evaluation.

5.1. Experimental conditions

We used 16 kHz sampled natural speech of 4 Japanese speakers (2 males, 2 females) as the wideband speech. The 3.4 kHz narrowband speech was prepared by down-sampling the wideband speech and then passing it through EVRC (Enhanced Variable Rate Codec) [ ]. The training data was 50 sentences from subset A of ATR's phonetically balanced sentence database. The evaluation data was 50 sentences from subset B of the same database. For narrowband, we used the 16-dimensional mel-cepstral coefficients from the mel-cepstral analysis [ ]. For wideband, we used the 24-dimensional mel-cepstral coefficients from the STRAIGHT analysis [ ]. We used the averaged aperiodic components [ ] on three frequency bands (0 to 1, 1 to 2, 2 to 4 kHz) for narrowband and those on five frequency bands (0 to 1, 1 to 2, 2 to 4, 4 to 6, 6 to 8 kHz) for wideband. The frame shift was 5 ms. The number of mixture components of the GMM for mel-cepstral conversion was set to 64. The number of mixture components of the GMM for aperiodic component conversion was set to 4. Speaker-dependent models were evaluated.

We conducted an opinion test on speech quality. The opinion score was set to a 5-point scale (5: excellent, 4: good, 3: fair, 2: poor, 1: bad). The evaluated speech samples consisted of EVRC, MMSE, ML, MLGV, and wideband natural speech (Natural). The listeners were eight Japanese adult men and women.

5.2. Experimental results

Figure 2 shows the result of the opinion test. There is no significant difference between EVRC and MMSE. On the other hand, the proposed method ML is significantly better than both EVRC and MMSE. The reconstructed wideband speech with the highest speech quality was obtained by also considering the GV in the proposed method (MLGV). An example of spectral sequences of the narrowband speech, the reconstructed speech and the natural wideband speech is shown in Figure 3. The proposed method estimates spectral envelopes considering inter-frame correlation while alleviating the over-smoothing effects.

[Figure 2: Result of subjective evaluation. Opinion scores with 95% confidence intervals for EVRC, MMSE, ML, MLGV and Natural.]
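Steps 6 and 7 of the bandwidth extension process in Section 4 can be sketched as follows. The reading of Step 6 used here is an assumption: a single gain is chosen so that the estimated low-band power matches the input low-band power, and that gain is applied to the estimated high-band signal before the two bands are added. The function names are hypothetical, and the signals are assumed to be time-aligned lists of samples that have already passed through the LPF, HPF and up-sampling stages of Steps 4 and 5.

```python
import math


def mean_power(signal):
    """Average power of a sampled signal."""
    return sum(s * s for s in signal) / len(signal)


def reconstruct_wideband(input_low, est_low, est_high):
    """Steps 6 and 7: scale the estimated high band, then add it to the input low band.

    input_low: up-sampled input narrowband speech (Step 5)
    est_low:   low band of the estimated wideband speech (Step 4, LPF output)
    est_high:  high band of the estimated wideband speech (Step 4, HPF output)
    """
    est_power = mean_power(est_low)
    if est_power == 0.0:
        raise ValueError("estimated low-band signal has zero power")
    # Gain that would make the estimated low band as strong as the input low band
    # (assumed interpretation of Step 6).
    gain = math.sqrt(mean_power(input_low) / est_power)
    return [lo + gain * hi for lo, hi in zip(input_low, est_high)]


# Toy signals (hypothetical): the estimate is half as loud as the input.
print(reconstruct_wideband([2.0, -2.0, 2.0, -2.0],
                           [1.0, -1.0, 1.0, -1.0],
                           [0.5, 0.5, -0.5, -0.5]))
```

Keeping the input low band untouched and only borrowing the high band from the synthesized speech means that conversion errors cannot degrade the part of the spectrum that was actually transmitted.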

[Figure 3: An example of spectral sequences of (a) narrowband speech, (b) converted speech by the ML using the GV, and (c) natural speech, for a sentence fragment "/ jy u: j i ts U sh 1 t e i k u r]

6. Conclusions

We proposed bandwidth extension based on maximum likelihood estimation (MLE) with a Gaussian mixture model (GMM) considering dynamic features and the global variance (GV). The result of the subjective evaluation demonstrated that the speech quality of the narrowband speech is significantly improved by the proposed method. We plan to deal with a speaker-independent model, online processing and noise robustness.

7. Acknowledgments

This research was supported in part by e-Society project and KDDI collaborative research.

References

[1] B. Bessette, R. Salami, R. Lefebvre, M. Jelinek, J. Rotola-Pukkila, J. Vainio, H. Mikkola and K. Jarvinen, "The adaptive multirate wideband speech codec (AMR-WB)," IEEE Trans., Vol. 10, No. 8, pp. 620-636, 2002.

[2] Y. Yoshida and M. Abe, "An Algorithm to Reconstruct Wideband Speech From Narrowband Speech Based on Codebook Mapping," Proc. ICSLP94, pp. 1591-1593, 1994.

[3] K.Y. Park and H.S. Kim, "Narrowband to wideband conversion of speech using GMM based transformation," Proc. ICSLP, pp. 1847-1850, Istanbul, June 2000.

[4] Y. Stylianou, O. Cappé and E. Moulines, "Continuous probabilistic transform for voice conversion," IEEE Trans. Speech and Audio Processing, Vol. 6, No. 2, pp. 131-142, 1998.

[5] M.L. Seltzer, A. Acero and J. Droppo, "Robust Bandwidth Extension of Noise-corrupted Narrowband Speech," Proc. ICSLP, pp. 1509-1512, 2005.

[6] S. Yao and C.F. Chan, "Block-based Bandwidth Extension of Narrowband Speech Signal by using CDHMM," Proc. ICASSP, pp. 1793-1796, 2005.

[7] S.Y. Yao and C.F. Chan, "Speech bandwidth enhancement using state space speech dynamics," Proc. ICASSP2006, pp. 1489-1492, 2006.

[8] Y. Agiomyrgiannakis and Y. Stylianou, "Combined Estimation/coding of Highband Spectral Envelopes for Speech Spectrum Expansion," Proc. ICASSP2004, pp. 469-472, 2004.

[9] T. Toda, A.W. Black and K. Tokuda, "Voice Conversion Based on Maximum-Likelihood Estimation of Spectral Parameter Trajectory," IEEE Trans. ASLP, Vol. 15, No. 8, pp. 2222-2234, 2007.

[10] A. Kain and M.W. Macon, "Spectral voice conversion for text-to-speech synthesis," Proc. ICASSP, Seattle, U.S.A., pp. 285-288, May 2004.

[11] H. Kawahara, J. Estill and O. Fujimura, "Aperiodicity extraction and control using mixed mode excitation and group delay manipulation for a high quality speech analysis, modification and synthesis system STRAIGHT," Proc. MAVEBA, Sep. 13-15, Firenze, Italy, 2001.

[12] H. Kawahara, I. Masuda-Katsuse and A. de Cheveigné, "Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds," Speech Commun., Vol. 27, No. 3-4, pp. 187-207, 1999.

[13] T. Fukada, K. Tokuda, T. Kobayashi and S. Imai, "An adaptive algorithm for mel-cepstral analysis of speech," Proc. ICASSP, Vol. 1, pp. 137-140, San Francisco, USA, Mar. 1992.

[14] T.V. Ramabadran, J.P. Ashley and M.J. McLaughlin, "Background Noise Suppression for Speech Enhancement and Coding," IEEE Workshop on Speech Coding and Telecommunications, Pocono Manor, PA, pp. 43-44, 1997.
