Simultaneous Conversion of Duration and Spectrum Based on Statistical Models Including Time-Sequence Matching

Kaori Yutani (1), Yosuke Uto (1), Yoshihiko Nankaku (1), Tomoki Toda (2), Keiichi Tokuda (1)

(1) Nagoya Institute of Technology, Gokiso-cho, Showa-ku, Nagoya, Aichi, 466-8555 Japan
(2) Nara Institute of Science and Technology, 8916 Takayama, Ikoma, Nara, 630-0101 Japan

{yutani, uto, nankaku, ri, tokuda}@sp.nitech.ac.jp, [email protected]

Copyright © 2008 ISCA. Accepted after peer review of full paper. September 22-26, Brisbane, Australia.

Abstract

This paper describes a technique for simultaneous conversion of duration and spectrum based on a statistical model including time-sequence matching. Conventional GMM-based approaches cannot perform spectral conversion that takes the speaking rate into account, because they assume one-to-one frame matching between source and target features. However, speaker characteristics may also appear in speaking rates. In order to perform duration conversion, we attach duration models to statistical models including time-sequence matching (DPGMM). Since DPGMM can represent two sequences of different lengths directly, the conversion of spectrum and duration can be performed within an integrated framework. In the proposed technique, each mixture component of DPGMM has a different duration transformation function; therefore, durations are converted nonlinearly and dependently on spectral information. In the subjective DMOS test, the proposed method is superior to the conventional method.

Index Terms: voice conversion, GMM, duration conversion.

1. Introduction

Voice conversion is a technique for converting a certain speaker's voice into another speaker's voice. It can modify speech characteristics using conversion rules statistically extracted from a small amount of data [1]. One of the typical spectral conversion frameworks is based on a Gaussian Mixture Model (GMM) [2]. This method realizes a continuous mapping based on soft clustering. A more accurate formulation of spectral conversion based on the ML (Maximum Likelihood) criterion has been presented [3]. The ML-based conversion is a sophisticated technique because all processes in the algorithm are derived from a single objective function.

In this conventional GMM-based method, GMMs are trained under the assumption that source and target feature sequences have the same length, because GMMs are trained using joint feature vectors which are references of mapping rules, and the Dynamic Programming (DP) matching between source and target feature sequences is conducted prior to the training of GMMs. Because of this, the method cannot take account of the correlation of duration between source and target features. To overcome this problem, we apply statistical models including time-sequence matching (DPGMM) [4]. The likelihood function of this model can directly deal with two sequences of different lengths, in which a frame alignment between the two sequences is represented by discrete hidden variables. It can model duration correlations between source and target features. In the proposed voice conversion technique, we can convert a speaking rate nonlinearly and dependently on spectral information by attaching duration models to each mixture of DPGMM.

The paper is organized as follows. Sections 2 and 3 explain the conventional voice conversion techniques based on GMM and DPGMM, respectively. A method of duration and spectrum conversion is presented in Section 4, and experimental results are reported in Section 5. Finally, conclusions and future work are given in Section 6.

2. Spectral Conversion Based on GMM

To convert spectral feature sequences of a source speaker to those of a target speaker, the joint probability of the two features is modeled by a GMM [3]. Let a vector $O_t = [O_t^{(1)\top}, O_t^{(2)\top}]^\top$ be a joint feature vector of the source one $O_t^{(1)}$ and the target one $O_t^{(2)}$ at time $t$, where $\top$ denotes transposition of the vector. An alignment between the two feature sequences is obtained by Dynamic Programming (DP) matching. In GMM-based voice conversion, the vector sequence $O = [O_1^\top, O_2^\top, \ldots, O_T^\top]^\top$ is modeled by a GMM $\lambda$ to capture the correlation between source and target features. The output probability of $O$ given the GMM $\lambda$ can be written as follows:

$$ P(O \mid \lambda) = \prod_{t=1}^{T} \sum_{i=1}^{M} w_i \, \mathcal{N}(O_t; \mu_i, \Sigma_i) \tag{1} $$

where

$$ \mu_i = \begin{bmatrix} \mu_i^{(1)} \\ \mu_i^{(2)} \end{bmatrix}, \qquad \Sigma_i = \begin{bmatrix} \Sigma_i^{(1,1)} & \Sigma_i^{(1,2)} \\ \Sigma_i^{(2,1)} & \Sigma_i^{(2,2)} \end{bmatrix} \tag{2} $$

and $M$ is the number of mixtures, $w_i = P(i \mid \lambda)$ is the mixture weight of the $i$-th component, and $\mu_i$ and $\Sigma_i$ are the mean vector and covariance matrix, respectively. These model parameters are estimated via the Expectation Maximization (EM) algorithm.

2.1. Maximum Likelihood Spectral Conversion

In the spectral conversion, given a source feature sequence $O^{(1)} = [O_1^{(1)\top}, O_2^{(1)\top}, \ldots, O_T^{(1)\top}]^\top$, the optimal converted feature sequence $\hat{O}^{(2)} = [\hat{O}_1^{(2)\top}, \hat{O}_2^{(2)\top}, \ldots, \hat{O}_T^{(2)\top}]^\top$ is obtained by maximizing the following conditional distribution:

$$ P(O^{(2)} \mid O^{(1)}, \lambda) = \sum_{m} \prod_{t=1}^{T} \left[ P(m_t \mid O_t^{(1)}, \lambda) \, P(O_t^{(2)} \mid O_t^{(1)}, m_t, \lambda) \right] \tag{3} $$

where $m = [m_1, m_2, \ldots, m_T]$ is a mixture index sequence. The conditional distribution can also be written as a GMM, and its output probability distribution is given as follows:

$$ P(O_t^{(2)} \mid O_t^{(1)}, m_t = i, \lambda) = \mathcal{N}(O_t^{(2)}; E_i(t), D_i) \tag{4} $$

where

$$ E_i(t) = \mu_i^{(2)} + \Sigma_i^{(2,1)} \Sigma_i^{(1,1)-1} (O_t^{(1)} - \mu_i^{(1)}) \tag{5} $$

$$ D_i = \Sigma_i^{(2,2)} - \Sigma_i^{(2,1)} \Sigma_i^{(1,1)-1} \Sigma_i^{(1,2)} \tag{6} $$

Since Eq. (3) includes latent variables, the optimal sequence $\hat{O}^{(2)}$ is estimated via the EM algorithm. The EM algorithm is an iterative method for approximating the maximum likelihood estimate. It maximizes the expectation of the complete data log-likelihood, the so-called Q-function (auxiliary function):

$$ Q(O^{(2)}, \hat{O}^{(2)}) = \sum_{m} P(m \mid O^{(1)}, O^{(2)}, \lambda) \ln P(\hat{O}^{(2)}, m \mid O^{(1)}, \lambda) \tag{7} $$

Taking the derivative of the Q-function, the spectral sequence $\hat{O}^{(2)}$ which maximizes the Q-function is given by

$$ \hat{O}^{(2)} = \left( \overline{D^{-1}} \right)^{-1} \overline{D^{-1}E} \tag{8} $$

where

$$ \overline{D^{-1}} = \mathrm{diag}\left[ \overline{D_1^{-1}}, \overline{D_2^{-1}}, \ldots, \overline{D_T^{-1}} \right] \tag{9} $$

$$ \overline{D^{-1}E} = \left[ \overline{D^{-1}E_1}^\top, \overline{D^{-1}E_2}^\top, \ldots, \overline{D^{-1}E_T}^\top \right]^\top \tag{10} $$

$$ \overline{D_t^{-1}} = \sum_{i=1}^{M} \gamma_t(i) D_i^{-1} \tag{11} $$

$$ \overline{D^{-1}E_t} = \sum_{i=1}^{M} \gamma_t(i) D_i^{-1} E_i(t) \tag{12} $$

$$ \gamma_t(i) = P(m_t = i \mid O_t^{(1)}, O_t^{(2)}, \lambda) \tag{13} $$

3. Spectral Conversion Based on DPGMM

In the DPGMM-based method, we define the likelihood function $P(O^{(1)}, O^{(2)} \mid \lambda)$ including the structure of sequence matching. DP matching and the training of model parameters are optimized simultaneously based on the ML criterion. The advantage of the DPGMM is that it directly represents two sequences of different lengths. The likelihood function of the observation sequences $O = \{O^{(1)}, O^{(2)}\}$ is written as follows:

$$ P(O \mid \lambda) = \sum_{m, a} \left[ P(m \mid \lambda) \, P(O^{(1)} \mid m, \lambda) \, P(a \mid \lambda) \, P(O^{(2)} \mid O^{(1)}, m, a, \lambda) \right] \tag{14} $$

where $m = [m_1, m_2, \ldots, m_{T^{(1)}}]$ is a mixture index sequence and its element $m_{t^{(1)}}$ is the mixture index of the observation at time $t^{(1)}$. The variable $a = [a_1, a_2, \ldots, a_{T^{(2)}}]$ represents the temporal matching between the source and target feature sequences, and $a_{t^{(2)}} \in \{1, \ldots, T^{(1)}\}$ indicates the frame number of the source sequence $O^{(1)}$ which corresponds to the $t^{(2)}$-th frame of the target sequence $O^{(2)}$. Each element of the complete data likelihood is defined as follows:

$$ P(m \mid \lambda) = \prod_{t^{(1)}=1}^{T^{(1)}} P(m_{t^{(1)}} \mid \lambda) \tag{15} $$

$$ P(O^{(1)} \mid m, \lambda) = \prod_{t^{(1)}=1}^{T^{(1)}} \mathcal{N}\left( O_{t^{(1)}}^{(1)}; \mu_{m_{t^{(1)}}}^{(1)}, \Sigma_{m_{t^{(1)}}}^{(1)} \right) \tag{16} $$

$$ P(a \mid \lambda) = \prod_{t^{(2)}=1}^{T^{(2)}} P(a_{t^{(2)}} \mid a_{t^{(2)}-1}, \lambda) \tag{17} $$

$$ P(O^{(2)} \mid O^{(1)}, m, a, \lambda) = \prod_{t^{(2)}=1}^{T^{(2)}} \mathcal{N}\left( O_{t^{(2)}}^{(2)}; C_{m_{a_{t^{(2)}}}} \tilde{O}_{a_{t^{(2)}}}^{(1)}, \tilde{\Sigma}_{m_{a_{t^{(2)}}}} \right) \tag{18} $$

where $C_i$ is the transformation matrix of the $i$-th mixture, applied to the augmented source vector

$$ \tilde{O}_{t^{(1)}}^{(1)} = \left[ 1, O_{t^{(1)}}^{(1)\top} \right]^\top \tag{19} $$

The model parameters of DPGMM are summarized as follows:

• $w = \{w_i \mid 1 \le i \le M\}$: the mixture weights of the GMM which generates the source feature sequence $O^{(1)}$, where $w_i = P(m_{t^{(1)}} = i \mid \lambda)$ is the probability of the $i$-th mixture.

• $C = \{c_n \mid 1 \le n \le N\}$: the transition probabilities of the sequence matching, where $c_n$ indicates the probability $P(a_{t^{(2)}} = a_{t^{(2)}-1} + n \mid a_{t^{(2)}-1})$. This parameter corresponds to the cost function in DP matching.

• $B^{(1)} = \{b_i^{(1)} \mid 1 \le i \le M\}$: the output probability distributions of the source feature sequence $O^{(1)}$, where $b_i^{(1)} = P(O_{t^{(1)}}^{(1)} \mid m_{t^{(1)}} = i, \lambda)$ is the probability of the source feature vector $O_{t^{(1)}}^{(1)}$ at the $i$-th mixture, which is assumed to be a Gaussian distribution, where $\mu_i^{(1)}$ and $\Sigma_i^{(1)}$ are the mean vector and covariance matrix, respectively.

• $B^{(2)} = \{b_i^{(2)} \mid 1 \le i \le M\}$: the output distributions of the target feature sequence $O^{(2)}$, where $b_i^{(2)} = P(O_{t^{(2)}}^{(2)} \mid O_{a_{t^{(2)}}}^{(1)}, m_{a_{t^{(2)}}} = i, \lambda)$ is the probability of the target feature vector $O_{t^{(2)}}^{(2)}$ given the corresponding source feature vector $O_{a_{t^{(2)}}}^{(1)}$ at the $i$-th mixture. This conditional distribution is assumed to be a Gaussian distribution, where $C_i \tilde{O}_{a_{t^{(2)}}}^{(1)}$ and $\tilde{\Sigma}_i$ are the mean vector and the covariance matrix, respectively.

Using shorthand notation, the model is defined as $\lambda = \{w, C, B^{(1)}, B^{(2)}\}$. Figure 1 shows the model structure including time-sequence matching. The generative procedure is summarized as follows:

1. A mixture index sequence $m$ is determined according to the weight $P(m \mid \lambda)$.

2. A source feature sequence $O^{(1)}$ is generated from the Gaussian distribution $P(O^{(1)} \mid m, \lambda)$.

[Figure 1: Model structure including time-sequence matching.]

3. The frame matching between $O^{(1)}$ and $O^{(2)}$ is determined according to $P(a \mid \lambda)$.

4. The target feature sequence $O^{(2)}$ is generated according to the conditional Gaussian distribution given the source feature sequence, $P(O^{(2)} \mid O^{(1)}, m, a, \lambda)$.

The parameters of DPGMM can be estimated via the variational EM algorithm [4]. In the conversion process, the converted feature sequence can be obtained by maximizing a lower bound of the likelihood. The optimal sequence is given by the following equation:

$$ \hat{O}_{t^{(2)}}^{(2)} = \left( \sum_{t^{(1)}=1}^{T^{(1)}} \sum_{i=1}^{M} \bar{\gamma}_{t^{(1)}}(i) \, \delta(t^{(1)}, t^{(2)}) \, \tilde{\Sigma}_i^{-1} \right)^{-1} \times \left( \sum_{t^{(1)}=1}^{T^{(1)}} \sum_{i=1}^{M} \bar{\gamma}_{t^{(1)}}(i) \, \delta(t^{(1)}, t^{(2)}) \, \tilde{\Sigma}_i^{-1} C_i \tilde{O}_{t^{(1)}}^{(1)} \right) \tag{20} $$

where $\delta(\cdot)$ is the Kronecker delta function: $\delta(u, v) = 1$ if $u = v$, and $\delta(u, v) = 0$ otherwise, and $\bar{\gamma}_{t^{(1)}}(i)$ denotes the expectation of a mixture index $m_{t^{(1)}}$ with respect to the posterior distribution:

$$ \bar{\gamma}_{t^{(1)}}(i) = P(m_{t^{(1)}} = i \mid O^{(1)}, O^{(2)}, \lambda) = \sum_{m, a} P(m, a \mid O^{(1)}, O^{(2)}, \lambda) \, \delta(m_{t^{(1)}}, i) \tag{21} $$

Although the DPGMM can represent source and target feature sequences of different lengths, one-to-one frame matching is assumed in the conversion process (Eq. (20)), because the Markovian transition probability $P(a \mid \lambda)$ is insufficient to convert durations.

4. Simultaneous Conversion of Duration and Spectrum

To convert a speaking rate, we define duration models attached to each mixture of DPGMM. The duration of the $s$-th segment is represented by a joint duration vector $d_s = [d_s^{(1)}, d_s^{(2)}]^\top$ which consists of the source duration $d_s^{(1)}$ and the target duration $d_s^{(2)}$. A segment is a period in which the same mixture component continues. Duration models are represented by 2-dimensional Gaussian distributions $\mathcal{N}(d_s \mid \nu_i, \Phi_i)$ with the mean vector $\nu_i$ and the covariance matrix $\Phi_i$, and each component of these parameters is defined as follows:

$$ \nu_i = \begin{bmatrix} \nu_i^{(1)} \\ \nu_i^{(2)} \end{bmatrix}, \qquad \Phi_i = \begin{bmatrix} \Phi_i^{(1,1)} & \Phi_i^{(1,2)} \\ \Phi_i^{(2,1)} & \Phi_i^{(2,2)} \end{bmatrix} \tag{22} $$

Figure 2 shows an overview of training duration models, and the procedure is summarized as follows:

[Figure 2: Training of duration models.]

1. Determine the mixture index sequence $m$ and the frame matching $a$ so as to maximize the posterior probability $P(m, a \mid O^{(1)}, O^{(2)}, \lambda)$.

2. Generate duration vectors $d_s$, $s = 1, \ldots, S$, from $m$ and $a$ obtained in step 1.

3. Estimate duration models for each mixture component using the corresponding duration vectors.

The simultaneous conversion of duration and spectrum is performed based on DPGMM with duration models. An overview of duration conversion is shown in Figure 3, and the procedure is summarized as follows:

[Figure 3: Duration conversion.]

1. Determine the mixture index sequence $\hat{m}$ which maximizes the posterior probability $P(m \mid O^{(1)}, \lambda)$ given an input feature sequence.

2. Extract the source duration $\hat{d}^{(1)}$ from the mixture index sequence $\hat{m}$ and convert it into the target duration $\hat{d}^{(2)}$ using the following equation, where $i$ is the mixture index of the $s$-th segment:

$$ \hat{d}_s^{(2)} = \nu_i^{(2)} + \frac{\Phi_i^{(2,1)}}{\Phi_i^{(1,1)}} \left( \hat{d}_s^{(1)} - \nu_i^{(1)} \right) \tag{23} $$

3. A matching sequence $\tilde{a}$ is determined using the durations $\hat{d}^{(1)}$ and $\hat{d}^{(2)}$. Frame matching within each segment is determined at even intervals.

Voice conversion that takes account of the speaking rate is performed by converting the spectrum based on the matching sequence $\tilde{a}$ obtained by the above procedure. The converted feature sequence is obtained as

$$ \hat{O}_{t^{(2)}}^{(2)} = \left( \sum_{t^{(1)}=1}^{T^{(1)}} \sum_{i=1}^{M} \bar{\gamma}_{t^{(1)}}(i) \, \delta(\tilde{a}_{t^{(2)}}, t^{(1)}) \, \tilde{\Sigma}_i^{-1} \right)^{-1} \times \left( \sum_{t^{(1)}=1}^{T^{(1)}} \sum_{i=1}^{M} \bar{\gamma}_{t^{(1)}}(i) \, \delta(\tilde{a}_{t^{(2)}}, t^{(1)}) \, \tilde{\Sigma}_i^{-1} C_i \tilde{O}_{t^{(1)}}^{(1)} \right) \tag{24} $$
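The duration conversion of Eq. (23) and the even-interval frame matching of step 3 can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the rounding of converted durations to whole frames, the minimum of one frame per segment, and the 0-based frame indices are all assumptions made for illustration.

```python
import numpy as np

def extract_segments(m):
    """Split a mixture index sequence into (mixture index, duration) runs."""
    segments = []
    start = 0
    for t in range(1, len(m) + 1):
        if t == len(m) or m[t] != m[start]:
            segments.append((m[start], t - start))
            start = t
    return segments

def convert_duration(d1, nu, phi):
    """Eq. (23): conditional mean of the target duration given the source duration.

    nu  : length-2 mean vector [nu1, nu2] of one duration model
    phi : 2x2 covariance matrix of the same duration model
    """
    return nu[1] + phi[1, 0] / phi[0, 0] * (d1 - nu[0])

def matching_sequence(m, nus, phis):
    """Return, for each converted frame, the matched source frame (0-based)."""
    a = []
    start = 0
    for i, d1 in extract_segments(m):
        # Assumption: round to whole frames and keep at least one frame.
        d2 = max(1, int(round(convert_duration(d1, nus[i], phis[i]))))
        # Spread the d2 converted frames at even intervals over the d1 source frames.
        idx = start + np.floor(np.arange(d2) * d1 / d2).astype(int)
        a.extend(idx.tolist())
        start += d1
    return a
```

With a hypothetical two-mixture model in which mixture 0 halves a 4-frame segment and mixture 1 doubles a 2-frame segment, `matching_sequence([0, 0, 0, 0, 1, 1], nus, phis)` yields a 6-frame matching sequence whose first two entries point into the first segment and whose last four point into the second, which is the kind of locally varying speaking-rate change that separate per-mixture duration models allow.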

In the proposed method, each mixture component of DPGMM has a different transformation function of duration; therefore, durations are converted nonlinearly and dependently on spectral information.

5. Experiments

Voice conversion experiments on the ATR Japanese speech database were conducted. Two male speakers were selected as the source and target speakers (source: mtk, target: myi). The target speaker has a more rapid speaking rate than the source speaker. Ten sentences uttered by both speakers were used for training and 50 sentences were used for evaluation. The speech data were down-sampled from 20 kHz to 16 kHz, windowed with a 5-ms frame shift using a 25-ms Blackman window, and parameterized into 24 mel-cepstral coefficients, excluding the zeroth coefficient; their first-order derivatives were used as the dynamic features. The number of mixtures was four.

Figure 4 shows the comparison of spectra for the Japanese phrase "muzukashii", which is not included in the training data. The notations "GMM" and "DPGMM" indicate the conventional methods based on GMM and DPGMM, respectively. "DUR1" and "DUR2" denote the proposed methods with linear and nonlinear duration conversion, respectively. "DUR1" uses only one linear transformation (Gaussian distribution) and is equivalent to a special case of "DUR2" in which the parameters of the duration models are shared among all mixture components. From Figure 4, the speaking rates of the conventional methods ("GMM" and "DPGMM") are similar to that of the source speech. However, the speech converted by the proposed methods ("DUR1" and "DUR2") is more rapid than the source speech. Furthermore, although the speaking rate of "DUR1" was converted by a constant ratio, "DUR2" locally changed the speaking rate dependently on spectral information.

[Figure 4: Comparison of spectrum for the phrase "muzukashii". Panels: (a) source speech, (b) "GMM", (c) "DPGMM", (d) "DUR1", (e) "DUR2".]

A DMOS (Differential Mean Opinion Score) test was performed to evaluate the similarity between the target and converted speech in speaker characteristics. The opinion score was set to a 5-point scale. Fifteen sentences were used for the evaluation set, and the number of listeners was 15.

Figure 5 shows the results of the DMOS test. Comparing the proposed methods with duration conversion ("DUR1" and "DUR2") and the conventional methods without duration conversion ("GMM" and "DPGMM"), the proposed methods are superior to the conventional methods. This means that duration conversion is effective for improving the similarity of the converted speech. Furthermore, comparing "DUR1" and "DUR2", "DUR2" obtained a higher score than "DUR1". This confirms that the nonlinear conversion using DPGMM can accurately convert durations because of the dependency on spectral information.

[Figure 5: Result of DMOS test. Vertical axis: DMOS (ticks from 2.6 to 3.4) with 95% confidence intervals; horizontal axis: GMM, DPGMM, DUR1, DUR2.]

6. Conclusion

This paper has proposed a simultaneous conversion method of duration and spectrum based on statistical models including time-sequence matching. The proposed technique converts a speaking rate dependently on spectral information. In the experiments, it was confirmed that the proposed method achieved higher performance than the conventional GMM-based approaches. A simultaneous optimization of DPGMM and duration models will be future work.

7. References

[1] M. Abe, S. Nakamura, K. Shikano, and H. Kuwabara, "Voice conversion through vector quantization," J. Acoust. Soc. Jpn., vol. 11, no. 2, pp. 71-76, 1990.

[2] Y. Stylianou, O. Cappé, and E. Moulines, "Continuous probabilistic transform for voice conversion," IEEE Trans. Speech and Audio Processing, vol. 6, no. 2, pp. 131-142, 1998.

[3] T. Toda, A. W. Black, and K. Tokuda, "Spectral conversion based on maximum likelihood estimation considering global variance of converted parameter," Proc. of ICASSP, vol. 1, pp. 9-12, Mar. 2005.

[4] Y. Nankaku, K. Nakamura, T. Toda, and K. Tokuda, "Spectral conversion based on statistical models including time-sequence matching," Proc. of ISCA Speech Synthesis Workshop, pp. 333-338, Aug. 2007.
