Simultaneous Conversion of Duration and Spectrum Based on Statistical Models Including Time-Sequence Matching
where $m = [m_1, m_2, \ldots, m_T]$ is a mixture index sequence. The conditional distribution can also be written as a GMM, and its output probability distribution is presented as follows:

$$P(O_t^{(2)} \mid O_t^{(1)}, m_t = i, \lambda) = \mathcal{N}(O_t^{(2)}; E_i(t), D_i) \tag{4}$$

where

$$E_i(t) = \mu_i^{(2)} + \Sigma_i^{(2,1)} \Sigma_i^{(1,1)-1} \left( O_t^{(1)} - \mu_i^{(1)} \right) \tag{5}$$

$$D_i = \Sigma_i^{(2,2)} - \Sigma_i^{(2,1)} \Sigma_i^{(1,1)-1} \Sigma_i^{(1,2)} \tag{6}$$

Since equation (3) includes latent variables, the optimal sequence of $O^{(2)}$ is estimated via the EM algorithm. The EM algorithm is an iterative method for approximating the maximum likelihood estimation. It maximizes the expectation of the complete data log-likelihood, the so-called Q-function (auxiliary function):

$$Q(O^{(2)}, \hat{O}^{(2)}) = \sum_{m} \left[ P(O^{(2)}, m \mid O^{(1)}, \lambda) \ln P(\hat{O}^{(2)}, m \mid O^{(1)}, \lambda) \right] \tag{7}$$

Taking the derivative of the Q-function, the spectral sequence $\hat{O}^{(2)}$ which maximizes the Q-function is given by

$$\hat{O}^{(2)} = \left( \overline{D^{-1}} \right)^{-1} \overline{D^{-1}E} \tag{8}$$

where

$$\overline{D^{-1}} = \mathrm{diag}\left[ \overline{D_1^{-1}}, \overline{D_2^{-1}}, \ldots, \overline{D_T^{-1}} \right] \tag{9}$$

$$\overline{D_t^{-1}} = \sum_{i=1}^{M} \gamma_t(i) D_i^{-1} \tag{10}$$

$$\overline{D^{-1}E} = \left[ \overline{D^{-1}E_1}^{\top}, \overline{D^{-1}E_2}^{\top}, \ldots, \overline{D^{-1}E_T}^{\top} \right]^{\top} \tag{11}$$

$$\overline{D^{-1}E_t} = \sum_{i=1}^{M} \gamma_t(i) D_i^{-1} E_i(t) \tag{12}$$

$$\gamma_t(i) = P(m_t = i \mid O_t^{(1)}, O_t^{(2)}, \lambda) \tag{13}$$

3. Spectral Conversion Based on DPGMM

In the DPGMM-based method, we define the likelihood function $P(O^{(1)}, O^{(2)} \mid \lambda)$ including the structure of sequence matching. The simultaneous optimization is performed for DP matching and training of model parameters based on the ML criterion. The advantage of the DPGMM is that it directly represents two sequences of different lengths.

The model parameters of DPGMM are summarized as follows:

• $w = \{w_i \mid 1 \le i \le M\}$: the mixture weights of the GMM which generates the source feature sequence $O^{(1)}$, where $w_i = P(m_{t^{(1)}} = i \mid \lambda)$ is the probability of the $i$-th mixture.

• $C = \{c_n \mid 1 \le n \le N\}$: the transition probabilities of the sequence matching, where $c_n$ indicates the probability $P(a_{t^{(2)}} = a_{t^{(2)}-1} + n \mid a_{t^{(2)}-1})$. This parameter corresponds to the cost function in the DP matching.

• $B^{(1)} = \{b_i^{(1)} \mid 1 \le i \le M\}$: the output probability distributions of the source feature sequence $O^{(1)}$, where $b_i^{(1)} = P(O^{(1)}_{t^{(1)}} \mid m_{t^{(1)}} = i, \lambda)$ is the probability of the source feature vector $O^{(1)}_{t^{(1)}}$ at the $i$-th mixture, which is assumed to be a Gaussian distribution $\mathcal{N}(\mu_i^{(1)}, \Sigma_i^{(1)})$, where $\mu_i^{(1)}$ and $\Sigma_i^{(1)}$ are the mean vector and covariance matrix, respectively.

• $B^{(2)} = \{b_i^{(2)} \mid 1 \le i \le M\}$: the output distributions of the target feature sequence $O^{(2)}$, where $b_i^{(2)} = P(O^{(2)}_{t^{(2)}} \mid O^{(1)}_{a_{t^{(2)}}}, m_{a_{t^{(2)}}} = i, \lambda)$ is the probability of the target feature vector $O^{(2)}_{t^{(2)}}$ given the corresponding source feature vector $O^{(1)}_{a_{t^{(2)}}}$ at the $i$-th mixture. This conditional distribution is assumed to be a Gaussian distribution.

Using shorthand notation, the model is defined as $\lambda = \{w, C, B^{(1)}, B^{(2)}\}$. Figure 1 shows the model structure including time-sequence matching. The generative procedure is summarized as follows:

1. A mixture index sequence $m$ is determined according to the weight $P(m \mid \lambda)$.

2. A source feature sequence $O^{(1)}$ is generated from the Gaussian distribution $P(O^{(1)} \mid m, \lambda)$.

3. The frame matching between $O^{(1)}$ and $O^{(2)}$ is determined according to $P(a \mid \lambda)$.

4. The target feature sequence $O^{(2)}$ is generated according to the conditional Gaussian distribution given the source feature sequence, $P(O^{(2)} \mid O^{(1)}, m, a, \lambda)$.

Figure 1: Model structure including time-sequence matching.

The likelihood function of the observation sequences $O = \{O^{(1)}, O^{(2)}\}$ is written as follows:

$$P(O \mid \lambda) = \sum_{m, a} \left[ P(m \mid \lambda) P(O^{(1)} \mid m, \lambda) \times P(a \mid \lambda) P(O^{(2)} \mid O^{(1)}, m, a, \lambda) \right] \tag{14}$$

where $m = [m_1, m_2, \ldots, m_{T^{(1)}}]$ is a mixture index sequence and its element $m_{t^{(1)}}$ means the mixture index of the observation at time $t^{(1)}$. The variable $a = [a_1, a_2, \ldots, a_{T^{(2)}}]$ represents the temporal matching between the source and target feature sequences, and $a_{t^{(2)}} \in \{1, \ldots, T^{(1)}\}$ indicates the frame number of the source sequence which corresponds to the $t^{(2)}$-th frame of the target sequence $O^{(2)}$. Each element of the complete data likelihood is defined as follows:

$$P(m \mid \lambda) = \prod_{t^{(1)}=1}^{T^{(1)}} P(m_{t^{(1)}} \mid \lambda) \tag{15}$$

$$P(O^{(1)} \mid m, \lambda) = \prod_{t^{(1)}=1}^{T^{(1)}} \mathcal{N}\left( O^{(1)}_{t^{(1)}}; \mu^{(1)}_{m_{t^{(1)}}}, \Sigma^{(1)}_{m_{t^{(1)}}} \right) \tag{16}$$

$$P(a \mid \lambda) = \prod_{t^{(2)}=1}^{T^{(2)}} P(a_{t^{(2)}} \mid a_{t^{(2)}-1}, \lambda) \tag{17}$$

$$P(O^{(2)} \mid O^{(1)}, m, a, \lambda) = \prod_{t^{(2)}=1}^{T^{(2)}} \mathcal{N}\left( O^{(2)}_{t^{(2)}}; C_{m_{a_{t^{(2)}}}} \tilde{O}^{(1)}_{a_{t^{(2)}}}, \tilde{\Sigma}_{m_{a_{t^{(2)}}}} \right) \tag{18}$$

where $\tilde{O}^{(1)}_{t^{(1)}}$ is the source feature vector $O^{(1)}_{t^{(1)}}$ augmented with a constant element 1, $C_i$ is the transformation matrix of the $i$-th mixture, and $C_i \tilde{O}^{(1)}_{t^{(1)}}$ and $\tilde{\Sigma}_i$ are the mean vector and the covariance matrix, respectively.

The parameters of DPGMM can be estimated via the variational EM algorithm [4]. In the conversion process, the converted feature sequence can be obtained by maximizing a lower bound of the likelihood. The optimal sequence is given by the following equation:

$$\hat{O}^{(2)}_{t^{(2)}} = \left( \sum_{t^{(1)}=1}^{T^{(1)}} \sum_{i=1}^{M} \gamma^{(1)}_{t^{(1)}}(i)\, \delta(t^{(1)}, t^{(2)})\, \tilde{\Sigma}_i^{-1} \right)^{-1} \times \left( \sum_{t^{(1)}=1}^{T^{(1)}} \sum_{i=1}^{M} \gamma^{(1)}_{t^{(1)}}(i)\, \delta(t^{(1)}, t^{(2)})\, \tilde{\Sigma}_i^{-1} C_i \tilde{O}^{(1)}_{t^{(1)}} \right) \tag{20}$$

where $\delta(\cdot)$ is the Kronecker delta function: $\delta(u, v) = 1$ if $u = v$, $\delta(u, v) = 0$ otherwise, and $\gamma^{(1)}_{t^{(1)}}(i)$ denotes the expectation of a mixture index $m_{t^{(1)}}$ with respect to the posterior distribution:

$$\gamma^{(1)}_{t^{(1)}}(i) = P(m_{t^{(1)}} = i \mid O^{(1)}, O^{(2)}, \lambda) = \sum_{m, a} P(m, a \mid O^{(1)}, O^{(2)}, \lambda)\, \delta(m_{t^{(1)}}, i) \tag{21}$$

4. Simultaneous Conversion of Duration and Spectrum

Although the DPGMM can represent different length sequences of source and target features, one-to-one frame matching is assumed in the conversion process (Eq. (20)), because the Markovian transition probability $P(a \mid \lambda)$ is insufficient to convert durations.

To convert a speaking rate, we define duration models attached to each mixture of DPGMM. The duration of the $s$-th segment is represented by a joint duration vector $d_s = [d_s^{(1)}, d_s^{(2)}]^{\top}$ which consists of the source duration $d_s^{(1)}$ and the target duration $d_s^{(2)}$. A segment means a period in which the same mixture component continues. Duration models are represented by 2-dimensional Gaussian distributions $\mathcal{N}(d_s \mid \nu_i, \Phi_i)$ with the mean vector $\nu_i$ and the covariance matrix $\Phi_i$, and each component of these parameters is defined as follows:

$$\nu_i = \begin{bmatrix} \nu_i^{(1)} \\ \nu_i^{(2)} \end{bmatrix}, \qquad \Phi_i = \begin{bmatrix} \phi_i^{(1,1)} & \phi_i^{(1,2)} \\ \phi_i^{(2,1)} & \phi_i^{(2,2)} \end{bmatrix} \tag{22}$$

Figure 2 shows an overview of training duration models, and the procedure is summarized as follows:

1. Determine the mixture index sequence $m$ and frame matching $a$ so as to maximize the posterior probability $P(m, a \mid O^{(1)}, O^{(2)}, \lambda)$.

2. Generate duration vectors $d_s$, $s = 1, \ldots, S$, from $m$ and $a$ obtained in step 1.

3. Estimate duration models for each mixture component using the corresponding duration vectors.

Figure 2: Training of duration models.

The simultaneous conversion of duration and spectrum is performed based on DPGMM with duration models. An overview of duration conversion is shown in Figure 3, and the procedure is summarized as follows:

1. Determine the mixture index sequence $\hat{m}$ which maximizes the posterior probability given an input feature sequence, $P(m \mid O^{(1)}, \lambda)$.

2. Extract the source duration $\hat{d}^{(1)}$ from the mixture index sequence $\hat{m}$ and convert it into the target duration using the following equation:

$$\hat{d}^{(2)} = \nu_i^{(2)} + \frac{\phi_i^{(2,1)}}{\phi_i^{(1,1)}} \left( \hat{d}^{(1)} - \nu_i^{(1)} \right) \tag{23}$$

3. A matching sequence $\hat{a}$ is determined using the durations $\hat{d}^{(1)}$ and $\hat{d}^{(2)}$. Frame matching within each segment is determined at even intervals.

Figure 3: Duration conversion.

The voice conversion taking account of a speaking rate is performed by converting the spectrum based on the matching sequence $\hat{a}$ obtained by the above procedure. The converted feature sequence is obtained as

$$\hat{O}^{(2)}_{t^{(2)}} = \left( \sum_{t^{(1)}=1}^{T^{(1)}} \sum_{i=1}^{M} \gamma^{(1)}_{t^{(1)}}(i)\, \delta(\hat{a}_{t^{(2)}}, t^{(1)})\, \tilde{\Sigma}_i^{-1} \right)^{-1} \times \left( \sum_{t^{(1)}=1}^{T^{(1)}} \sum_{i=1}^{M} \gamma^{(1)}_{t^{(1)}}(i)\, \delta(\hat{a}_{t^{(2)}}, t^{(1)})\, \tilde{\Sigma}_i^{-1} C_i \tilde{O}^{(1)}_{t^{(1)}} \right) \tag{24}$$
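The duration conversion procedure (segment extraction, Eq. (23), and even-interval frame matching) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the tuple layout of the duration model, rounding the converted duration to whole frames, keeping at least one frame per segment, and sampling at interval midpoints are all choices made here for the example. The paper only states that matching within a segment is "at even intervals".

```python
from itertools import groupby
from typing import Dict, List, Tuple

# Hypothetical layout for one mixture's duration model N(d | nu_i, Phi_i):
# ((nu1, nu2), ((phi11, phi12), (phi21, phi22)))
DurationModel = Tuple[Tuple[float, float],
                      Tuple[Tuple[float, float], Tuple[float, float]]]


def segments(mixture_seq: List[int]) -> List[Tuple[int, int]]:
    """Split a mixture index sequence into (mixture, source_duration) runs."""
    return [(m, sum(1 for _ in run)) for m, run in groupby(mixture_seq)]


def convert_duration(d_src: float, model: DurationModel) -> float:
    """Conditional mean of the target duration given the source duration (Eq. 23)."""
    (nu1, nu2), ((phi11, _phi12), (phi21, _phi22)) = model
    return nu2 + phi21 / phi11 * (d_src - nu1)


def matching_sequence(mixture_seq: List[int],
                      models: Dict[int, DurationModel]) -> List[int]:
    """For each target frame, return the 1-based index of the matched source frame."""
    a_hat: List[int] = []
    start = 0  # 0-based index of the first source frame in the current segment
    for m, d_src in segments(mixture_seq):
        # Assumption: round to whole frames and keep at least one frame.
        d_tgt = max(1, round(convert_duration(d_src, models[m])))
        for k in range(d_tgt):
            # Assumption: sample the source segment at evenly spaced midpoints.
            offset = min(d_src - 1, int((k + 0.5) * d_src / d_tgt))
            a_hat.append(start + offset + 1)
        start += d_src
    return a_hat


if __name__ == "__main__":
    # Toy models: mixture 0 halves a 4-frame segment, mixture 1 doubles a 2-frame one.
    toy_models = {
        0: ((4.0, 2.0), ((1.0, 0.5), (0.5, 1.0))),
        1: ((2.0, 4.0), ((1.0, 1.0), (1.0, 2.0))),
    }
    print(matching_sequence([0, 0, 0, 0, 1, 1], toy_models))  # [2, 4, 5, 5, 6, 6]
```

The resulting list plays the role of $\hat{a}$ in Eq. (24): its length is the converted number of target frames, and each entry selects the source frame whose posterior-weighted statistics generate that target frame.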
In the proposed method, each mixture component of DPGMM has a different transformation function of duration; therefore, durations are converted nonlinearly and dependently on spectral information.

5. Experiments

Voice conversion experiments on the ATR Japanese speech database were conducted. Two male speakers were selected as a source and a target speaker (source: mtk, target: myi). The target speaker has a more rapid speaking rate than the source speaker. Ten sentences uttered by both speakers were used for training and 50 sentences were used for evaluation. The speech data were down-sampled from 20 kHz to 16 kHz, windowed at a 5-ms frame shift using a 25-ms Blackman window, and parameterized into 24 mel-cepstral coefficients excepting the zero-th coefficient; their first-order derivatives were used as the dynamic features. The number of mixtures is four.

Figure 4 shows the comparison of spectra for a Japanese phrase "muzukashii" which is not included in the training data. The notations "GMM" and "DPGMM" indicate the conventional methods based on GMM and DPGMM, respectively. "DUR1" and "DUR2" mean the proposed methods with linear and nonlinear duration conversion, respectively. "DUR1" uses only one linear transformation (Gaussian distribution) and is equivalent to a special case of "DUR2" in which the parameters of the duration models are shared among all mixture components. From Figure 4, the speaking rates of the conventional methods ("GMM" and "DPGMM") are similar to that of the source speech. However, the spectra converted by the proposed methods ("DUR1" and "DUR2") progress more rapidly than that of the source speech. Furthermore, although the speaking rate of "DUR1" was converted by a constant ratio, "DUR2" locally changed the speaking rate dependently on spectral information.

Figure 4: Comparison of spectrum for a phrase "muzukashii". Panels: (a) source speech, (b) "GMM", (c) "DPGMM", (d) "DUR1", (e) "DUR2".

A DMOS (Differential Mean Opinion Score) test was performed for evaluating the similarity between the target and converted speech in speaker characteristics. The opinion score was set to a 5-point scale. Fifteen sentences were used for the evaluation set, and the number of listeners was 15.

Figure 5 shows the results of the DMOS test. Comparing the proposed methods with duration conversion ("DUR1" and "DUR2") and the conventional methods without duration conversion ("GMM" and "DPGMM"), the proposed methods are superior to the conventional methods. This means that the duration conversion is effective for improving the similarity of the converted speech. Furthermore, comparing "DUR1" and "DUR2", "DUR2" obtained a higher score than "DUR1". This confirms that the nonlinear conversion using DPGMM can convert durations accurately because of its dependency on spectral information.

Figure 5: Result of DMOS test for "GMM", "DPGMM", "DUR1", and "DUR2", with 95% confidence intervals.

6. Conclusion

This paper has proposed a simultaneous conversion method of duration and spectrum based on statistical models including time-sequence matching. The proposed technique converts a speaking rate dependently on spectral information. In the experiments, it was confirmed that the proposed method achieved a higher performance than the conventional GMM-based approaches. A simultaneous optimization of DPGMM and duration models will be future work.

7. References

[1] M. Abe, S. Nakamura, K. Shikano, and H. Kuwabara, "Voice conversion through vector quantization," J. Acoust. Soc. Jpn., vol. 11, no. 2, pp. 71-76, 1990.

[2] Y. Stylianou, O. Cappé, and E. Moulines, "Continuous probabilistic transform for voice conversion," IEEE Trans. Speech and Audio Processing, vol. 6, no. 2, pp. 131-142, 1998.

[3] T. Toda, A.W. Black, and K. Tokuda, "Spectral conversion based on maximum likelihood estimation considering global variance of converted parameter," Proc. of ICASSP, vol. 1, pp. 9-12, Mar. 2005.

[4] Y. Nankaku, K. Nakamura, T. Toda, and K. Tokuda, "Spectral conversion based on statistical models including time-sequence matching," Proc. of ISCA Speech Synthesis Workshop, pp. 333-338, Aug. 2007.