A Decision Tree-Based Clustering Approach to State Definition in an Excitation Modeling Framework for HMM-Based Speech Synthesis
[Figure 1: The assumed excitation modeling framework: (a) synthesis part; (b) training part. Labels recoverable from the diagram: F0 and the HMM state sequence generated from the HMMs; residual (target signal) e(n); white noise w(n) (error signal).]

3. State definition for the excitation model

From Figure 1 it can be noticed that filters $H_v(z)$ and $H_u(z)$ are associated with each HMM state position. The entire process, from state determination to excitation model training, consists of the following steps:

1. create/define states for the excitation model;
2. quantize (classify) residual segments according to the defined states;
3. calculate the filters for each cluster of residual segments using the procedure described in [1] to obtain the final excitation model.

This paper focuses on the first two steps. The next sections outline two existing methods for this task, followed by the proposed algorithm.

3.1. The phonetic decision-trees method

In the phonetic decision-trees method for state assignment, the filter states of the excitation model, $\{s_1, \ldots, s_S\}$, with $S$ being the number of states, are taken to be the terminal nodes of decision trees constructed for the spectrum stream of the HMM-based synthesizer to which the excitation model is applied. The idea of using trees built for the spectrum relies on the assumption that residual sequences are highly correlated with the spectral parameters from which they are derived by inverse filtering. Empirically, the best trees are those constructed using solely phonetic questions. In addition, the minimum description length (MDL) factor, which controls the size of the trees, is set so that only gross phonetic information, such as voiced, unvoiced, fricative, stop, etc., is classified by the trees.

3.2. The bottom-up clustering approach

The use of phonetic decision trees for state definition, as described in Section 3.1, presents some drawbacks. The first is that specific phonetic questions must be designed to cluster mel-cepstral coefficient distributions, and some supervision is needed to check whether the tree has reached an appropriate size (MDL factor control). Besides, in a more general sense, using the likelihood of mel-cepstral coefficients to cluster residual segments is at best a rough approximation for filter state definition.

To alleviate the drawbacks of the phonetic trees method, an algorithm that merges terminal nodes of the usual decision trees for mel-cepstral coefficients using maximum likelihood (ML) of residual segments has been proposed in [7]. Compared with phonetic decision trees, this hybrid approach, referred to as bottom-up clustering, has the following advantages: (1) residual ML is used to obtain the final filter clusters; (2) there is no need to design specific question sets for clustering. In this case, either the likelihood increment or the number of clusters can be used as the stopping criterion.

4. State definition by top-down clustering

Both procedures for state definition described in Section 3 rely on trees for mel-cepstral coefficients, one completely and the other partially. This section describes the proposed algorithm, which is based entirely on residual ML.

4.1. Clustering criterion: residual ML

Assuming that the noise sequence $w(n)$ output by filter $G(z)$ in Figure 1(b) is a Gaussian process, the log likelihood of the signal $u(n)$ (also a Gaussian process) is given by¹

$$\log P[\mathbf{u} \mid H_u] = -\frac{N}{2}\log 2\pi + \frac{1}{2}\log\left|\mathbf{G}^{T}\mathbf{G}\right| - \frac{1}{2}\mathbf{u}^{T}\mathbf{G}^{T}\mathbf{G}\mathbf{u}, \tag{4}$$

where $N$ is the number of samples of the whole database and

$$\mathbf{u} = [u(0) \cdots u(N-1)]^{T}, \tag{5}$$

$$\mathbf{G} = [\mathbf{g}^{(0)} \cdots \mathbf{g}^{(N-1)}], \tag{6}$$

$$\mathbf{g}^{(m)} = [\underbrace{0 \cdots 0}_{m\ \text{terms}}\ \ \mathbf{g}^{T}\ \ \underbrace{0 \cdots 0}_{N-m-1\ \text{terms}}]^{T}. \tag{7}$$

The second term on the right side of (4) can be written as

$$\frac{1}{2}\log\left|\mathbf{G}^{T}\mathbf{G}\right| = \frac{1}{2}\sum_{n=0}^{N-1}\log\left|1 - \sum_{l=1}^{L} g(l)\,e^{-j\omega_n l}\right|^{2} - N\log K, \qquad \omega_n = \frac{2\pi n}{N}, \tag{8}$$

and because $G(z)$ is minimum-phase, the first term on the right side of (8) is zero [8]. Further, if $w(n) = \frac{1}{K}\omega(n)$ is a white noise sequence with zero mean and unit variance, and $N \gg L$, the third term on the right side of (4) can be approximated as

$$-\frac{1}{2}\mathbf{u}^{T}\mathbf{G}^{T}\mathbf{G}\mathbf{u} = -\frac{1}{2K^{2}}\sum_{n=0}^{N+L-1}\omega^{2}(n) \approx -\frac{N}{2}. \tag{9}$$

Therefore, the likelihood of $e(n)$ given the excitation model² is simply a function of the unvoiced filter gain component $K$:

$$\log P[\mathbf{e} \mid H_v, H_u, \mathbf{t}] = -\frac{N}{2}\log 2\pi - N\left(\log K + \frac{1}{2}\right). \tag{10}$$

¹ Throughout this paper, bold upper- and lower-case letters represent matrices and vectors, respectively, and $[\cdot]^{T}$ denotes transposition.
² Note that $P[u(n) \mid H_u(z)]$ corresponds to $P[e(n) \mid H_v(z), H_u(z), t(n)]$.

4.2. Clustering procedure

By taking into account the state dependency of the filter coefficients, (10) can be rewritten as

$$\log P[\mathbf{e} \mid H_v, H_u, \mathbf{t}] = -\frac{N}{2}\log 2\pi + \sum_{j=1}^{S}\mathcal{L}_j, \tag{11}$$
where

$$\mathcal{L}_j = -N_j\left(\log K_j + \frac{1}{2}\right) \tag{12}$$

is the likelihood of $e(n)$ under state $s_j$, $N_j$ is its corresponding number of samples, $K_j$ is the corresponding unvoiced filter gain, and $S$ is the number of states (or clusters, assuming tied states). From (12) one can see that the smaller the gain factor $K_j$, the larger the contribution of cluster $s_j$ to the overall likelihood, weighted by the number of samples of the cluster. In fact, in voiced regions a small $K_j$ means that the power of the unvoiced excitation $u(n) = e(n) - v(n)$ of the segments belonging to cluster $s_j$ is small, which implies that $H_v(z)$ outputs a signal $v(n)$ that is close to the target $e(n)$ in Figure 1(b).

To see how $\mathcal{L}_j$ is calculated, consider the block diagram of Figure 1(b). First the voiced filter coefficients are computed, then the unvoiced excitation component $u(n)$ is determined, and finally the gain component $K_j$ is obtained. The process of splitting one cluster into two can thus be sketched as follows:

1. split $s_j$ into $s_{j1}$ and $s_{j2}$ given a candidate question;
2. calculate the voiced filter coefficients, $\mathbf{h}_{j1}$ and $\mathbf{h}_{j2}$, for the new clusters $s_{j1}$ and $s_{j2}$, respectively;
3. compute the unvoiced filter coefficients with the corresponding gain components, $\mathbf{g}_{j1}$, $K_{j1}$, $\mathbf{g}_{j2}$, and $K_{j2}$, for $s_{j1}$ and $s_{j2}$, respectively.

After calculating $\mathcal{L}_{j1}$ and $\mathcal{L}_{j2}$ from $K_{j1}$ and $K_{j2}$, respectively, using (12), the likelihood increment due to the split is given by

$$\mathcal{L}_{inc} = \mathcal{L}_{after} - \mathcal{L}_{before} = \mathcal{L}_{j1} + \mathcal{L}_{j2} - \mathcal{L}_{j}. \tag{13}$$

4.3. Approximations to decrease computational complexity

Determining the voiced filters and unvoiced filter gain components for $\{s_{jx} \mid x = 1, 2\}$ implies optimizing the filter coefficients and pulse trains for the new clusters according to the algorithm described in [1]. To decrease computational complexity, this iterative optimization is replaced by a single calculation of the voiced filters followed by linear prediction analysis of the unvoiced excitation signal $u(n)$ of the segments belonging to $s_{jx}$, which yields the gain component $K_{jx}$.

Assuming the diagram of Figure 1(b), the voiced filter coefficients for cluster $s_{jx}$ can be obtained with the least squares formulation, i.e.,

$$\mathbf{h}_{jx} = \left[\sum_{i \in s_{jx}} \mathbf{A}_i^{T}\mathbf{A}_i\right]^{-1} \sum_{i \in s_{jx}} \mathbf{A}_i^{T}\mathbf{e}_i, \tag{14}$$

where

$$\mathbf{A}_i = [\mathbf{t}_i^{(0)} \cdots \mathbf{t}_i^{(M)}], \tag{15}$$

$$\mathbf{t}_i^{(m)} = [\underbrace{0 \cdots 0}_{m\ \text{terms}}\ \ t_i(0) \cdots t_i(N_i-1)\ \ \underbrace{0 \cdots 0}_{M-m\ \text{terms}}]^{T}, \tag{16}$$

$$\mathbf{e}_i = [\underbrace{0 \cdots 0}_{M/2\ \text{terms}}\ \ e_i(0) \cdots e_i(N_i-1)\ \ \underbrace{0 \cdots 0}_{M/2\ \text{terms}}]^{T}, \tag{17}$$

with $t_i(n)$ and $e_i(n)$ being, respectively, the pulse train and residual segments with $N_i$ samples belonging to cluster $s_{jx}$. Segments are obtained by alignment performed at the HMM state level. After that, the gain $K_{jx}$ is calculated through (18) [equation (18) is not recoverable from the source] from the sequence

$$r_{jx}(l) = \sum_{i \in s_{jx}} \sum_{n=l}^{N_i-1} u_i(n)\,u_i(n-l), \qquad l = 0, \ldots, L, \tag{19}$$

which is the sum of the autocorrelation sequences of all segments $u_i(n) = e_i(n) - h_{jx}(n) * t_i(n)$, where $i \in s_{jx}$. The unvoiced filter coefficients of cluster $s_{jx}$, $\{g_{jx}(1), \ldots, g_{jx}(L)\}$, are determined from $r_{jx}(l)$ by linear prediction using the Levinson-Durbin algorithm [8].

4.4. Algorithm for decision tree construction

Decision tree construction for one HMM state position starts by grouping all residual segments into a single cluster $s_1$ ($S = 1$). After that, split iterations are carried out as shown in Table 1.

Table 1: Iterative part of the algorithm to cluster one HMM state position for the excitation model shown in Figure 1.

1) For each cluster $s_j \in \{s_1, \ldots, s_S\}$ and each question $q_i \in \{q_1, \ldots, q_Q\}$ ($Q$ is the number of questions)
   1.1) Split $s_j$ into $s_{j1}$ and $s_{j2}$ according to question $q_i$
   1.2) Calculate voiced filters $\mathbf{h}_{j1}$ and $\mathbf{h}_{j2}$ using (14)
   1.3) Calculate unvoiced filter gain components $K_{j1}$ and $K_{j2}$ from (18)
   1.4) Calculate $\mathcal{L}_{j1}$ and $\mathcal{L}_{j2}$ according to (12)
2) Select the cluster $s_j$ and question $q_i$ that result in the largest likelihood increment given by (13)
3) Make $\{\ldots, s_j, \ldots, s_S\} \rightarrow \{\ldots, s_{j1}, s_{j2}, \ldots, s_{S+1}\}$ and $\{\ldots, q_{i-1}, q_i, q_{i+1}, \ldots, q_Q\} \rightarrow \{\ldots, q_{i-1}, q_{i+1}, \ldots, q_{Q-1}\}$
4) If the stopping criterion is not fulfilled, go to Step 1
5) Stop

4.5. Stopping criterion

The stopping criterion can be set to a minimum likelihood increment threshold. However, the criterion usually employed in the clustering process during the training of HMM-based synthesizers, namely the MDL criterion, can also be applied here. Its advantage is that the size of the trees can be systematically controlled through the trade-off between likelihood increment and model complexity [9].

Let the description length of the excitation model $\lambda_i$ with cluster set $\{s_1, \ldots, s_{S_i}\}$, where $S_i$ is the number of clusters, each having a voiced and an unvoiced filter, be given by

$$\ell_i = \frac{N}{2}\log 2\pi - \sum_{j=1}^{S_i}\mathcal{L}_j + \frac{(M + L + 2)\,S_i}{2}\log N. \tag{20}$$

The first and second terms on the right side of (20) correspond to the likelihood of $\lambda_i$, whereas the third term measures its complexity [9]. After each split step, the difference in description length between the model after the split, $\lambda_{i+1}$, and the model before the split, $\lambda_i$, is

$$\Delta\ell = \ell_{i+1} - \ell_i = -\mathcal{L}_{inc} + \frac{M + L + 2}{2}\log N. \tag{21}$$

The clustering process is stopped if $\Delta\ell > 0$.

5. Experiments

The ATR503 database, recorded by a female Japanese speaker, was used to test the proposed clustering algorithm. An HMM-based synthesizer was trained for this database. Aside from F0, mel-cepstral coefficients as described in [6] formed the multi-stream observation vectors.
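Before turning to the experimental conditions, the split evaluation of Sections 4.2 to 4.5 can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that the unvoiced excitation segments $u_i(n)$ of each candidate cluster are already available (the voiced filter step (14) is omitted), and, because the gain formula (18) is not legible in the source, it assumes the usual linear prediction gain, i.e. the square root of the prediction error energy divided by the number of samples. All function names are hypothetical.

```python
import math


def autocorrelation_sum(segments, order):
    """Eq. (19): autocorrelation r(l), l = 0..order, summed over all segments."""
    r = [0.0] * (order + 1)
    for u in segments:
        for lag in range(order + 1):
            r[lag] += sum(u[n] * u[n - lag] for n in range(lag, len(u)))
    return r


def levinson_durbin(r, order):
    """Return predictor coefficients g(1..order) and the prediction error energy."""
    g = [0.0] * (order + 1)  # g[0] is unused
    err = r[0]
    for m in range(1, order + 1):
        if err <= 0.0:
            break
        acc = r[m] - sum(g[k] * r[m - k] for k in range(1, m))
        refl = acc / err
        new_g = g[:]
        new_g[m] = refl
        for k in range(1, m):
            new_g[k] = g[k] - refl * g[m - k]
        g = new_g
        err *= 1.0 - refl * refl
    return g[1:], err


def unvoiced_gain(segments, order):
    """ASSUMPTION (eq. (18) is illegible): K = sqrt(error energy / sample count)."""
    n_samples = sum(len(u) for u in segments)
    _, err = levinson_durbin(autocorrelation_sum(segments, order), order)
    return math.sqrt(max(err, 1e-12) / n_samples)  # floor avoids log(0)


def cluster_log_likelihood(n_samples, gain):
    """Eq. (12): L_j = -N_j (log K_j + 1/2)."""
    return -n_samples * (math.log(gain) + 0.5)


def likelihood_increment(l_parent, l_child1, l_child2):
    """Eq. (13): L_inc = L_j1 + L_j2 - L_j."""
    return l_child1 + l_child2 - l_parent


def mdl_delta(l_inc, voiced_order, unvoiced_order, n_total):
    """Eq. (21): change in description length; splitting stops when it is > 0."""
    return -l_inc + 0.5 * (voiced_order + unvoiced_order + 2) * math.log(n_total)


def evaluate_split(parent, child1, child2, voiced_order, unvoiced_order, n_total):
    """Score one candidate split; each argument is a list of excitation segments."""
    def score(segments):
        n = sum(len(u) for u in segments)
        return cluster_log_likelihood(n, unvoiced_gain(segments, unvoiced_order))

    l_inc = likelihood_increment(score(parent), score(child1), score(child2))
    return l_inc, mdl_delta(l_inc, voiced_order, unvoiced_order, n_total)


if __name__ == "__main__":
    # Toy data: one low-power and one high-power segment (made-up values).
    quiet = [[0.01, -0.02, 0.015, -0.01, 0.02, -0.015]]
    loud = [[1.0, -0.8, 0.9, -1.1, 0.7, -0.9]]
    l_inc, delta = evaluate_split(quiet + loud, quiet, loud,
                                  voiced_order=2, unvoiced_order=1, n_total=12)
    print("likelihood increment:", l_inc, "MDL delta:", delta)
```

In a full tree builder, `evaluate_split` would be called for every cluster and question pair as in Step 1 of Table 1, the pair with the largest increment would be kept, and growth would stop once `mdl_delta` becomes positive.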
5.1. Applying the proposed algorithm to the speech data

5.1.1. Conditions

Residual signals were extracted from the speech database by inverse filtering using the same mel-cepstral coefficients employed to train the HMM synthesizer. Full context models were used to segment the residual signals at the HMM state level. Pulse trains were derived from pitch marks and then optimized for the residual segments through the procedure described in [1]. Residual segments were then clustered using the algorithm described in Table 1. The MDL criterion described in Section 4.5 was used to stop tree growth.

5.1.2. Results

Figure 2 shows the evolution of $\mathcal{L}_{inc}$ given by (13) for all HMM state positions along the split steps. Table 2 displays the number of terminal nodes at the end of the process. A total of $S = 97$ clusters were created. It can be noticed that the central states are clustered in the most detail. This is expected since the central states include more segments.

[Figure 2: $\mathcal{L}_{inc}$ for each split iteration for all HMM states. Horizontal axis: splits.]

[Table 2: Number of terminal nodes at the end of the clustering process. Number of full context labels is 29395. Rows: HMM state, No. terminal nodes. The per-state counts are not recoverable from the source.]

Figure 3 depicts the decision tree constructed for the first HMM state. The interesting point is that, although a question set related to full context features was used by the clustering algorithm, questions related to the current phone and its left context were the ones most often selected during the process. The same result was observed for the other states. This might explain why the phonetic trees approach of Section 3.1 has been very effective for residual modeling. One can also see in Figure 3 that the voiced branch is clustered in more detail. This property was clearer for the central states. This finer granularity of the decision trees for voiced sounds was also expected.

[Figure 3: Decision tree generated for the first HMM state. The terms "C-" and "L-" mean, respectively, current and left context. Terminal nodes are represented by the yellow rectangles.]

5.2. Excitation modeling

The state configuration created by the clustering process was used to train an excitation model for the HMM-based synthesizer. Voiced and unvoiced filters were determined for each terminal node of the constructed trees through the algorithm described in [1]. Figure 4 shows the impulse responses of the final voiced filters. It can be noticed that convergence was achieved for most of the filters. This shows that the proposed algorithm was successful in clustering residual segments based on the context determined by the questions. On the other hand, a few visibly noisy impulse responses in Figure 4 suggest that bad segmentation and/or pitch marking mistakes may have affected some clusters. However, segmentation and pitch marking issues lie beyond the scope of this paper.

[Figure 4: Impulse responses of filters $H_v(z)$ derived using the state configuration yielded by the proposed algorithm. Axes: cluster, n.]

6. Conclusion

The proposed algorithm has been shown to be effective for clustering residual signals under the assumed excitation model, eliminating the rough approximation that has been applied so far with the use of trees for mel-cepstral coefficients. Although filters are computed during the clustering process, the algorithm works only as a state definer. A unified method for state clustering and filter optimization is planned as future work.

7. References

[1] R. Maia, T. Toda, H. Zen, Y. Nankaku, and K. Tokuda, "An excitation model for HMM-based speech synthesis based on residual modeling," in SSW6, 2007.
[2] O. Abdel-Hamid, S. Abdou, and M. Rashwan, "Improving the Arabic HMM based speech synthesis quality," in ICSLP, 2006.
[3] J. Cabral, S. Renals, K. Richmond, and J. Yamagishi, "Towards an improved modeling of the glottal source in statistical parametric speech synthesis," in SSW6, 2007.
[4] T. Raitio, A. Suni, H. Pulakka, M. Vainio, and P. Alku, "HMM-based Finnish text-to-speech system using glottal inverse filtering," in Interspeech, 2008.
[5] H. Kawahara, J. Estill, and O. Fujimura, "Aperiodicity extraction and control using mixed mode excitation and group delay manipulation for a high quality speech analysis, modification and synthesis system STRAIGHT," in MAVEBA, 2001.
[6] H. Zen, T. Toda, M. Nakamura, and K. Tokuda, "Details of the Nitech HMM-based speech synthesis for Blizzard Challenge 2005," IEICE Trans. on Inf. and Systems, vol. E90-D, Jan. 2007.
[7] R. Maia, T. Toda, K. Tokuda, S. Sakai, and S. Nakamura, "On the state definition for an excitation model in HMM-based speech synthesis," in ICASSP, 2008.
[8] J. D. Markel and A. H. Gray, Jr., Linear prediction of speech. Springer-Verlag, 1986.
[9] J. Rissanen, "Universal coding, information, prediction, and estimation," IEEE Trans. on Information Theory, vol. IT-30, July 1984.