A Decision Tree-Based Clustering Approach to State Definition in an Excitation Modeling Framework for HMM-Based Speech Synthesis


Ranniery Maia†, Tomoki Toda†,‡, Keiichi Tokuda†,††, Shinsuke Sakai†, Satoshi Nakamura†
†National Institute of Information and Communications Technology, Japan
‡Nara Institute of Science and Technology, Japan
††Nagoya Institute of Technology, Japan
{ranniery.maia, shinsuke.sakai, satoshi.nakamura}@nict.go.jp, [email protected], [email protected]

Copyright © 2009 ISCA. 6-10 September, Brighton UK.

Abstract

This paper presents a decision tree-based algorithm to cluster residual segments assuming an excitation model based on state-dependent filtering of pulse train and white noise. The decision tree construction principle is the same as the one applied to speech recognition. Here parent nodes are split using the residual maximum likelihood criterion. Once these excitation decision trees are constructed for residual signals segmented by full context models, using questions related to the full context of the training sentences, they can be utilized for excitation modeling in speech synthesis based on hidden Markov models (HMM). Experimental results have shown that the algorithm in question is very effective in terms of clustering residual signals given segmentation, pitch marks and full context questions, resulting in filters with good residual modeling properties.

Index Terms: speech synthesis, HMM-based speech synthesis, decision tree-based clustering, residual modeling.

1. Introduction

Efforts have been made recently to enhance HMM-based speech synthesizers with the goal of making them able to produce close to natural speech while keeping some interesting properties, such as use of voice transformation techniques, utilization of small corpora and small footprint demand. By focusing on the synthesis engine part, some approaches have been reported to improve HMM-based TTS systems through the design of better excitation modeling, e.g. [1, 2, 3, 4]. Some of these techniques model auxiliary parameters by HMMs to enable the construction of a better excitation signal. For instance, modeling of the so-called bandpass aperiodicity parameters and eventual use of the excitation scheme reported in [5] at run-time is a component of the system described in [6]. In [2] a sinusoidal modeling approach is proposed, in [3] the Liljencrants-Fant (LF) model is applied, and utilization of glottal inverse filtering is reported in [4].

In [1] a residual modeling approach for HMM-based speech synthesis is described. The method is based on the principle of analysis-by-synthesis speech coders and consists of the optimization of state-dependent filter coefficients through the minimization of the difference between synthetic excitation and residual. Although good performance was achieved, state definition remained an open issue. Initially, as in the experiments presented in [1], filter states were regarded as leaves of decision trees for mel-cepstral coefficients. Later, in [7], a state definition approach was reported, which consisted of merging initial clusters represented by leaves of decision trees for mel-cepstral coefficients. The merging criterion utilized in that case was maximum likelihood (ML) of residual, and good results in terms of filter modeling were reported. However, since this approach still relies on decision trees constructed for mel-cepstral coefficients, it is natural to wonder whether further improvements can be achieved if the entire clustering process is based on residual likelihood.

This paper presents a way to define filter states of the excitation model of [1] through an algorithm that performs tree-based clustering of residual signals. In Section 2 the residual modeling framework of [1] is outlined. In Section 3 previous approaches for state definition are revisited, and in Section 4 the proposed algorithm for residual clustering is described. Section 5 shows some experiments and the conclusions are in Section 6.

2. The excitation modeling framework

Figure 1 depicts the excitation model described in [1], which is applied to HMM speech synthesis instead of the excitation scheme where pulse train and white noise are assigned to voiced and unvoiced segments, respectively. During the synthesis stage, shown in Figure 1(a), the voiced filter $H_v(z)$ processes the pulse train $t(n)$ and outputs the voiced excitation signal $v(n)$, which mimics the voiced portions of the residual database. The unvoiced filter $H_u(z)$, on the other hand, weights the noise sequence $w(n)$ to produce the unvoiced component $u(n)$ of the excitation signal $\tilde{e}(n)$. Voiced and unvoiced filters vary according to each HMM state position $s_j$ from the sequence $\{s_1, \ldots, s_S\}$, with $S$ being the number of states covered by the sentence to be synthesized. During the training part, the residual signal $e(n)$ becomes the target of the analysis-by-synthesis block of Figure 1(b) and the filters

$$H_v(z) = \sum_{l=-M/2}^{M/2} h(l)\, z^{-l}, \tag{1}$$

$$H_u(z) = \frac{K}{1 - \sum_{l=1}^{L} g(l)\, z^{-l}}, \tag{2}$$

are iteratively optimized in the sense of minimizing the mean squared error of the system of Figure 1(b), given by

$$E\{w^2(n)\} = E\left\{\left\{ g'(n) * \left[ e(n) - h(n) * t(n) \right] \right\}^2\right\}, \tag{3}$$

where $g(l)$ and $h(l)$ are respectively the coefficients of $H_u(z)$ and $H_v(z)$, $K$ is the gain term of $H_u(z)$, and $g'(n)$ is the impulse response of $H_u^{-1}(z)$.
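As an illustration of the synthesis part in Figure 1(a), the following Python sketch generates the excitation of a single state from filters of the form (1) and (2). It is a minimal sketch under stated assumptions, not the implementation of [1]: the function names, the treatment of samples outside the segment as zero, the summation of the voiced and unvoiced components into one signal, and the toy coefficient values are all chosen here for illustration.

```python
import random


def voiced_excitation(h, t):
    """v(n) = sum_{l=-M/2}^{M/2} h(l) t(n - l), as in (1).

    h holds M + 1 taps; list index k corresponds to lag l = k - M/2.
    Samples of t outside the segment are treated as zero (assumption).
    """
    half = len(h) // 2
    n_samples = len(t)
    v = [0.0] * n_samples
    for n in range(n_samples):
        acc = 0.0
        for k, coef in enumerate(h):
            lag = k - half
            if 0 <= n - lag < n_samples:
                acc += coef * t[n - lag]
        v[n] = acc
    return v


def unvoiced_excitation(g, K, w):
    """u(n) = K w(n) + sum_{l=1}^{L} g(l) u(n - l), as in (2)."""
    u = []
    for n, w_n in enumerate(w):
        acc = K * w_n
        for lag, g_l in enumerate(g, start=1):
            if n - lag >= 0:
                acc += g_l * u[n - lag]
        u.append(acc)
    return u


def excitation(h, g, K, t, w):
    """Sum of voiced and unvoiced components for one state (assumed mixing)."""
    v = voiced_excitation(h, t)
    u = unvoiced_excitation(g, K, w)
    return [a + b for a, b in zip(v, u)]


# Toy example: two pitch pulses, M = 2, L = 1 (all values hypothetical).
random.seed(0)
t = [0.0] * 16
t[4] = 1.0
t[12] = 1.0
w = [random.gauss(0.0, 1.0) for _ in t]
e_tilde = excitation([0.1, 0.8, 0.1], [0.5], 0.2, t, w)
```

In a full synthesizer the filter coefficients would change at each HMM state boundary; the sketch keeps a single state to show only the filtering itself.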

Figure 1: The assumed excitation modeling framework: (a) synthesis part; (b) training part.

3. State definition for the excitation model

From Figure 1 it can be noticed that filters $H_v(z)$ and $H_u(z)$ are associated with each HMM state position. The entire process spanning from state determination to excitation model training can be enumerated through the following steps:

1. create/define states for the excitation model;
2. quantize (classify) residual segments according to the defined states;
3. calculate filters for each cluster of residual segments using the procedure described in [1] to achieve the final excitation model.

This paper focuses on the first two steps enumerated above. In the next sections, two existing methods to perform this task are outlined before the proposed algorithm is presented.

3.1. The phonetic decision-trees method

In the phonetic decision trees method for state assignment, filter states for the excitation model, $\{s_1, \ldots, s_S\}$, with $S$ being the number of states, are regarded as terminal nodes of decision trees constructed for the spectrum stream of the HMM-based synthesizer to which the excitation model is applied. The idea of utilizing trees for spectrum relies on the assumption that residual sequences are highly correlated with the spectral parameters from which they are derived by inverse filtering. Empirically, the best trees are the ones constructed using solely phonetic questions. In addition, the minimum description length (MDL) factor, used to control the size of the trees, is set so that only gross phonetic information, such as voiced, unvoiced, fricative and stop classes, can be classified by the trees.

3.2. The bottom-up clustering approach

The use of phonetic decision trees for state definition, as described in Section 3.1, presents some drawbacks. The first one is that it is necessary to design specific phonetic questions to cluster mel-cepstral coefficient distributions, as well as some supervision to check whether the tree has achieved an appropriate size (MDL factor control). Besides, in a more general sense, the approach itself of making use of mel-cepstral coefficient likelihood to cluster residual segments is at best a rough approximation for filter state definition.

To alleviate the drawbacks of the phonetic trees method, an algorithm to merge terminal nodes of the usual decision trees for mel-cepstral coefficients using ML of residual segments has been proposed in [7]. This hybrid approach, referred to as bottom-up clustering, has the following advantages over phonetic decision trees: (1) residual ML is used to obtain the final filter clusters; (2) there is no need to design specific question sets for clustering. In this case, likelihood increment or number of clusters can be used as stopping criterion.

4. State definition by top-down clustering

Both procedures for state definition described in Section 3 rely on the use of trees for mel-cepstral coefficients, one completely and the other partially. This section describes the proposed algorithm, which is entirely based on residual ML.

4.1. Clustering criterion: residual ML

Assuming that the noise sequence $w(n)$ which is output by filter $G(z)$ in Figure 1(b) is a Gaussian process, the log likelihood of the signal $u(n)$ (also a Gaussian process) is given by

$$\log P[\mathbf{u} \mid H_u] = -\frac{N}{2}\log 2\pi + \frac{1}{2}\log\left|\mathbf{G}^T\mathbf{G}\right| - \frac{1}{2}\mathbf{u}^T\mathbf{G}^T\mathbf{G}\mathbf{u}, \tag{4}$$

where $N$ is the number of samples of the whole database and

$$\mathbf{u} = [u(0) \cdots u(N-1)]^T, \tag{5}$$

$$\mathbf{G} = [\mathbf{g}^{(0)} \cdots \mathbf{g}^{(N-1)}], \tag{6}$$

$$\mathbf{g}^{(m)} = \Big[\underbrace{0 \cdots 0}_{m\ \text{terms}}\ \ \tfrac{1}{K}\ \ -\tfrac{g(1)}{K} \cdots -\tfrac{g(L)}{K}\ \ \underbrace{0 \cdots 0}_{N-m-1\ \text{terms}}\Big]^T. \tag{7}$$

(Throughout this paper bold upper and lower case letters represent matrices and vectors, respectively, and $[\cdot]^T$ means transposition.)

The second term in the right side of (4) can be written as

$$\frac{1}{2}\log\left|\mathbf{G}^T\mathbf{G}\right| = \frac{1}{2}\sum_{n=0}^{N-1}\log\left|1 - \sum_{l=1}^{L} g(l)\, e^{-j\omega_n l}\right|^2 - N\log K, \tag{8}$$

and because $G(z)$ is minimum-phase, the first term in the right side of (8) is zero [8]. Further, if $w(n)$, the gain-normalized output of $G(z)$, is a white noise sequence with variance one and mean zero, and $N \gg L$, the third term in the right side of (4) can be approximated as

$$-\frac{1}{2}\mathbf{u}^T\mathbf{G}^T\mathbf{G}\mathbf{u} \approx -\frac{1}{2}(N+L)\,E\{w^2(n)\} \approx -\frac{N}{2}. \tag{9}$$

Therefore, the likelihood of $e(n)$ given the excitation model is simply a function of the unvoiced filter gain component $K$,

$$\log P[\mathbf{e} \mid H_v, H_u, t] = -\frac{N}{2}\log 2\pi - N\left(\log K + \frac{1}{2}\right). \tag{10}$$

(Note that $P[u(n) \mid H_u(z)] \Leftrightarrow P[e(n) \mid H_v(z), H_u(z), t(n)]$.)

4.2. Clustering procedure

By taking into account the state dependency of the filter coefficients, (10) can be rewritten as a sum over states:

$$\log P[\mathbf{e} \mid H_v, H_u, t] = -\frac{N}{2}\log 2\pi + \sum_{j=1}^{S} \mathcal{L}_j. \tag{11}$$
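To make the role of the gain in (10) concrete, the short Python sketch below evaluates the log likelihood for a given number of samples and gain, and also in the per-state form of (11). The function names and the numeric values are hypothetical; the per-state contribution is taken to be $-N_j(\log K_j + 1/2)$, which is what (10) gives when restricted to the $N_j$ samples of one state.

```python
import math


def log_likelihood(num_samples, gain):
    """Eq. (10): -(N/2) log(2 pi) - N (log K + 1/2)."""
    if num_samples <= 0 or gain <= 0.0:
        raise ValueError("num_samples and gain must be positive")
    return (-0.5 * num_samples * math.log(2.0 * math.pi)
            - num_samples * (math.log(gain) + 0.5))


def log_likelihood_by_state(states):
    """Eq. (11) for a list of (N_j, K_j) pairs, one pair per state."""
    total = sum(n_j for n_j, _ in states)
    per_state = sum(-n_j * (math.log(k_j) + 0.5) for n_j, k_j in states)
    return -0.5 * total * math.log(2.0 * math.pi) + per_state


# Halving the gain raises the log likelihood by N * log(2).
gain_effect = log_likelihood(1000, 0.5) - log_likelihood(1000, 1.0)

# With identical gains in every state, (11) reduces to (10).
tied = log_likelihood_by_state([(400, 0.7), (600, 0.7)])
```

The comparison illustrates why the clustering favors partitions whose unvoiced gains are small: a smaller gain means that less of the residual is left unexplained by the voiced filter.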

The term $\mathcal{L}_j$ in (11),

$$\mathcal{L}_j = -N_j\left(\log K_j + \frac{1}{2}\right), \tag{12}$$

is the log likelihood of $e(n)$ under state $s_j$, $N_j$ is its corresponding number of samples, $K_j$ is the corresponding unvoiced filter gain, and $S$ is the number of states (or clusters, assuming that we are dealing with tied states). From (12) one can see that the smaller the gain factor $K_j$ is, the larger the contribution of cluster $s_j$ to the overall likelihood, weighted by the number of samples of the cluster. In fact, considering voiced regions, a small $K_j$ means that the power of the unvoiced excitation $u(n) = e(n) - v(n)$ of segments belonging to cluster $s_j$ is small, which implies that $H_v(z)$ outputs a signal $v(n)$ which is close to the target $e(n)$ in Figure 1(b).

To visualize the way to calculate $\mathcal{L}_j$ it is necessary to consider the block diagram of Figure 1(b). Initially voiced filter coefficients are computed, followed by the determination of the unvoiced excitation component $u(n)$, finally leading to the gain component $K_j$. The process of splitting one cluster into two can thus be sketched as follows:

1. split $s_j$ into $s_{j1}$ and $s_{j2}$ given a candidate question;
2. calculate voiced filter coefficients, $\mathbf{h}_{j1}$ and $\mathbf{h}_{j2}$, for the new clusters $s_{j1}$ and $s_{j2}$, respectively;
3. compute unvoiced filter coefficients with corresponding gain components, $\mathbf{g}_{j1}$, $K_{j1}$, $\mathbf{g}_{j2}$, and $K_{j2}$, respectively for $s_{j1}$ and $s_{j2}$.

After calculating $\mathcal{L}_{j1}$ and $\mathcal{L}_{j2}$ from $K_{j1}$ and $K_{j2}$, respectively, using (12), the likelihood increment due to the split is given by

$$\mathcal{L}_{inc} = \mathcal{L}_{after} - \mathcal{L}_{before} = \mathcal{L}_{j1} + \mathcal{L}_{j2} - \mathcal{L}_j. \tag{13}$$

4.3. Approximations to decrease computational complexity

The determination of voiced filters and unvoiced filter gain components for $\{s_{jx} \mid x = 1, 2\}$ implies optimization of filter coefficients and pulse trains for the new clusters, according to the algorithm described in [1]. In order to decrease computational complexity, this iterative optimization is replaced by a single calculation of voiced filters followed by linear prediction analysis of the unvoiced excitation signal $u(n)$ under segments belonging to $s_{jx}$ to derive the gain component $K_{jx}$.

Assuming the diagram of Figure 1(b), voiced filter coefficients for cluster $s_{jx}$ can be obtained by using the least squares formulation, i.e.,

$$\mathbf{h}_{jx} = \Big[\sum_{i \in S_{jx}} \mathbf{A}_i^T\mathbf{A}_i\Big]^{-1} \sum_{i \in S_{jx}} \mathbf{A}_i^T\mathbf{e}_i, \tag{14}$$

where

$$\mathbf{A}_i = [\mathbf{t}_i^{(0)} \cdots \mathbf{t}_i^{(M)}], \tag{15}$$

$$\mathbf{t}_i^{(m)} = \Big[\underbrace{0 \cdots 0}_{m\ \text{terms}}\ \ t_i(0) \cdots t_i(N_i-1)\ \ \underbrace{0 \cdots 0}_{M-m\ \text{terms}}\Big]^T, \tag{16}$$

$$\mathbf{e}_i = \Big[\underbrace{0 \cdots 0}_{M/2\ \text{terms}}\ \ e_i(0) \cdots e_i(N_i-1)\ \ \underbrace{0 \cdots 0}_{M/2\ \text{terms}}\Big]^T, \tag{17}$$

with $t_i(n)$ and $e_i(n)$ being respectively pulse train and residual segments with $N_i$ samples belonging to cluster $s_{jx}$. Segments are obtained by alignment performed at the HMM state level. After that, the gain $K_{jx}$ is calculated from

$$K_{jx} = \sqrt{\frac{1}{N_{jx}}\Big[r_{jx}(0) - \sum_{l=1}^{L} g_{jx}(l)\, r_{jx}(l)\Big]}, \tag{18}$$

where $N_{jx}$ is the number of samples in cluster $s_{jx}$ and

$$r_{jx}(l) = \sum_{i \in S_{jx}} \sum_{n=l}^{N_i-1} u_i(n)\, u_i(n-l), \qquad l = 0, \ldots, L, \tag{19}$$

being the sum of autocorrelation sequences of all segments of $u_i(n) = e_i(n) - h_{jx}(n) * t_i(n)$, where $i \in S_{jx}$. The unvoiced filter coefficients of cluster $s_{jx}$, $\{g_{jx}(1), \ldots, g_{jx}(L)\}$, are determined from $r_{jx}(l)$ by linear prediction using the Levinson-Durbin algorithm [8].

4.4. Algorithm for decision tree construction

Decision tree construction for one HMM state position starts by grouping all residual segments into a single cluster $s_1$ ($S = 1$). After that, split iterations are carried out as shown in Table 1.

Table 1: Iterative part of the algorithm to cluster one HMM state position for the excitation model shown in Figure 1.

1) For each cluster $s_j \in \{s_1, \ldots, s_S\}$ and each question $q_i \in \{q_1, \ldots, q_Q\}$ ($Q$ is the number of questions):
   1.1) Split $s_j$ into $s_{j1}$ and $s_{j2}$ according to question $q_i$
   1.2) Calculate voiced filters $\mathbf{h}_{j1}$ and $\mathbf{h}_{j2}$ using (14)
   1.3) Calculate unvoiced filter gain components $K_{j1}$ and $K_{j2}$ from (18)
   1.4) Calculate $\mathcal{L}_{j1}$ and $\mathcal{L}_{j2}$ according to (12)
2) Select the cluster $s_j$ and question $q_i$ that result in the largest likelihood increment given by (13)
3) Make $\{\ldots, s_j, \ldots, s_S\} \rightarrow \{\ldots, s_{j1}, s_{j2}, \ldots, s_{S+1}\}$ and $\{\ldots, q_{i-1}, q_i, q_{i+1}, \ldots, q_Q\} \rightarrow \{\ldots, q_{i-1}, q_{i+1}, \ldots, q_{Q-1}\}$
4) If stopping criterion is not fulfilled go to Step 1
5) Stop with the current set of clusters

4.5. Stopping criterion

The stopping criterion can be set to a minimum likelihood increment threshold. However, the same criterion usually employed in the clustering process during the training of HMM-based synthesizers, namely the MDL criterion, can also be applied here. The advantage in this case is that the size of the trees can be systematically controlled based on the trade-off between likelihood increment and model complexity [9].

Let the description length of excitation model $\lambda_i$ with cluster set $\{s_1, \ldots, s_{S_i}\}$, where $S_i$ is the number of clusters, each one of them having a voiced and an unvoiced filter, be given by

$$\ell_i = \frac{N}{2}\log 2\pi - \sum_{j=1}^{S_i} \mathcal{L}_j + \frac{(M + L + 2)\,S_i}{2}\log N. \tag{20}$$

The first and second terms in the right side of (20) correspond to the likelihood of $\lambda_i$ whereas the third term measures its complexity [9]. After each split step, the difference of description length between the model after the split, $\lambda_{i+1}$, and the model before the split, $\lambda_i$, is

$$\Delta\ell = \ell_{i+1} - \ell_i = -\mathcal{L}_{inc} + \frac{M + L + 2}{2}\log N. \tag{21}$$

The clustering process is stopped if $\Delta\ell > 0$.

5. Experiments

The ATR503 database recorded by a female Japanese speaker was utilized to test the proposed clustering algorithm. An HMM-based synthesizer was trained for this database. Aside from $F_0$, mel-cepstral coefficients as described in [6] formed the multi-stream observation vectors.
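The split evaluation of Section 4 (linear prediction gain, state likelihood, likelihood increment and MDL stopping rule) can be illustrated with the following Python sketch. It is a sketch under stated assumptions rather than the authors' implementation: the function names are hypothetical, the gain is assumed to be the square root of the prediction-error energy per sample, and the voiced-filter least squares step is omitted, so the autocorrelation of the unvoiced excitation is taken as given.

```python
import math


def levinson_durbin(r, order):
    """Levinson-Durbin recursion on autocorrelation values r[0..order].

    Returns (g, err): predictor coefficients g(1..order), with
    u(n) ~ sum_l g(l) u(n - l), and the prediction-error energy.
    """
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[l] * r[i - l] for l in range(1, i))
        k = acc / err                      # reflection coefficient
        prev = a[:]
        a[i] = k
        for l in range(1, i):
            a[l] = prev[l] - k * prev[i - l]
        err *= 1.0 - k * k
    return a[1:], err


def gain_from_autocorrelation(r, order, n_samples):
    """Assumed normalization: K^2 is the prediction-error energy per sample."""
    _, err = levinson_durbin(r, order)
    return math.sqrt(err / n_samples)


def state_log_likelihood(n_samples, gain):
    """Per-state term: -N_j (log K_j + 1/2)."""
    return -n_samples * (math.log(gain) + 0.5)


def likelihood_increment(parent, child1, child2):
    """Likelihood gain of a split; each argument is an (N, K) pair."""
    return (state_log_likelihood(*child1) + state_log_likelihood(*child2)
            - state_log_likelihood(*parent))


def mdl_delta(l_inc, M, L, total_samples):
    """Change in description length: -L_inc + (M + L + 2)/2 * log N."""
    return -l_inc + 0.5 * (M + L + 2) * math.log(total_samples)


def accept_split(parent, child1, child2, M, L, total_samples):
    """A split is kept only while the description length does not grow."""
    l_inc = likelihood_increment(parent, child1, child2)
    return mdl_delta(l_inc, M, L, total_samples) <= 0.0


# Toy numbers (hypothetical): a split that lowers both child gains is
# accepted, one that leaves the gains unchanged is rejected.
useful = accept_split((2000, 1.0), (1000, 0.6), (1000, 0.9), 10, 4, 100000)
useless = accept_split((2000, 1.0), (1000, 1.0), (1000, 1.0), 10, 4, 100000)
```

In a complete tree builder these functions would be called for every cluster and question pair, and only the pair with the largest likelihood increment would be tested against the MDL rule at each iteration.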

Table 2: Number of terminal nodes at the end of the clustering process. Number of full context labels is 29395. (Columns: HMM state; No. terminal nodes.)

Figure 2: $\mathcal{L}_{inc}$ for each split iteration for all HMM states. (Horizontal axis: splits.)

Figure 3: Decision tree generated for the first HMM state. The terms "C-" and "L-" mean respectively current and left context. Terminal nodes are represented by the yellow rectangles.

Figure 4: Impulse responses of filters $H_v(z)$ derived using the state configuration yielded by the proposed algorithm. (Axes: cluster; $n$.)

5.1. Applying the proposed algorithm to the speech data

5.1.1. Conditions

Residual signals were extracted from the speech database by inverse filtering using the same mel-cepstral coefficients employed to train the HMM synthesizer. Full context models were used to segment the residual signals at the HMM state level. Pulse trains were derived from pitch marks and eventually optimized for the residual segments through the procedure described in [1]. Residual segments were then clustered using the algorithm described in Table 1. The MDL criterion as described in Section 4.5 was used to stop tree growth.

5.1.2. Results

Figure 2 shows the evolution of $\mathcal{L}_{inc}$ given by (13) for all HMM state positions along the split steps. Table 2 displays the number of terminal nodes at the end of the process. A total of $S = 97$ clusters were created. It can be noticed that the central states are the ones clustered in most detail. This is expected since the central states include more segments. Figure 3 depicts the decision tree constructed for the first HMM state. The interesting point to emphasize here is that although a question set related to full context features was used by the clustering algorithm, questions related to the current phone and its left context were the ones mostly selected during the process. The same result was also observed for the other states. This might explain why the phonetic trees approach of Section 3.1 has been very effective for residual modeling. One can also see in Figure 3 that the voiced branch is clustered in more detail. This property was clearer for the central states. This larger granularity of the decision trees for voiced sounds was also expected.

5.2. Excitation modeling

The state configuration created by the clustering process was utilized to train an excitation model for the HMM-based synthesizer. Voiced and unvoiced filters were determined for each terminal node of the constructed trees through the algorithm described in [1]. Figure 4 shows the impulse responses of the final voiced filters. It can be noticed that convergence was achieved for most of the filters. This shows that the proposed algorithm was successful in clustering residual segments based on the context determined by the questions. On the other hand, a few visibly noisy impulse responses in Figure 4 suggest that bad segmentation and/or pitch marking mistakes may have affected some clusters. However, segmentation and pitch marking issues lie beyond the scope of this paper.

6. Conclusion

The proposed algorithm has been shown to be effective for clustering residual signals under the assumed excitation model, eliminating the rough approximation that has been applied so far with the use of trees for mel-cepstral coefficients. Although filters are computed during the clustering process, the algorithm works just as a state definer. A unified method for state clustering and filter optimization is in future plans.

7. References

[1] R. Maia, T. Toda, H. Zen, Y. Nankaku, and K. Tokuda, "An excitation model for HMM-based speech synthesis based on residual modeling," in SSW6, 2007.
[2] O. Abdel-Hamid, S. Abdou, and M. Rashwan, "Improving the Arabic HMM based speech synthesis quality," in ICSLP, 2006.
[3] J. Cabral, S. Renals, K. Richmond, and J. Yamagishi, "Towards an improved modeling of the glottal source in statistical parametric speech synthesis," in SSW6, 2007.
[4] T. Raitio, A. Suni, H. Pulakka, M. Vainio, and P. Alku, "HMM-based Finnish text-to-speech system using glottal inverse filtering," in Interspeech, 2008.
[5] H. Kawahara, J. Estill, and O. Fujimura, "Aperiodicity extraction and control using mixed mode excitation and group delay manipulation for a high quality speech analysis, modification and synthesis system STRAIGHT," in MAVEBA, 2001.
[6] H. Zen, T. Toda, M. Nakamura, and K. Tokuda, "Details of the Nitech HMM based speech synthesis for Blizzard Challenge 2005," IEICE Trans. on Inf. and Systems, vol. E90-D, Jan. 2007.
[7] R. Maia, T. Toda, K. Tokuda, S. Sakai, and S. Nakamura, "On the state for an excitation model in HMM-based speech synthesis," in ICASSP, 2008.
[8] J. D. Markel and A. H. Gray, Jr., Linear prediction of speech. Springer-Verlag, 1986.
[9] J. Rissanen, "Universal coding, information, prediction, and estimation," IEEE Trans. on Information Theory, vol. IT-30, July 1984.
