On the Use of Phonetic Information for Mapping from Articulatory Movements to Vocal Tract Spectrum


Kenichi Nakamura†, Tomoki Toda‡, Yoshihiko Nankaku†, Keiichi Tokuda†

†Nagoya Institute of Technology, Gokiso-cho, Showa-ku, Nagoya, Aichi, 466-8555 Japan
‡Nara Institute of Science and Technology, 8916-5 Takayama, Ikoma, Nara, 630-0101 Japan
{k-n, nankaku, tokuda}@ics.nitech.ac.jp, [email protected]

ABSTRACT

This paper describes a method for determining the vocal tract spectrum from articulatory movements using hidden Markov models (HMMs). In the proposed system, articulatory parameters are generated from a TTS system and converted to acoustic features to be synthesized. Compared with conventional GMM-based systems, the proposed system has two additional properties: 1) phonetic information given by the input text is available for the conversion, and 2) the use of HMMs allows us to utilize the temporal structure of speech. In this paper, we investigate the optimal structure of HMMs for the conversion. Experimental results show that using phonetic and temporal information improves the mapping accuracy in terms of a spectral distortion measure.

1. INTRODUCTION

Many attempts to synthesize speech based on speech production mechanisms, which are ignored in concatenative synthesis, have been studied for several decades. In these approaches, the speech signal is generated from articulatory parameters [1] by a mathematical production model in which speech is characterized by the properties of the vocal apparatus instead of the speech acoustics. Slowly varying articulatory parameters are better candidates as features for speech modeling. Furthermore, the speech signal can be modified in an understandable way by manipulating articulatory parameters rather than acoustic parameters such as the vocal tract spectrum.

Figure 1 shows the proposed TTS system with articulatory parameter conversion. First, F0 and articulatory parameters are generated by the HMM-based TTS [2], which can easily be constructed by using articulatory parameters as training data. Then, the articulatory parameters are modified to realize various speaking styles and converted to spectrum features, e.g., mel-cepstrum. Finally, speech waveforms are synthesized from the converted parameters and F0 by using a speech synthesis filter.

Fig. 1. TTS with articulatory parameter conversion.

In the conversion system, the mapping between acoustic and articulatory features is statistically determined using a parallel acoustic-articulatory speech database [3]. As ways to implement the transformation function for converting articulatory parameters to speech, the GMM-based system [4] and the HMM-based system [5] have been proposed. However, a detailed comparison between these systems has not been performed. Compared with the GMM-based system, the proposed system has two additional properties: 1) phonetic information given by the input text is available for the conversion process, and 2) the use of HMMs allows us to utilize the temporal structure of speech.

In this paper, we present a method for determining speech acoustics from articulatory movements using multi-mixture HMMs. We investigate the optimal structure of multi-mixture HMMs with context clustering [6]. To vary the degree to which the phonetic information and the temporal structure of HMMs are used, we change the size of the decision trees in context clustering, the number of HMM states, and the number of mixture components while keeping the total number of model parameters fixed.

The rest of this paper is organized as follows. In the following section, we introduce the MOCHA database. In Section 3, the HMM-based speech conversion system is described. The maximum likelihood spectral estimation using dynamic features is applied to the HMM-based mapping in Section 4, and speech synthesis with the estimated spectral sequence is described in Section 5. Finally, we summarize this paper in Section 6.

This work was partly supported by MEXT e-Society project.
1-4244-0469-X/06/$20.00 ©2006 IEEE. ICASSP 2006.

2. ACOUSTIC-ARTICULATORY SPEECH DATABASE: MOCHA

The Multichannel articulatory database (MOCHA) [3] has been released by the University of Edinburgh. It consists of speech and several kinds of articulatory movements recorded simultaneously at Queen Margaret University College. We use electromagnetic articulograph (EMA) data, one of the representations of articulatory data provided in MOCHA, as the articulatory representation. The movements of seven articulators (top lip, bottom lip, bottom incisor, tongue tip, tongue body, tongue dorsum, and velum) and two reference points (the bridge of the nose and the upper incisor) are sampled in the midsagittal plane at 500 Hz. Each articulatory location is given by x- and y-coordinates, so the movements of the seven articulators are represented as a 14-dimensional vector sequence.

3. CONVERSION SYSTEM OVERVIEW

In the HMM-based conversion system, we construct feature vectors using parameters obtained from the parallel acoustic and articulatory speech database. The feature vector consists of mel-cepstral coefficients as spectral parameters, EMA data as articulatory parameters, and their delta and delta-delta parameters. The joint probability densities of articulatory and acoustic spectral parameters are modeled by the HMM using these feature vectors.

In the training stage, monophone HMMs are first estimated by isolated training followed by embedded training. After conversion to context-dependent HMMs, they are re-estimated by embedded training. To avoid inaccurate estimates caused by the limited amount of data, we apply the tree-based context clustering technique [6].

In the conversion stage, the text to be synthesized is first converted to a context-dependent label sequence. Then, the sentence HMM is constructed by concatenating context-dependent HMMs according to the label sequence. Articulatory parameters are converted to spectrum features based on maximum likelihood estimation. Finally, a speech waveform is synthesized from the generated parameters by using a speech synthesis filter.

4. TRAINING JOINT PROBABILITY DISTRIBUTION WITH HMM

To convert the articulatory parameters to acoustic ones, the joint probability densities over the two features are trained using the HMM. Moreover, by using multiple mixtures the proposed system can represent probability densities more precisely than the conventional HMM-based system, which uses a single mixture [5].

Let $X_t$ and $Y_t$ be the articulatory and acoustic feature vectors, respectively. Let the vector $Z_t = [X_t^\top, Y_t^\top]^\top$ be the joint feature of these two features; its vector sequence $Z = [Z_1^\top, Z_2^\top, \ldots, Z_T^\top]^\top$ is modeled by the HMM $\lambda$. The output probability of $Z$ given the HMM can be written as follows:

$$p(Z \mid \lambda) = \sum_{\text{all } q} \sum_{\text{all } m} p(q \mid \lambda)\, p(m \mid q, \lambda) \prod_{t=1}^{T} p(Z_t \mid m_t, q_t, \lambda) \tag{1}$$

where $q = (q_1, q_2, \ldots, q_T)$ is a state sequence of the HMM and $m = (m_1, m_2, \ldots, m_T)$ denotes a component number sequence of the mixture distributions. The probabilities $p(q \mid \lambda)$ and $p(m \mid q, \lambda)$ denote the state transition probability and the mixture weights of the output probability, respectively. In this paper, the mixture component is assumed to be a Gaussian distribution:

$$p(Z_t \mid m_t = i, q_t = j, \lambda) = \mathcal{N}\left(Z_t;\ \mu_{i,j}^{(Z)}, \Sigma_{i,j}^{(Z)}\right) \tag{2}$$

$$\mu_{i,j}^{(Z)} = \begin{bmatrix} \mu_{i,j}^{(X)} \\ \mu_{i,j}^{(Y)} \end{bmatrix}, \qquad \Sigma_{i,j}^{(Z)} = \begin{bmatrix} \Sigma_{i,j}^{(XX)} & \Sigma_{i,j}^{(XY)} \\ \Sigma_{i,j}^{(YX)} & \Sigma_{i,j}^{(YY)} \end{bmatrix} \tag{3}$$

where $\mu$ and $\Sigma$ denote a mean vector and a covariance matrix, respectively. Under this condition, the parameters of the HMM $\lambda$ are estimated via the EM algorithm.

4.1. Maximum likelihood spectral estimation

In the maximum likelihood (ML) spectral estimation, given the articulatory features $X = [X_1^\top, X_2^\top, \ldots, X_T^\top]^\top$ as an input, the optimal spectral features $Y = [Y_1^\top, Y_2^\top, \ldots, Y_T^\top]^\top$ are obtained by maximizing the following conditional probability:

$$p(Y \mid X, \lambda) = \sum_{\text{all } q} \sum_{\text{all } m} p(q \mid X, \lambda)\, p(m \mid q, X, \lambda) \prod_{t=1}^{T} p(Y_t \mid X_t, m_t, q_t, \lambda) \tag{4}$$

where the output probability distribution is written as follows:

$$p(Y_t \mid X_t, q_t = j, m_t = i, \lambda) = \mathcal{N}\left(Y_t;\ E_{i,j}(t), D_{i,j}\right) \tag{5}$$

and

$$E_{i,j}(t) = \mu_{i,j}^{(Y)} + \Sigma_{i,j}^{(YX)} \Sigma_{i,j}^{(XX)-1} \left(X_t - \mu_{i,j}^{(X)}\right) \tag{6}$$

$$D_{i,j} = \Sigma_{i,j}^{(YY)} - \Sigma_{i,j}^{(YX)} \Sigma_{i,j}^{(XX)-1} \Sigma_{i,j}^{(XY)} \tag{7}$$

Furthermore, the posterior state transition probability $p(q \mid X, \lambda)$ and the mixture weight $p(m \mid q, X, \lambda)$ are also calculated using the articulatory parameters $X$ and the model parameters $\lambda$. Since equation (4) includes hidden variables, the optimal sequence of $Y$ is estimated via the EM algorithm. The EM algorithm is an iterative method for approximating the maximum likelihood estimate. It maximizes the expectation of the complete-data log-likelihood, the so-called Q-function (auxiliary function):

$$Q(Y, \hat{Y}) = \sum_{\text{all } q} \sum_{\text{all } m} p(m, q \mid Y, X, \lambda) \log p(\hat{Y}, m, q \mid X, \lambda) \tag{8}$$

Taking the derivative of the Q-function, the spectral sequence $\hat{Y}$ which maximizes the Q-function is given by

$$\hat{Y} = \left(\overline{D^{-1}}\right)^{-1} \overline{D^{-1}E} \tag{9}$$

where

$$\overline{D^{-1}} = \mathrm{diag}\left[\overline{D_1^{-1}}, \overline{D_2^{-1}}, \ldots, \overline{D_T^{-1}}\right] \tag{10}$$

$$\overline{D^{-1}E} = \left[\overline{D_1^{-1}E_1}^{\top}, \overline{D_2^{-1}E_2}^{\top}, \ldots, \overline{D_T^{-1}E_T}^{\top}\right]^{\top} \tag{11}$$

$$\overline{D_t^{-1}} = \sum_{j=1}^{N} \sum_{i=1}^{M_j} \gamma_{i,j}(t)\, D_{i,j}^{-1} \tag{12}$$

$$\overline{D_t^{-1}E_t} = \sum_{j=1}^{N} \sum_{i=1}^{M_j} \gamma_{i,j}(t)\, D_{i,j}^{-1} E_{i,j}(t) \tag{13}$$
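Because the matrix in equation (10) is block diagonal, equation (9) decouples into one small linear system per frame. The following is a minimal sketch of that per-frame computation, not the authors' implementation: the function names `conditional_gaussian` and `ml_frame_estimate` are hypothetical, the toy numbers are invented for illustration, and the occupancy weights $\gamma_{i,j}(t)$ of equation (14) are simply passed in rather than computed by the forward-backward algorithm.

```python
import numpy as np


def conditional_gaussian(x_t, mu_x, mu_y, S_xx, S_xy, S_yx, S_yy):
    """Equations (6) and (7): mean E and covariance D of Y_t given X_t for one component."""
    gain = S_yx @ np.linalg.inv(S_xx)
    E = mu_y + gain @ (x_t - mu_x)
    D = S_yy - gain @ S_xy
    return E, D


def ml_frame_estimate(gammas, Es, Ds):
    """Equations (9), (12), (13) for one frame: occupancy-weighted combination of components."""
    D_inv_bar = sum(g * np.linalg.inv(D) for g, D in zip(gammas, Ds))
    D_inv_E_bar = sum(g * np.linalg.inv(D) @ E for g, E, D in zip(gammas, Es, Ds))
    return np.linalg.solve(D_inv_bar, D_inv_E_bar)


# Toy component (invented values): 2-dimensional articulatory vector,
# 1-dimensional acoustic vector.
E, D = conditional_gaussian(
    x_t=np.array([2.0, 0.0]),
    mu_x=np.zeros(2), mu_y=np.array([1.0]),
    S_xx=np.eye(2), S_xy=np.array([[0.5], [0.0]]),
    S_yx=np.array([[0.5, 0.0]]), S_yy=np.array([[1.0]]),
)

# Two equally weighted components with equal covariance:
# the estimate is the average of the two conditional means.
y_t = ml_frame_estimate(
    [0.5, 0.5],
    [np.array([0.0]), np.array([2.0])],
    [np.eye(1), np.eye(1)],
)
```

In a full EM loop the weights would be recomputed from the new estimate and the two steps repeated; the sketch covers a single update only.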

Here the occupancy probability $\gamma_{i,j}(t)$ is defined as

$$\gamma_{i,j}(t) = p(q_t = j \mid X, Y, \lambda)\, p(m_t = i \mid q_t = j, X, Y, \lambda) \tag{14}$$

and can be calculated by the forward-backward algorithm. Using the updated probabilities $\gamma_{i,j}(t)$, a new vector sequence $\hat{Y}$ is calculated by equation (9), and then $\hat{Y}$ is substituted for $Y$. This procedure is repeated until a convergence condition is satisfied.

4.2. Maximum likelihood spectral estimation using dynamic features

In this paper, we estimate the spectral feature sequence using dynamic features, as described in [4, 5]. Let $X_t = [x_t^\top, \Delta x_t^\top, \Delta^2 x_t^\top]^\top$ and $Y_t = [y_t^\top, \Delta y_t^\top, \Delta^2 y_t^\top]^\top$ be an articulatory feature and an acoustic feature, respectively, where $x_t$ and $y_t$ denote static features, and $\Delta$ and $\Delta^2$ represent first- and second-order dynamic features calculated from the neighboring frames of time $t$. The relation between the static spectral sequence $y = [y_1^\top, y_2^\top, \ldots, y_T^\top]^\top$ and the static-dynamic features $Y$ can be written as the following linear transformation:

$$Y = W y \tag{15}$$

where $W$ is a matrix which appends the dynamic features to the static feature sequence $y$. Under this relation, the static feature vector sequence $\hat{y}$ which maximizes equation (8) is given by

$$\hat{y} = \left(W^\top \overline{D^{-1}}\, W\right)^{-1} W^\top \overline{D^{-1}E} \tag{16}$$

As with equation (9), the update is iterated until a convergence condition is satisfied.

5. EXPERIMENT

5.1. Experimental conditions

We investigated the effectiveness of using phonetic and temporal information by varying the importance of these properties. The acoustic-articulatory data described in Section 2 was used. Experimental conditions are shown in Table 1.

Table 1. Experimental conditions [table body missing; surviving entry: location (2-dimensional coordinate)]

To investigate the mapping accuracy of the HMMs, we fixed the total number of HMM parameters and distributed them in various ways. The importance of temporal information is represented by the number of HMM states, and that of phonetic information by the size of the decision tree in context clustering. In context clustering, a single large tree covering all the phone HMMs was constructed for each temporal HMM state, which allows parameter sharing among different phone HMMs. Furthermore, to assign the optimal number of mixtures to each state (cluster), we apply the following procedure:

1. Construct a root node for all states of all HMMs.
2. Apply the questions which divide all temporal HMM states.
3. Perform the context clustering until the predetermined number of clusters is generated.
4. Back off the tree in the reverse order of the divisions until the designed tree size is reached.
5. In each new leaf node obtained in step 4, the nodes of its subtree are used as the mixture components, and their weights are determined by the occupancy counts of the training data.

The variance parameters of the HMMs were trained as diagonal covariances, and after the context clustering they were estimated by embedded training as full covariance matrices.

In Section 4.2, we presented the process which iteratively estimates a spectral feature sequence and the posterior probability distributions of the state transitions and the mixture components. However, in these experiments we use the state alignment generated from the natural articulatory-acoustic data, hence only the posterior probabilities of the mixture components were re-estimated iteratively. To focus on the spectral conversion, F0 sequences automatically extracted from natural speech were used for synthesizing speech.

The mel-cepstral distortion between the target and the estimated mel-cepstrum, given by the following equation, was used as the evaluation measure:

$$\mathrm{MelCD}\ [\mathrm{dB}] = \frac{10}{\ln 10} \sqrt{2 \sum_{i} \left(mc_i^{(t)} - mc_i^{(e)}\right)^2} \tag{17}$$

where $mc_i^{(t)}$ and $mc_i^{(e)}$ denote the $i$-th coefficient of the target and the estimated mel-cepstrum, respectively.

5.2. Experimental results
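The results in this section rest on two computations defined earlier: the dynamic-feature estimate of equation (16) and the distortion measure of equation (17). The sketch below shows one way they could be written. It is illustrative only and rests on several assumptions that are not stated in the paper: the delta and delta-delta window coefficients in `WINDOWS`, the dropping of window taps that fall outside the utterance, the $10/\ln 10$ dB scaling in the distortion, and the averaging of the distortion over frames. The function names are hypothetical.

```python
import numpy as np

# Assumed window taps at frames t-1, t, t+1 (the paper does not list them).
WINDOWS = np.array([
    [0.0, 1.0, 0.0],   # static
    [-0.5, 0.0, 0.5],  # first-order dynamic (delta)
    [1.0, -2.0, 1.0],  # second-order dynamic (delta-delta)
])


def build_W(num_frames, dim):
    """Matrix W of equation (15): Y = W y, each frame stacked as [static, delta, delta-delta]."""
    W1 = np.zeros((3 * num_frames, num_frames))
    for t in range(num_frames):
        for k, window in enumerate(WINDOWS):
            for offset, coeff in zip((-1, 0, 1), window):
                s = t + offset
                if 0 <= s < num_frames:  # taps outside the utterance are dropped
                    W1[3 * t + k, s] = coeff
    # Apply the same scalar windows to every feature dimension.
    return np.kron(W1, np.eye(dim))


def ml_static_estimate(W, D_inv_bar, D_inv_E_bar):
    """Equation (16): solve (W^T D_inv_bar W) y = W^T D_inv_E_bar for y."""
    return np.linalg.solve(W.T @ D_inv_bar @ W, W.T @ D_inv_E_bar)


def mel_cepstral_distortion(mc_target, mc_estimated):
    """Equation (17) per frame, averaged over frames; inputs have shape (frames, coefficients)."""
    diff = np.asarray(mc_target, dtype=float) - np.asarray(mc_estimated, dtype=float)
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))


# Toy check: with identity precision and targets generated by W itself,
# the estimate recovers the static sequence and the distortion is near zero.
T, d = 5, 2
y_true = np.arange(T * d, dtype=float)
W = build_W(T, d)
y_hat = ml_static_estimate(W, np.eye(3 * T * d), W @ y_true)
distortion = mel_cepstral_distortion(y_true.reshape(T, d), y_hat.reshape(T, d))
```

A dense `W` is used here for readability; for realistic utterance lengths a banded or sparse representation would be the more practical choice.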
To investigate only the effectiveness of phonetic information, we apply the context clustering to the GMM-based mapping. Figure 2 shows the MelCD of the GMM-based mapping with context clustering, which is equivalent to the multi-mixture HMM-based mapping with the number of states set to one, so that temporal information cannot be modeled. As the decision tree becomes larger, phonetic information is used more heavily; the left end of the graph indicates the mapping without phonetic information, which is equivalent to the conventional GMM-based method.

Fig. 2. MelCD of the GMM-based mapping with context clustering (# of HMM states = 1). Horizontal axis: tree size.

It is observed that the mapping accuracy without phonetic information decreases as the number of mixtures decreases. However, the mapping using phonetic information (with a tree size of around 64) achieved high accuracy even with fewer parameters. Furthermore, the result of 64 Gaussians with phonetic information is superior to the conventional GMM mapping with 1024 Gaussians. These results show that phonetic information is useful for converting articulatory features to acoustic ones.

We next investigate the effectiveness of introducing temporal information to the GMM-based mapping. Figure 3 shows the MelCD of the multi-mixture HMM-based mapping, where the total number of Gaussian distributions is fixed to 512. It can be seen that the mapping accuracy can be improved by using temporal information. However, the use of too many HMM states degrades the performance, which may be due to inadequate state alignments. The tree size that achieved the highest accuracy for each number of HMM states tends to increase as the number of HMM states increases. We suppose that, independently of the number of HMM states, a similar number of clusters is required for each HMM state to represent its context dependency. This result suggests that the simultaneous use of phonetic and temporal information is effective for the conversion system.

Fig. 3. MelCD of the HMM-based mapping (total # of Gaussians = 512). Horizontal axis: tree size.

6. CONCLUSION

In this paper, we examined the effectiveness of using phonetic and temporal information for converting articulatory movements to vocal tract spectrum. In the objective evaluation, it was confirmed that mapping accuracy is improved by using both phonetic and temporal information. Future work includes investigating more effective contexts for articulatory-acoustic conversion. Constructing a TTS system for synthesizing articulatory features and conducting listening tests are also future work.

Acknowledgments: The authors are grateful to Dr. Heiga Zen for helpful discussions.

7. REFERENCES

[1] M. M. Sondhi, "Articulatory modeling: a possible role in concatenative text-to-speech synthesis," IEEE 2002 Workshop on Speech Synthesis, Sept. 2002.
[2] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, "Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis," Eurospeech, vol. 5, pp. 2347-2350, Sept. 1999.
[3] A. Wrench, "http://www.cstr.ed.ac.uk/artic/mocha.html," Queen Margaret University College, 1999.
[4] T. Toda, A. W. Black, and K. Tokuda, "Mapping from articulatory movements to vocal tract spectrum with Gaussian mixture model for articulatory speech synthesis," 5th ISCA Speech Synthesis Workshop, Pittsburgh, pp. 31-36, June 2004.
[5] S. Hiroya and M. Honda, "Estimation of articulatory movements from speech acoustics using an HMM-based speech production model," IEEE Transactions on Speech and Audio Processing, pp. 175-184, Mar. 2004.
[6] J. J. Odell, The Use of Context in Large Vocabulary Speech Recognition, Ph.D. thesis, Cambridge University, 1995.
