Voice Conversion Algorithm Based on Gaussian Mixture Model Applied to STRAIGHT


The Seventh Western Pacific Regional Acoustics Conference, Kumamoto, Japan, 3-5 October 2000

Tomoki TODA, Jinlin LU, Satoshi NAKAMURA, Kiyohiro SHIKANO
Graduate School of Information Science, Nara Institute of Science and Technology
8916-5 Takayama, Ikoma, Nara, 630-0101 JAPAN
Fax: +81-743-72-5289 E-mail: {tomoki-t, …

ABSTRACT

Voice conversion is a technique used to convert one speaker's voice into another speaker's voice. As a typical voice conversion algorithm, the codebook mapping method has been studied by Abe et al. The main shortcoming of this method is that the acoustic space of a speaker is limited to a discrete representation. To represent the acoustic space continuously, an algorithm based on the Gaussian mixture model (GMM) has also been proposed by Stylianou et al. In this paper, we apply this GMM-based voice conversion algorithm to STRAIGHT, proposed by Kawahara et al., which is recognized as a high-quality vocoder. To evaluate this voice conversion algorithm, we performed subjective and objective experiments on speaker individuality and speech quality, comparing it with the method based on codebook mapping. The results show that the GMM-based voice conversion algorithm performs better than the codebook mapping method. The effects of the amount of training data and of the number of Gaussian mixtures on the voice conversion algorithms were also investigated. These evaluation results clarify that the GMM-based voice conversion algorithm is successfully applied to STRAIGHT.

KEYWORDS: voice conversion, codebook mapping, Gaussian mixture model, STRAIGHT

INTRODUCTION

As a typical voice conversion algorithm, the codebook mapping method has been studied by Abe et al. [1]. The main shortcoming of this method is that the acoustic space of a speaker is limited to a discrete representation because vector quantization is used. To represent the acoustic space continuously, an algorithm based on the Gaussian Mixture Model (GMM) has also been proposed by Stylianou et al. [2]. In this GMM-based algorithm, the acoustic space of a speaker is modeled by the GMM, and acoustic features are converted from a source speaker to a target speaker by a mapping function based on the Gaussian mixtures.

Voice conversion is usually performed with an analysis-synthesis method, so the quality of the synthesized speech is also important for realizing a high-quality voice conversion algorithm. STRAIGHT (Speech Transformation and Representation using Adaptive Interpolation of weiGHTed spectrum), proposed by Kawahara et al., is an analysis-synthesis method that can synthesize high-quality speech [3].

In this paper, we apply the GMM-based voice conversion algorithm to STRAIGHT and evaluate it in comparison with the method based on codebook mapping.

VOICE CONVERSION ALGORITHM BASED ON GMM

We assume that $p$-dimensional time-aligned acoustic features $\mathbf{x} = [x_0, x_1, \ldots, x_{p-1}]^T$ (source speaker's) and $\mathbf{y} = [y_0, y_1, \ldots, y_{p-1}]^T$ (target speaker's) are determined by Dynamic Time Warping (DTW), where $T$ denotes transposition.

GMM. In the GMM algorithm, the probability distribution of acoustic features $\mathbf{x}$ can be written as

$$p(\mathbf{x}) = \sum_{i=1}^{m} \alpha_i N(\mathbf{x}; \mu_i, \Sigma_i), \qquad \sum_{i=1}^{m} \alpha_i = 1, \quad \alpha_i \ge 0, \tag{1}$$

where $N(\mathbf{x}; \mu, \Sigma)$ denotes the normal distribution with mean vector $\mu$ and covariance matrix $\Sigma$, $\alpha_i$ denotes the weight of class $i$, and $m$ denotes the number of Gaussian mixtures. Since the acoustic space of a speaker is modeled by the GMM without the use of vector quantization, the distortion of the represented acoustic space in the GMM-based algorithm is less than that in the codebook mapping method.

Mapping Function. The mapping function converting acoustic features of the source speaker to those of the target speaker is given by [2]

$$F(\mathbf{x}) = \sum_{i=1}^{m} h_i(\mathbf{x}) \left[ \mu_i^{y} + \Sigma_i^{yx} \left( \Sigma_i^{xx} \right)^{-1} \left( \mathbf{x} - \mu_i^{x} \right) \right], \qquad h_i(\mathbf{x}) = \frac{\alpha_i N(\mathbf{x}; \mu_i^{x}, \Sigma_i^{xx})}{\sum_{j=1}^{m} \alpha_j N(\mathbf{x}; \mu_j^{x}, \Sigma_j^{xx})}, \tag{2}$$

where $\mu_i^{x}$ and $\mu_i^{y}$ denote the mean vectors of class $i$ for the source and target speakers, respectively.
$\Sigma_i^{xx}$ denotes the covariance matrix of class $i$ for the source speaker, and $\Sigma_i^{yx}$ denotes the cross-covariance matrix of class $i$ for the source and target speakers. In this paper, these matrices are diagonal.

To estimate the parameters $(\alpha_i, \mu_i^{x}, \mu_i^{y}, \Sigma_i^{xx}, \Sigma_i^{yx})$, the probability distribution of the joint vectors $\mathbf{z} = [\mathbf{x}^T, \mathbf{y}^T]^T$ of the source and target speakers is represented by the GMM [4]. The covariance matrix $\Sigma_i^{z}$ and mean vector $\mu_i^{z}$ of class $i$ of the joint vectors are estimated by the EM algorithm and can be written as

$$\Sigma_i^{z} = \begin{bmatrix} \Sigma_i^{xx} & \Sigma_i^{xy} \\ \Sigma_i^{yx} & \Sigma_i^{yy} \end{bmatrix}, \qquad \mu_i^{z} = \begin{bmatrix} \mu_i^{x} \\ \mu_i^{y} \end{bmatrix}. \tag{3}$$

Since acoustic features are converted by this mapping function, which utilizes the correlation of feature parameters between the two speakers, the converted speech is represented more continuously than with the codebook mapping method.
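The conversion described by Eqs. (2) and (3) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`split_joint_params`, `posteriors`, `convert`) and the array layouts are hypothetical, the parameters are assumed to have been estimated already by EM on the joint vectors, and the diagonal matrices are stored as vectors of their diagonal entries.

```python
import numpy as np


def split_joint_params(mu_z, sigma_z):
    """Split the joint parameters of one class into the blocks of Eq. (3).

    mu_z: (2p,) mean of z = [x^T, y^T]^T; sigma_z: (2p, 2p) covariance.
    Only the diagonals of the xx and yx blocks are kept, following the
    paper's statement that these matrices are diagonal.
    """
    p = mu_z.shape[0] // 2
    mu_x, mu_y = mu_z[:p], mu_z[p:]
    var_xx = np.diag(sigma_z[:p, :p])  # diagonal of Sigma^xx
    cov_yx = np.diag(sigma_z[p:, :p])  # diagonal of Sigma^yx
    return mu_x, mu_y, var_xx, cov_yx


def posteriors(x, alpha, mu_x, var_xx):
    """Class posteriors h_i(x) of Eq. (2) for one frame x of shape (p,).

    alpha: (m,) weights; mu_x, var_xx: (m, p) means and diagonal variances.
    Evaluated in the log domain to avoid underflow (an implementation
    choice made here, not taken from the paper).
    """
    diff = x - mu_x
    log_n = -0.5 * np.sum(diff**2 / var_xx + np.log(2.0 * np.pi * var_xx), axis=1)
    log_w = np.log(alpha) + log_n
    log_w -= log_w.max()
    w = np.exp(log_w)
    return w / w.sum()


def convert(x, alpha, mu_x, mu_y, var_xx, cov_yx):
    """Mapping function F(x) of Eq. (2) with diagonal covariance blocks."""
    h = posteriors(x, alpha, mu_x, var_xx)
    # With diagonal matrices, Sigma^yx (Sigma^xx)^-1 is an element-wise ratio.
    per_class = mu_y + cov_yx / var_xx * (x - mu_x)  # shape (m, p)
    return h @ per_class
```

With a single class the posterior is always 1, so `convert` reduces to a per-dimension linear regression from source to target features. With several classes the output is a posterior-weighted blend of such regressions, which is what makes the mapping continuous rather than a codebook lookup.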

APPLICATION OF THE VOICE CONVERSION ALGORITHMS TO STRAIGHT

The cepstrum of the smoothed spectrum analyzed by STRAIGHT is used as the acoustic feature. In this paper, the cepstrum order is 40 (a quefrency of 2.5 ms at a sampling frequency of 16000 Hz). To perform voice conversion, the 1st to 40th order cepstrum coefficients are converted, and the 0th order cepstrum coefficient, which represents power, is kept at the value of the source speaker. As for the source information, the average of the log-scaled pitch frequencies of the source speaker is converted to that of the target speaker. The prosodic dynamic characteristics of both speakers are not considered.

EVALUATION

To evaluate the performance of the GMM-based voice conversion algorithm applied to STRAIGHT, we performed experiments on speaker individuality and speech quality, comparing it with the method based on codebook mapping. Male-to-male and female-to-female voice conversion was performed in each experiment.

Objective Evaluation Experiments on Speaker Individuality. To evaluate the converted speaker individuality of the GMM-based voice conversion algorithm, objective evaluation experiments were performed using the cepstrum distortion (CD) between the converted speech and the target speech. Ten sentences that were not included in the training data were used for evaluation.

First, to investigate the relation between the number of classes and the CD, CDs were calculated for the speech converted by both voice conversion algorithms. Fifty-eight sentences were used as the training data. The experimental result is shown in Fig. 1. For both voice conversion algorithms, the CD decreases as the number of classes increases. The CD performance of the GMM-based voice conversion algorithm is superior to that of the codebook mapping method.
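Two quantities from the preceding sections lend themselves to a short sketch: the shift of the average log-scaled pitch frequency, and the cepstrum distortion used as the objective measure. The paper does not give a formula for the CD, so the code below assumes the commonly used dB-scaled definition over time-aligned frames. The function names and the convention that 0 Hz marks an unvoiced frame are also assumptions made for illustration.

```python
import numpy as np


def convert_log_f0(f0_src, mean_log_f0_src, mean_log_f0_tgt):
    """Shift the log-F0 contour so that its mean matches the target speaker's.

    f0_src: (T,) F0 values in Hz, with 0 marking unvoiced frames (assumed
    convention). Only the average is changed; the contour shape is kept,
    in line with the paper not modelling prosodic dynamics.
    """
    f0_src = np.asarray(f0_src, dtype=float)
    f0_out = np.zeros_like(f0_src)
    voiced = f0_src > 0
    f0_out[voiced] = np.exp(
        np.log(f0_src[voiced]) - mean_log_f0_src + mean_log_f0_tgt
    )
    return f0_out


def cepstrum_distortion_db(c_conv, c_tgt):
    """Mean cepstrum distortion in dB between two time-aligned sequences.

    c_conv, c_tgt: (T, 40) arrays holding the 1st to 40th coefficients
    (the 0th coefficient, power, is excluded). The dB scaling below is an
    assumed, commonly used definition, not one stated in the paper.
    """
    diff = np.asarray(c_conv, dtype=float) - np.asarray(c_tgt, dtype=float)
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff**2, axis=1))
    return float(per_frame.mean())
```

In practice the two mean log-F0 values would be computed over the voiced frames of each speaker's training data, and the converted and target cepstrum sequences would first be aligned (for example by DTW) before the distortion is averaged.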
Next, to investigate the relation between the amount of training data and the CD, CDs were calculated for the speech converted by the GMM-based voice conversion algorithm (16, 64 classes) and the codebook mapping method (16, 64, 256, 1024 classes). The experimental result for the female-to-female voice conversion is shown in Fig. 2. The CD increases when the amount of training data is insufficient, because the parameters of the mapping function are not trained well enough. The result for the male-to-male voice conversion is similar to that for the female-to-female voice conversion.

Subjective Evaluation Experiments on Speech Quality. To evaluate the quality of the GMM-based converted speech, subjective evaluation experiments were performed. Eight listeners participated in the experiments. The opinion score was set on a 5-point scale (5: excellent, 4: good, 3: fair, 2: poor, 1: bad). Four sentences that were not included in the training data were used for evaluation.

First, to investigate the relation between the number of classes and speech quality, the speech converted by the GMM-based voice conversion algorithm (16, 64 classes) and the codebook mapping method (16, 64, 256, 1024 classes) was used. Fifty-eight sentences were used as the training data. The experimental result is shown in Fig. 3. For both voice conversion algorithms, speech quality improves as the number of classes increases. The performance of the GMM-based voice conversion algorithm is superior to that of the codebook mapping method.

Fig. 1: Relation between the number of classes and CD. (Horizontal axis: number of classes.)

Fig. 2: Relation between the amount of training data and CD. (Horizontal axis: number of sentences used for training.)

Fig. 3: Relation between speech quality and the number of classes.

Fig. 4: Relation between speech quality and the amount of training data on GMM (64 classes).

Next, to investigate the relation between the amount of training data and speech quality, the speech converted by the GMM-based voice conversion algorithm (64 classes) was used. The experimental result is shown in Fig. 4. Speech quality improves as the amount of training data increases. When the amount of training data is insufficient, speech quality is also low, because the parameters of the mapping function are not trained well enough.

CONCLUSIONS

We applied the voice conversion algorithm based on the Gaussian Mixture Model (GMM) to STRAIGHT and evaluated it. We performed evaluation experiments on speaker individuality and speech quality, comparing it with the method based on codebook mapping. As a result, the performance of the GMM-based voice conversion algorithm is better than that of the codebook mapping method. The effects of the amount of training data and of the number of Gaussian mixtures on the voice conversion algorithms were also investigated. These evaluation results clarify that the performance improves as the number of mixtures and the amount of training data increase.

REFERENCES

1. M. Abe, S. Nakamura, K. Shikano, H. Kuwabara, "Voice conversion through vector quantization," Proc. ICASSP, pp. 655-658, 1988.
2. Y. Stylianou, O. Cappé, and E. Moulines, "Statistical methods for voice quality transformation," Proc. EUROSPEECH, pp. 447-450, 1995.
3. H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigné, "Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds," Speech Communication, 27, pp. 187-207, 1999.
4. A. Kain, M.W. Macon, "Spectral voice conversion for text-to-speech synthesis," Proc. ICASSP, pp. 285-288, 1998.
