Statistical approach to enhancing esophageal speech based on Gaussian mixture models

Hironori Doi, Keigo Nakamura, Tomoki Toda, Hiroshi Saruwatari, and Kiyohiro Shikano

Graduate School of Information Science, Nara Institute of Science and Technology, Japan
E-mail: {hironori-d, kei-naka, tomoki, sawatari, shikano}@is.naist.jp

ABSTRACT

This paper presents a novel method of enhancing esophageal speech using statistical voice conversion. Esophageal speech is one of the alternative speaking methods for laryngectomees. Although it doesn't require any external devices, generated voices sound unnatural. To improve the intelligibility and naturalness of esophageal speech, we propose a voice conversion method from esophageal speech into normal speech. A spectral parameter and excitation parameters of target normal speech are separately estimated from a spectral parameter of the esophageal speech based on Gaussian mixture models. The experimental results demonstrate that the proposed method yields significant improvements in intelligibility and naturalness. We also apply one-to-many eigenvoice conversion to esophageal speech enhancement for flexibly controlling enhanced voice quality.

Index Terms: laryngectomees, esophageal speech, speech enhancement, voice conversion, eigenvoice conversion

This work was supported in part by MIC SCOPE. The authors are grateful to Prof. Hideki Kawahara of Wakayama University, Japan, for permission to use the STRAIGHT analysis-synthesis method.

978-1-4244-4296-6/10/$25.00 ©2010 IEEE. ICASSP 2010.

1. INTRODUCTION

People who have undergone a total laryngectomy due to an accident or laryngeal cancer cannot produce speech sounds because their vocal folds have been removed. Esophageal speech is one of the alternative speaking methods for laryngectomees. Excitation sounds are produced by releasing gases from or through the esophagus, and then they are articulated for generating esophageal speech. Esophageal speech sounds more natural than speech generated by other alternative speaking methods such as electrolaryngeal speech. However, degradation of the intelligibility and naturalness in esophageal speech is caused by several factors such as specific noises and relatively low fundamental frequency. Consequently, esophageal speech sounds unnatural compared with normal speech.

Some approaches have been proposed to enhance esophageal speech based on the modification of its acoustic features, e.g., using comb filtering [1] or smoothing [2]. However, since the acoustic features of esophageal speech exhibit quite different properties from those of normal speech, it is basically difficult to compensate for the acoustic differences between them using those simple modification processes. More sophisticated and complicated processes are essential to dramatically enhance esophageal speech.

In this paper, we propose a statistical approach to enhancing esophageal speech using voice conversion (VC) [3, 4]. Our proposed method converts esophageal speech into normal speech in a probabilistic manner (Esophageal-Speech-to-Speech: ES-to-Speech). We train Gaussian mixture models (GMMs) of the joint probability densities between the acoustic features of esophageal speech and those of normal speech in advance using parallel data consisting of utterance-pairs of the esophageal speech and the target normal speech. Any esophageal speech sample is converted using the trained GMMs so that it sounds like normal speech. Because the converted speech is generated from statistics extracted from normal speech, this conversion process is expected to remove the specific noise sounds effectively and improve the $F_0$ pattern of the esophageal speech. However, the converted speech basically sounds like voices uttered by a different speaker from the laryngectomee. To make it possible to flexibly control the converted voice quality, we also apply one-to-many eigenvoice conversion (EVC) [5] to ES-to-Speech. The one-to-many EVC is a conversion method from a single source speaker into any arbitrary target speakers. This method allows us to control the converted voice quality by manipulating a small number of parameters or to flexibly adapt the conversion model for the given speech samples. This method seems very effective for helping laryngectomees to speak in their favorite voices or special voices sounding like their own voices that have already been lost.

2. ESOPHAGEAL SPEECH

Figure 1 shows an example of speech waveforms, spectrograms, and $F_0$ contours of normal speech uttered by a non-laryngectomee and esophageal speech uttered by a laryngectomee in the same sentence. We can see that acoustic features of esophageal speech are considerably different from those of normal speech. Esophageal speech often includes some specific noisy sounds, which can be easily observed in the silence parts in the figure. These noises are produced through a process of generating excitation sounds, i.e., pumping air into the esophagus and the stomach and releasing air from them. Waveform envelope and spectral components of esophageal speech don't vary over an utterance as smoothly as those of normal speech. These unstable and unnatural variations cause the unnatural sounds of esophageal speech. Moreover, the pitch of esophageal speech is generally lower and less stable than that of normal speech. Consequently, a usual $F_0$ analysis process for normal speech often fails at $F_0$ extraction and the unvoiced/voiced decision in esophageal speech. These characteristics of esophageal speech cause severe degradation of analysis-synthesized speech quality.

The intelligibility and naturalness of esophageal speech strongly depend on the skill of individual laryngectomees in producing esophageal speech. However, some specific noises are essentially difficult to remove because they are caused by the production mechanism of esophageal speech.

3. VOICE CONVERSION ALGORITHM

We describe a conversion method based on maximum likelihood estimation of speech parameter trajectories considering a global variance (GV) [4] as one of the state-of-the-art statistical VC methods. This method consists of a training and conversion process. Moreover, we also describe one-to-many EVC [5] as a technique for flexibly controlling the converted speech quality.

3.1. Training Process

Let us assume an input static feature vector $\boldsymbol{x}_t = [x_t(1), \ldots, x_t(D_x)]^\top$ and an output static feature vector $\boldsymbol{y}_t = [y_t(1), \ldots, y_t(D_y)]^\top$ at frame $t$, where $\top$ denotes transposition of the vector.

As an input speech parameter vector, we use $\boldsymbol{X}_t$ to capture contextual features of source speech, e.g., the joint static and dynamic feature vector or the concatenated feature vector from multiple frames. As an output speech feature vector, we use $\boldsymbol{Y}_t = [\boldsymbol{y}_t^\top, \Delta\boldsymbol{y}_t^\top]^\top$ consisting of static and dynamic features. Using a parallel training data set consisting of time-aligned input and output parameter vectors $[\boldsymbol{X}_1^\top, \boldsymbol{Y}_1^\top]^\top, [\boldsymbol{X}_2^\top, \boldsymbol{Y}_2^\top]^\top, \ldots, [\boldsymbol{X}_T^\top, \boldsymbol{Y}_T^\top]^\top$, where $T$ denotes the total number of frames, the joint probability density of the input and output parameter vectors is modeled by a GMM [6] as follows:

$$P(\boldsymbol{X}_t, \boldsymbol{Y}_t \mid \boldsymbol{\lambda}) = \sum_{m=1}^{M} w_m \, \mathcal{N}\!\left([\boldsymbol{X}_t^\top, \boldsymbol{Y}_t^\top]^\top; \boldsymbol{\mu}_m^{(X,Y)}, \boldsymbol{\Sigma}_m^{(X,Y)}\right), \tag{1}$$

$$\boldsymbol{\mu}_m^{(X,Y)} = \begin{bmatrix} \boldsymbol{\mu}_m^{(X)} \\ \boldsymbol{\mu}_m^{(Y)} \end{bmatrix}, \qquad \boldsymbol{\Sigma}_m^{(X,Y)} = \begin{bmatrix} \boldsymbol{\Sigma}_m^{(XX)} & \boldsymbol{\Sigma}_m^{(XY)} \\ \boldsymbol{\Sigma}_m^{(YX)} & \boldsymbol{\Sigma}_m^{(YY)} \end{bmatrix}, \tag{2}$$

where $\mathcal{N}(\cdot; \boldsymbol{\mu}, \boldsymbol{\Sigma})$ denotes the Gaussian distribution with a mean vector $\boldsymbol{\mu}$ and a covariance matrix $\boldsymbol{\Sigma}$. The mixture component index is $m$. The total number of mixture components is $M$. A parameter set of the GMM is $\boldsymbol{\lambda}$, which consists of weights $w_m$, mean vectors $\boldsymbol{\mu}_m^{(X,Y)}$, and full covariance matrices $\boldsymbol{\Sigma}_m^{(X,Y)}$ for individual mixture components.

The probability density of the GV $\boldsymbol{v}(\boldsymbol{y})$ of the output static feature vectors $\boldsymbol{y} = [\boldsymbol{y}_1^\top, \ldots, \boldsymbol{y}_T^\top]^\top$ over an utterance is also modeled by a Gaussian distribution,

$$P(\boldsymbol{v}(\boldsymbol{y}) \mid \boldsymbol{\lambda}^{(v)}) = \mathcal{N}\!\left(\boldsymbol{v}(\boldsymbol{y}); \boldsymbol{\mu}^{(v)}, \boldsymbol{\Sigma}^{(v)}\right), \tag{3}$$

where the GV $\boldsymbol{v}(\boldsymbol{y}) = [v(1), \ldots, v(D_y)]^\top$ is calculated by

$$v(d) = \frac{1}{T} \sum_{t=1}^{T} \left( y_t(d) - \frac{1}{T} \sum_{\tau=1}^{T} y_\tau(d) \right)^2. \tag{4}$$

A parameter set $\boldsymbol{\lambda}^{(v)}$ consists of a mean vector $\boldsymbol{\mu}^{(v)}$ and a diagonal covariance matrix $\boldsymbol{\Sigma}^{(v)}$.

3.2. Conversion Process

Let $\boldsymbol{X} = [\boldsymbol{X}_1^\top, \ldots, \boldsymbol{X}_t^\top, \ldots, \boldsymbol{X}_T^\top]^\top$ and $\boldsymbol{Y} = [\boldsymbol{Y}_1^\top, \ldots, \boldsymbol{Y}_t^\top, \ldots, \boldsymbol{Y}_T^\top]^\top$ be a time sequence of the input feature vectors and that of the output feature vectors, respectively. The converted static feature sequence $\hat{\boldsymbol{y}} = [\hat{\boldsymbol{y}}_1^\top, \ldots, \hat{\boldsymbol{y}}_t^\top, \ldots, \hat{\boldsymbol{y}}_T^\top]^\top$ is determined by maximizing the following objective function,

$$\hat{\boldsymbol{y}} = \operatorname*{argmax}_{\boldsymbol{y}} \; P(\boldsymbol{Y} \mid \boldsymbol{X}, \boldsymbol{\lambda})^{\omega} \, P(\boldsymbol{v}(\boldsymbol{y}) \mid \boldsymbol{\lambda}^{(v)}) \quad \text{subject to} \quad \boldsymbol{Y} = \boldsymbol{W}\boldsymbol{y}, \tag{5}$$

where $\boldsymbol{W}$ is a window matrix to extend the static feature vector sequence into the joint feature vector sequence of static and dynamic features [7]. A balance between $P(\boldsymbol{Y} \mid \boldsymbol{X}, \boldsymbol{\lambda})$ and $P(\boldsymbol{v}(\boldsymbol{y}) \mid \boldsymbol{\lambda}^{(v)})$ is controlled by the weight $\omega$.

[Fig. 1. Example of waveforms, spectrograms, and $F_0$ contours.]

3.3. One-to-Many EVC

An eigenvoice GMM (EV-GMM) models the joint probability density in the same manner as shown in Eqs. (1) and (2), except for a definition of the target mean vector written as

$$\boldsymbol{\mu}_m^{(Y)} = \boldsymbol{A}_m \boldsymbol{w} + \boldsymbol{b}_m, \tag{6}$$

where $\boldsymbol{b}_m$ and $\boldsymbol{A}_m = [\boldsymbol{a}_m(1), \ldots, \boldsymbol{a}_m(J)]$ are a bias vector and eigenvectors $\boldsymbol{a}_m(j)$ for the $m$-th mixture component, respectively. The number of eigenvectors is $J$. The target speaker individuality is controlled by the $J$-dimensional weight vector $\boldsymbol{w} = [w(1), \ldots, w(J)]^\top$. Consequently, the EV-GMM has target-speaker-independent parameters $\boldsymbol{\lambda}^{(EV)}$ consisting of $\boldsymbol{\mu}_m^{(X)}$, $\boldsymbol{A}_m$, $\boldsymbol{b}_m$, and $\boldsymbol{\Sigma}_m^{(X,Y)}$, and a target-speaker-dependent parameter $\boldsymbol{w}$.

The EV-GMM is trained using multiple parallel data sets consisting of a single input speech data set and many output speech data sets including various speakers' voices. The trained EV-GMM allows us to control the converted voice quality by manipulating the weight vector $\boldsymbol{w}$. Moreover, the GMM for the input speech and new target speech are flexibly built by automatically determining the weight vector $\boldsymbol{w}$ using only a few arbitrary utterances of the target speech in a text-independent manner.

4. VOICE CONVERSION FROM ESOPHAGEAL SPEECH TO SPEECH (ES-TO-SPEECH)

In order to significantly enhance esophageal speech, it is essential to remove the specific noise sounds and to generate smoothly varying speech parameters over an utterance. Moreover, in order to synthesize enhanced speech with modified speech parameters, it is also essential to deal with difficulties of feature extraction of esophageal speech. To address these issues, we propose statistical voice conversion from esophageal speech into normal speech. Because converted speech parameters are generated from the statistics of the normal speech, the specific noise sounds and unstable variations are alleviated effectively by the conversion process. Furthermore, even if some speech parameters such as $F_0$ and unvoiced/voiced information are difficult to extract from esophageal speech, those parameters exhibiting properties similar to those of normal speech would be estimated by the conversion from other speech parameters robustly extracted from esophageal speech (e.g., spectral envelope). This estimation process may be regarded as a statistical feature extraction process [8].

4.1. Feature Extraction in ES-to-Speech

The spectral components of esophageal speech vary unstably as mentioned in Section 2. Moreover, spectral features of some phonemes are often collapsed due to difficulties of producing them in esophageal speech. To alleviate these issues, we use a spectral segment feature extracted from multiple frames. At each frame, a spectral parameter vector at the current frame and those at several preceding and succeeding frames are concatenated. Then, dimension reduction with principal component analysis (PCA) is performed to extract the spectral segment feature.
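The segment feature of Section 4.1 (concatenating neighboring frames, then reducing the dimension with PCA) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the random stand-in data, the edge padding at utterance boundaries, and the choice of 50 principal components are all assumptions made here. Only the 25 mel-cepstral coefficients per frame and the context of ±8 frames follow the experimental conditions in Section 5.1.

```python
import numpy as np


def stack_frames(feats: np.ndarray, context: int) -> np.ndarray:
    """Concatenate each frame with its `context` preceding and succeeding frames.

    feats has shape (T, D); the result has shape (T, (2 * context + 1) * D).
    Utterance edges are padded by repeating the first/last frame (an assumption).
    """
    num_frames = feats.shape[0]
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + num_frames] for i in range(2 * context + 1)])


def fit_pca(stacked: np.ndarray, n_components: int):
    """Estimate the PCA mean and the leading principal axes via SVD."""
    mean = stacked.mean(axis=0)
    _, _, vt = np.linalg.svd(stacked - mean, full_matrices=False)
    return mean, vt[:n_components].T  # basis shape: (stacked_dim, n_components)


def segment_features(feats, context, mean, basis):
    """Project the stacked frames onto the PCA basis."""
    return (stack_frames(feats, context) - mean) @ basis


# Hypothetical stand-in for mel-cepstra: 200 frames x 25 coefficients (0th-24th).
rng = np.random.default_rng(0)
mcep = rng.standard_normal((200, 25))

stacked = stack_frames(mcep, context=8)           # current +/- 8 frames
mean, basis = fit_pca(stacked, n_components=50)   # 50 is an arbitrary choice
seg = segment_features(mcep, 8, mean, basis)
print(stacked.shape, seg.shape)
```

In practice the PCA mean and basis would be estimated once on the training utterances and then reused unchanged at conversion time, so that training and test features live in the same space.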

Although it is difficult to extract $F_0$ from esophageal speech (see Figure 1), we usually perceive pitch information of esophageal speech. Assuming that relevant information is included in spectral parameters, we use the spectral segment feature as an input feature for estimating $F_0$ in the conversion process. Moreover, in order to make the estimated $F_0$ correspond to the perceived pitch information of esophageal speech, as an output feature we use $F_0$ values extracted from normal speech uttered by a non-laryngectomee so as to make its prosody similar to that of esophageal speech.

4.2. Training and Conversion

In order to convert esophageal speech into normal speech, we use three different GMMs to estimate three speech parameters of the target normal speech, i.e., spectrum, $F_0$, and aperiodic components that capture noise strength of an excitation signal on each frequency band [9]. For the spectral estimation, we use a GMM to convert the input spectral segment features into the corresponding output spectral parameters. For the $F_0$ estimation, we use a GMM to convert the input spectral segment features into the output $F_0$ values and unvoiced/voiced information. For the aperiodic estimation, we use a GMM to convert the input spectral segment features into the output aperiodic components.

In synthesizing the converted speech, we design a mixed excitation based on the estimated $F_0$ and aperiodic components [9]. Then, we synthesize the converted speech by filtering the mixed excitation with the estimated spectral parameters.

4.3. Applying One-to-Many EVC to ES-to-Speech

We further apply one-to-many EVC to ES-to-Speech for flexibly controlling converted voice quality. The one-to-many EV-GMM for spectral conversion is trained using multiple parallel data sets of esophageal speech data uttered by the laryngectomee and normal speech data uttered by many non-laryngectomees. The trained EV-GMM allows laryngectomees to speak in their favorite voices, which are created by manipulating the weight vector for eigenvoices or by estimating proper weight values using only a small amount of those voices' data as adaptation if they are given.

5. EXPERIMENTAL EVALUATIONS

5.1. Experimental Conditions

We recorded 50 phoneme-balanced sentences of esophageal speech uttered by one Japanese male laryngectomee. We also recorded the same sentences of normal speech uttered by a Japanese male non-laryngectomee. He tried to imitate the prosody of the laryngectomee utterance-by-utterance as closely as possible. Sampling frequency was set to 16 kHz. We conducted a 5-fold cross validation test in which 40 utterance-pairs were used for training, and the remaining 10 utterance-pairs were used for evaluation.

The 0th through 24th mel-cepstral coefficients extracted with STRAIGHT analysis [10] were used as the spectral parameter. As the source excitation features of normal speech, we used log-scaled $F_0$ extracted with STRAIGHT $F_0$ analysis [11] and aperiodic components [9] on five frequency bands, i.e., 0-1, 1-2, 2-4, 4-6, and 6-8 kHz, which were used for designing mixed excitation. The shift length was 5 ms.

We preliminarily optimized several parameters such as the number of mixture components of each GMM and the number of frames used for extracting the spectral segment feature [8]. As a result, we set the number of mixture components to 32 for each of three GMMs. For the segment feature extraction, we used the current ± 8 frames in both the spectral estimation and the aperiodic estimation and the current ± 16 frames in the $F_0$ estimation, respectively.

Table 1. Estimation accuracy of mel-cepstrum without power and aperiodic components. Mel-cepstral distortion with power (i.e., including the 0th coefficient) is shown in parentheses.

             Mel-cepstral distortion [dB]    Aperiodic distortion [dB]
Extracted    8.46 (11.95)                    6.99
Converted    4.96 (6.16)                     3.71

Table 2. $F_0$ correlation coefficient (Corr.) between the extracted/converted $F_0$ and the target $F_0$ extracted from normal speech, and unvoiced/voiced (U/V) decision error. For example, "V→U" shows the rate of estimating a voiced frame as an unvoiced one.

             Corr.    U/V decision error [%]
Extracted    0.07     43.82 (V→U: 42.60, U→V: 1.22)
Converted    0.68     8.36 (V→U: 4.06, U→V: 4.30)

5.2. Objective Evaluations

Tables 1 and 2 show estimation accuracy of spectrum, aperiodic components, and $F_0$. It is observed that the acoustic features of esophageal speech are very different from those of normal speech. These large differences of the acoustic features are significantly reduced by ES-to-Speech. We can see that the proposed conversion method is very effective for estimating any of the three acoustic features.

5.3. Perceptual Evaluations

We conducted two opinion tests of intelligibility and naturalness. The following six types of speech samples were evaluated by 10 listeners.

ES: recorded esophageal speech
ES-AS: analysis-synthesized esophageal speech
EstSpg: synthetic speech using converted mel-cepstrum, converted aperiodic components, and $F_0$ extracted from esophageal speech
EstF0: synthetic speech using extracted mel-cepstrum, extracted aperiodic components, and converted $F_0$
CV: synthetic speech using converted mel-cepstrum, converted aperiodic components, and converted $F_0$
NS-AS: analysis-synthesized normal speech

Each listener evaluated 120 samples in each of the two tests (several samples are available from http://spalab.naist.jp/hironori-d/ICASSP/ES2SP/index.html).

Figures 2 and 3 show the result of the intelligibility test and that of the naturalness test, respectively. ES-AS causes significant intelligibility degradation compared to ES due to the difficulties of the acoustic feature extraction in esophageal speech. The specific noises and unstable variations on the spectrogram in esophageal speech are significantly alleviated by using the estimated spectral features. Moreover, the converted speech exhibiting pitch information similar to pitch perceived in esophageal speech is generated by using the estimated $F_0$ contour. Although significant improvements in intelligibility and naturalness are not observed when using only one of these estimated features, we can see that the ES-to-Speech (CV) estimating all acoustic features yields significantly more intelligible and natural speech than esophageal speech.

These results suggest that the proposed ES-to-Speech is very effective for improving both the naturalness and the intelligibility of esophageal speech.
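The opinion-test results in Figures 2 and 3 are reported as mean opinion scores, and Fig. 2 is labeled with 95% confidence intervals. The paper does not say how those intervals were computed, so the following is only a sketch under stated assumptions: a 5-point rating scale, a normal approximation (z = 1.96), and invented ratings that are not data from the paper. The function name `mos_with_ci` is hypothetical.

```python
import math
import statistics


def mos_with_ci(scores, z=1.96):
    """Return (mean, half_width) of an approximate 95% confidence interval.

    Uses the normal approximation mean +/- z * s / sqrt(n); this choice is an
    assumption, since the paper does not state its interval method.
    """
    n = len(scores)
    mean = statistics.fmean(scores)
    sem = statistics.stdev(scores) / math.sqrt(n)  # standard error of the mean
    return mean, z * sem


# Invented 5-point ratings for two of the six sample types (illustration only).
ratings = {
    "ES": [2, 3, 2, 2, 3, 2, 1, 2, 3, 2],
    "CV": [3, 4, 3, 4, 3, 3, 4, 3, 4, 3],
}

for name, scores in ratings.items():
    mean, half_width = mos_with_ci(scores)
    print(f"{name}: MOS = {mean:.2f} +/- {half_width:.2f}")
```

With few ratings per condition, a Student-t critical value would give a wider and more conservative interval than z = 1.96.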

[Fig. 2. Mean opinion score on intelligibility, with 95% confidence intervals, for ES, ES-AS, EstSpg, EstF0, CV, and NS-AS.]

[Fig. 3. Result of the naturalness test for the same six sample types (see Section 5.3).]

5.4. Effectiveness of One-to-Many EVC

To make voice quality of the converted speech similar to the laryngectomee's own voice quality, we applied one-to-many EVC to ES-to-Speech. We trained the one-to-many EV-GMM using parallel data sets consisting of the esophageal speech and 30 speakers' normal speech. The EV-GMM was adapted to an esophageal speech sample shown in Figure 1, and then the esophageal speech sample was converted into normal speech using the adapted EV-GMM.

Figure 4 shows an example of waveform, spectrogram, and $F_0$ of the converted speech. We can see that 1) acoustic features of the converted speech vary more stably than those of the original esophageal speech sample and 2) the specific noise sounds are significantly alleviated by the conversion process. Namely, even if esophageal speech is used as the adaptation data, the adapted EV-GMM provides the converted speech of which properties are similar to those of normal speech. In addition, we have observed that the adapted EV-GMM makes the converted voice quality closer to the laryngectomee's voice quality compared with the GMM used in the previous evaluations, which were trained using the single non-laryngectomee's speech. Furthermore, we have also observed that the converted voice quality is flexibly changed by manipulating the weight values of the EV-GMM. The proposed ES-to-Speech with one-to-many EVC is expected to make possible a new speaking aid system that allows laryngectomees to control the converted voice quality as they want.

[Fig. 4. Example of waveform, spectrogram, and $F_0$ of the converted speech by one-to-many EVC. Horizontal axis: Time [sec].]

6. CONCLUSION

This paper has presented a novel method for enhancing esophageal speech using statistical voice conversion. The proposed method (ES-to-Speech) converts a spectral segment feature of esophageal speech into spectrum, $F_0$, and aperiodic components of normal speech independently using three different GMMs. The experimental results have demonstrated that ES-to-Speech yields significant improvements in intelligibility and naturalness of esophageal speech. Moreover, we have also applied one-to-many eigenvoice conversion to ES-to-Speech for flexibly controlling voice quality of the converted speech.

7. REFERENCES

[1] A. Hisada and H. Sawada, "Real-time clarification of esophageal speech using a comb filter," International Conference on Disability, Virtual Reality and Associated Technologies, pp. 39-46, 2002.

[2] K. Matsui, N. Hara, N. Kobayashi, and H. Hirose, "Enhancement of esophageal speech using formant synthesis," Proc. ICASSP, pp. 1831-1834, Phoenix, Arizona, May 1999.

[3] Y. Stylianou, O. Cappé, and E. Moulines, "Continuous probabilistic transform for voice conversion," IEEE Trans. Speech and Audio Processing, Vol. 6, No. 2, pp. 131-142, 1998.

[4] T. Toda, A.W. Black, and K. Tokuda, "Voice conversion based on maximum likelihood estimation of spectral parameter trajectory," IEEE Trans. ASLP, Vol. 15, No. 8, pp. 2222-2235, Nov. 2007.

[5] T. Toda, Y. Ohtani, and K. Shikano, "One-to-many and many-to-one voice conversion based on eigenvoices," Proc. ICASSP, pp. 1249-1252, Hawaii, USA, Apr. 2007.

[6] A. Kain and M.W. Macon, "Spectral voice conversion for text-to-speech synthesis," Proc. ICASSP, pp. 285-288, Seattle, USA, May 1998.

[7] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura, "Speech parameter generation algorithms for HMM-based speech synthesis," Proc. ICASSP, pp. 1315-1318, Istanbul, Turkey, June 2000.

[8] H. Doi, K. Nakamura, T. Toda, H. Saruwatari, and K. Shikano, "Enhancement of Esophageal Speech Using Statistical Voice Conversion," APSIPA 2009, pp. 805-808, Sapporo, Japan, Oct. 2009.

[9] H. Kawahara, J. Estill, and O. Fujimura, "Aperiodicity extraction and control using mixed mode excitation and group delay manipulation for a high quality speech analysis, modification and synthesis system STRAIGHT," MAVEBA 2001, Florence, Italy, Sept. 2001.

[10] H. Kawahara, I. Masuda-Katsuse, and A. Cheveigné, "Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds," Speech Communication, Vol. 27, No. 3-4, pp. 187-207, 1999.

[11] H. Kawahara, H. Katayose, A. Cheveigné, and R.D. Patterson, "Fixed point analysis of frequency to instantaneous frequency mapping for accurate estimation of F0 and periodicity," Proc. EUROSPEECH, pp. 2781-2784, Budapest, Hungary, Sept. 1999.
