Creating Speaker Independent HMM Models for Restricted Database Using STRAIGHT-TEMPO Morphing


Alexandre Girardi, Kiyohiro Shikano, Satoshi Nakamura
Nara Institute of Science and Technology
Takayama-cho 8916-5, Ikoma-shi, Nara-ken 630-0101 Japan
E-mail: [email protected]
ICSLP'98

ABSTRACT

In speaker independent speech recognition, one problem we often face is an insufficient database for training. A system trained only with several male and female databases will likely lack the information which is present in speakers with different pitch and vocal tract lengths. A children's database, which is not easy to obtain, is an extreme example of how different pitch and spectral frequency stretch affect a real speaker independent speech recognition system. In this paper, as an approach to solve the above problem, we study the effect of a combined change in the pitch and spectral frequency stretch of the original utterances in the database, in order to construct more robust HMM acoustic models. We study this effect by constructing morphed speaker databases which are converted from available male and female databases to target female and children voices respectively.

Using the morphed database, we analyzed the level of improvement that can be obtained, in terms of recognition rate, compared with the real database. The recognition rate of the female voice recognized with male models improved from 60.2% to 87.3% with the morphed models. This result attests to the effectiveness of the proposed method and to the strong influence that the combined pconv (pitch conversion rate) and fconv (frequency conversion rate) have on the quality of the acoustic models. The result also attests to the usefulness of the proposed algorithm for estimating pconv and fconv. In this paper, experiments with male and female data morphed towards 6 children voices are carried out. The isolated effect of pconv and fconv is also reported.

1. INTRODUCTION

One of the major problems in LVCSR (large vocabulary continuous speech recognition) systems is the insufficient amount of training data. This problem is even more sensitive when we talk about children data.

At the same time, a high-quality morphing algorithm named STRAIGHT-TEMPO [1] has recently been proposed (a full detailed description can be obtained in the draft paper for CASA-IJCAI, 1997, at http://www.hip.atr.co.jp/~kawahara/STRAIGHT.html). It has excellent morphing qualities, capable of even 600% manipulation of speech parameters such as pitch, vocal tract length and speaking rate, while keeping the human-like naturalness of the voice. We decided to use STRAIGHT-TEMPO to increase the database diversity, by adding the usually large amount of adult voice data morphed towards children voice data. We evaluate how STRAIGHT-TEMPO performs in an extreme case like morphing data towards children voice data.

STRAIGHT-TEMPO [1][2] is a combination of three basic tools: STRAIGHT, TEMPO and SPIKES. With STRAIGHT (Speech Transformation and Representation using Adaptive Interpolation of weiGHTed spectrogram) we can construct a spectral time-frequency envelope of the signal, free of the periodicity effects in time-frequency analysis. This time-frequency envelope can be manipulated, for example in the time and frequency domains. As an additional input STRAIGHT uses the F0 estimation obtained with the TEMPO (Time-domain Excitation extractor based on Minimum Perturbation Operator) algorithm. The fundamental frequency F0 can also be manipulated, and together with the manipulated spectral time-frequency envelope obtained in STRAIGHT we can synthesize a morphed version of the original signal using SPIKES (Synthetic Phase Impulse for Keeping Equivalent Sound).

Section 2 describes how we estimated the parameters necessary to morph adult data towards children's voices using STRAIGHT. Section 3 describes the database used, section 5 presents the experimental results, and section 6 comments on the results obtained.

2. PITCH AND FREQUENCY CONVERSION RATE ESTIMATION

This section describes the pconv (pitch conversion rate) and fconv (frequency conversion rate) estimation algorithms used in this paper. pconv and fconv are the average rates of conversion for the fundamental frequency and the spectral stretch between two speakers. We estimated pconv and fconv from the vowel parts of speech. The vowels were obtained by using HTK to get an alignment of the utterances of both base and target speakers. Some additional care has been taken:

• only alignments taken from correct recognitions were used.
• aligned vowel samples have been energy normalized before being processed.

To present the expressions we used to estimate pconv and fconv we will first introduce some notation. Consider a pair of speakers (x, y), where speaker x is the base speaker and speaker y is the target speaker.

From both speakers we extract the recognized vowel parts, producing vectors of fundamental frequency F0_s(t) and spectrum |A_s(t, f)|, where t is the time of the sample and f is the frequency in the spectrum for the speaker s. F0_s(t) and |A_s(t, f)| are obtained from STRAIGHT by applying pconv = 1.0 and fconv = 1.0.

Aligning these recognized vowels between both speakers x and y, with vowels of the same kind, we get pairs of ranges for the fundamental frequency

    (F0_x(t_{xs}(p), t_{xe}(p)), F0_y(t_{ys}(p), t_{ye}(p)))

and for the spectrum

    (A_x(t_{xs}(p), t_{xe}(p), f), A_y(t_{ys}(p), t_{ye}(p), f))

where p is the pair number with p ∈ 1, ..., P and P is the maximum number of pairs we can align within the available data. The first index of the time t refers to the speaker and the second to the start s and the end e of the alignment respectively.

Using these aligned pairs of vowels we can estimate pconv and fconv according to the following equations (1), (2), (3) and (4):

    N_x(p) = t_{xe}(p) - t_{xs}(p)    (1)

    N_y(p) = t_{ye}(p) - t_{ys}(p)    (2)

    pconv = \frac{1}{P} \sum_{p=1}^{P} \frac{ \frac{1}{N_y(p)} \sum_{t=t_{ys}(p)}^{t_{ye}(p)} F0_y(t) }{ \frac{1}{N_x(p)} \sum_{t=t_{xs}(p)}^{t_{xe}(p)} F0_x(t) }    (3)

    fconv = \frac{1}{P} \sum_{p=1}^{P} \frac{ \frac{1}{N_y(p)} \sum_{t=t_{ys}(p)}^{t_{ye}(p)} \sum_{f} f \log |A_y(t, f)| }{ \frac{1}{N_x(p)} \sum_{t=t_{xs}(p)}^{t_{xe}(p)} \sum_{f} f \log |A_x(t, f)| }    (4)

In the source, equation (4) is only partly legible; the form above assumes the same frequency weighting f in the numerator and the denominator.

3. DATABASE

The voices of 6 children have been recorded at 48kHz and downsampled to 16kHz. The ages and gender of the children are shown in Table 1. The 210 words uttered by the children are words in common use among children of their age.

Table 1: Children database description.
[table body not recoverable]

The adult data we used was the MHT and FSU 5240 words and the MHT, MAU, MMY, FKN, FSU and FYM 216 balanced words of ATR SetA [3].

4. PARAMETERIZATION

Table 2 describes the experiment conditions (parameterization used) for the following experiment as well as for the rest of the experiments in this paper.

Table 2: Parameterization.
[table body not recoverable]

All the recognition rates in the next section refer to tied mixture models. 55 phoneme models were used.

5. EXPERIMENTS

The first experiment aims to confirm the reliability of the estimation of pconv (pitch conversion rate) and fconv (frequency conversion rate). The second experiment evaluates how this algorithm works when applied to children data.

5.1. PITCH AND FREQUENCY RATE ESTIMATION

To evaluate the pconv and fconv estimation, a male speaker (MHT, ATR SetA [3]) was morphed towards a female voice with appropriate constant pconv and fconv parameters, generating a morphed female voice (FMHT).

The values of pconv and fconv which are used to convert MHT speaker data towards FSU speaker data, as well as their reestimated values, are shown in Table 3. The reestimated pconv and fconv are close to the values used for the conversion of MHT to FMHT.

Table 3: pconv and fconv used for the conversion and their reestimated values.

    Parameter    Used    Estimated
    pconv        2.22    2.17
    fconv        1.25    1.27

The word recognition rates using MHT, FSU and the morphed FMHT data are shown in Table 4. Models are trained with the odd numbered words and tested against the even numbered words of the ATR SetA database.

Table 4: Word recognition rate (%). FMHT is the MHT speaker morphed towards the FSU speaker.
[table body not recoverable]

From this result we conclude that the morphed data, approximated to the real female data, increase the recognition rate from 60.2% to 87.3%.

Also, for an adult male voice morphed towards a female voice, the pconv and fconv estimates diverge by only about 2% from the values used, attesting to the robustness of the algorithm.
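As a hedged illustration of the pconv estimate in equation (3), the following Python sketch averages, over the aligned vowel pairs, the ratio of the target speaker's mean F0 to the base speaker's mean F0. The function name, the frame-indexed F0 lists and the half-open (start, end) frame ranges are assumptions made for this example; they are not taken from the authors' implementation.

```python
def estimate_pconv(f0_base, f0_target, pairs):
    """Sketch of equation (3): mean over pairs of (mean target F0 / mean base F0).

    f0_base, f0_target: per-frame F0 values (Hz) for speakers x and y.
    pairs: list of ((xs, xe), (ys, ye)) frame ranges for aligned vowels.
           Ranges are treated as half-open, so each holds (end - start) frames
           (an assumption of this sketch).
    """
    ratios = []
    for (xs, xe), (ys, ye) in pairs:
        n_x = xe - xs  # equation (1): number of frames for the base speaker
        n_y = ye - ys  # equation (2): number of frames for the target speaker
        if n_x <= 0 or n_y <= 0:
            continue  # skip empty or inverted alignments
        mean_x = sum(f0_base[xs:xe]) / n_x
        mean_y = sum(f0_target[ys:ye]) / n_y
        ratios.append(mean_y / mean_x)
    if not ratios:
        raise ValueError("no usable vowel pairs")
    return sum(ratios) / len(ratios)


if __name__ == "__main__":
    # Toy data: a 100 Hz base voice and a 220 Hz target voice.
    f0_x = [100.0] * 10
    f0_y = [220.0] * 8
    vowel_pairs = [((0, 5), (0, 4)), ((5, 10), (4, 8))]
    print(estimate_pconv(f0_x, f0_y, vowel_pairs))
```

For the values reported in Table 3, the relative deviation between used and reestimated rates is |2.17 - 2.22| / 2.22, about 2.3%, for pconv and |1.27 - 1.25| / 1.25, 1.6%, for fconv.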

5.2. MORPHING ADULT DATA TOWARDS CHILDREN DATA

The next experiment aims to evaluate the efficiency of STRAIGHT in morphing adult voices towards children voices. We used 6 adult speakers (3 male, 3 female, ATR SetA) with 216 balanced words each. For comparison we used 6 children (3 male, 3 female); each child uttered 210 words containing all the Japanese phonemes.

Three types of acoustic models were created:

• the first by using only children voices
• the second by using only adult voices
• the third by using only morphed data.

The estimated pconv and fconv between the 6 adults and the 6 children have been evaluated. Results can be seen in Table 5.

Table 5: pconv and fconv between speakers (from adult speaker to child speaker).
[speaker labels not recoverable; rows in source order]

    pconv    fconv
    2.08     1.19
    2.22     1.28
    1.43     1.12
    1.02     1.29
    1.06     1.24
    0.91     1.12

Four experiments were then performed:

• First, children voice was recognized with the model generated from the remaining children voices (corresponds to the first column in Table 6 and Table 7).
• Second, adult voice was recognized with the model generated from the remaining adult voices (corresponds to the first row in Table 6 and Table 7).
• Third, children voice was recognized with the model generated from adult voices (corresponds to the values in the central part of Table 6 and Table 7).
• Fourth, children voice was recognized with the model generated from adult voices morphed towards the children test data (corresponds to the values after a slash in Table 6 and Table 7).

All experiments are carried out within the same gender and age. Table 6 and Table 7 show the recognition results. The first letter of the speaker name represents its gender, where male and female are represented by M and F, respectively.

Table 6: Word recognition rate (%) using "male adult model"/"morphed model".

                         adult model
    test     child    MHT          MAU         MMY
    adult             100.0        98.2        98.2
    MCH01    85.2     17.1/68.3    4.8         18.6
    MCH02    59.5     1.4          1.0/39.2    2.4
    MCH03    90.0     17.1         7.6         22.1/63.8
    FCH01    73.3     4.3          2.4         1.8
    FCH02    70.0     1.4          0.5         0.0
    FCH03    73.3     8.6          5.2         14.8

Table 7: Word recognition rate (%) using "female adult model"/"morphed model".

                         adult model
    test     child    FKN          FSU          FYM
    adult             95.4         96.3         93.5
    MCH01    85.2     65.7         67.6         61.0
    MCH02    59.5     39.5         32.4         45.2
    MCH03    90.0     71.9         69.1         68.1
    FCH01    73.3     49.1/49.1    43.8         51.4
    FCH02    70.0     28.1         19.1/31.0    37.6
    FCH03    73.3     35.2         41.0         41.4/33.3

5.3. FIXED PITCH AND FREQUENCY CONVERSION RATE

In order to compare the estimation of pconv and fconv with the optimum values, we carried out recognition using models morphed in steps of 0.05 for each parameter, close to the estimated values. These recognition results are shown in Tables 8 and 9.

Table 8: Confirming the pconv and fconv male estimations by carrying out recognition using near-optimum pconv and fconv. Recognition rate is expressed in word accuracy (%).
[table body not recoverable]

The fixed pconv and fconv experiments resulted in the highest recognition rates near the values (see Table 5) obtained with the estimation algorithms (equations (1) to (4)). This attests to the robustness of the proposed algorithms.

For male data morphed towards children data, a higher increase of the recognition rate was achieved, while the female data presented almost no significant improvement. This shows that additional degrees of manipulation are necessary to morph adult data towards children data.
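The search around the estimated rates in steps of 0.05 can be pictured with a small Python sketch. It only enumerates candidate (pconv, fconv) settings; the number of steps on each side (n_steps) and the function names are hypothetical, since the paper does not state how many settings were tried, and the morphing and recognition runs themselves are not modeled here.

```python
from itertools import product


def candidate_rates(center, step=0.05, n_steps=2):
    """Values center - n_steps*step ... center + n_steps*step, rounded to 2 decimals.

    n_steps is a hypothetical choice for this sketch.
    """
    return [round(center + k * step, 2) for k in range(-n_steps, n_steps + 1)]


def candidate_pairs(pconv_est, fconv_est, step=0.05, n_steps=2):
    """All (pconv, fconv) combinations on the grid around the estimates."""
    return list(product(candidate_rates(pconv_est, step, n_steps),
                        candidate_rates(fconv_est, step, n_steps)))


if __name__ == "__main__":
    # 2.08 and 1.19 are the first pair of values listed in Table 5.
    for pconv, fconv in candidate_pairs(2.08, 1.19):
        print(pconv, fconv)
```

Each candidate pair would then be used to morph the adult data, train a model, and measure word accuracy, so that the best-scoring pair can be compared with the estimated one.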

Table 9: Confirming the pconv and fconv female estimations by carrying out recognition using near-optimum pconv and fconv. Recognition rate is expressed in word accuracy (%).
[table body not recoverable]

6. CONCLUSIONS AND FUTURE WORK

This paper presented an alternative way to increase the database for HMM acoustic model generation by using the high-quality STRAIGHT-TEMPO algorithm. Morphing adult data towards adult data, the algorithm increased the female voice recognition rate using models trained with male data from 60.2% to 87.3% with morphed data.

The algorithms proposed for pitch and frequency conversion rate estimation proved to be robust for adult data. The increase in the word recognition rate for children data, when adult data is morphed towards children data, attests to the usefulness of the proposed method for both male and female adult data. In this way large amounts of adult male and female data can be morphed to match children data, while each child only needs to record a small number of words.

In the future we plan to investigate a non-linear frequency conversion, as well as a more robust estimation of the frequency conversion rate, by adapting the frequency range used to follow the frequency conversion obtained.

ACKNOWLEDGMENT

This work is supported by CREST (Core Research for Evolutional Science and Technology), JAPAN.

7. REFERENCES

1. H. Kawahara. Speech representation and transformation using adaptive interpolation of weighted spectrum: Vocoder revisited. IEEE Int. Conf. Acoust., Speech and Signal Process., vol. 2, pages 1303-1306, Munich, 1997.
2. H. Kawahara and de Cheveigne. Error free F0 extraction method and its evaluation. Tech. Report of IEICE, SP96-96, pp. 9-18, 1997. (in Japanese).
3. H. Kuwabara, Y. Sagisaka, K. Takeda, and M. Abe. Construction of ATR Japanese speech database as a research tool. Technical Report TR-I-0086, ATR, 1989. (in Japanese).
