Automatic Recognition of Nasals

Minoru Shigenaga, Hitoshi Ariizumi
(Received September 10, 1964)

A method of recognizing |m| and |n| in monosyllables and words in real time is reported. The method consists of three parts, which perform segmentation of the nasal consonant, discrimination between |m| and |n|, and recognition of the following vowel. To extract the nasal part, five conditions are used. By comparing the output of a 300 c/s LPF with that of a 500–1600 c/s BPF, |e, a, o, u, w| are excluded from nasals; by comparing the 500–1600 c/s and 2300–5000 c/s frequency ranges, |i, j| are excluded. Voiceless stops are easily omitted by comparing the output of the 300 c/s LPF with that of a 700 c/s HPF, but this circuit is also used for excluding vowels. For excluding voiced stops, fricatives and flapped consonants, fundamental frequency components are extracted after filtering the speech waves by a 700–3000 c/s BPF, and the parts in which the outputs exist continuously are considered likely to be nasal. The low-level parts of the original speech waves are excluded. The segment which satisfies these five conditions is decided to be a nasal consonant after leaving out 12 ms of the onset. With this method the initial part of the nasal is often largely missed, but the boundary between the nasal consonant and the following vowel is pointed out explicitly.

To discriminate between |m| and |n|, the components of two frequencies just before the boundary between the nasal consonant and the following vowel are compared for distinguishing |mi| and |me| from |ni| and |ne| respectively. For the nasals followed by |a, o, u|, the F2 loci are utilized directly. Though |me| and |ne| in words are not discriminated satisfactorily because of individual differences, the others are recognized at over 80% for three male voices.

§1. Introduction

It has become possible to recognize |m| and |n| in Japanese monosyllables and words in real time. This paper describes the progress made on this problem.

In producing a nasal consonant, the nasal cavity is coupled to the pharynx by lowering the soft palate, the oral cavity is closed at the lips or elsewhere, and breath is emitted through the nostrils. The most remarkable characteristic of the nasal consonant is the existence of a zero in the transfer function resulting from the closure of the oral cavity. Moreover the nasals |m|, |n| and |ŋ| close the oral cavity at different positions and so place the zeros at different frequencies, so that their discrimination should be possible during pronunciation. According to synthesis experiments on the nasals, however, selecting a proper F2 locus in the following vowel makes it possible to identify |m|, |n| and |ŋ| even if the same frequency spectrum pattern is used in the part of |m|, |n| and |ŋ| 1), 2). Nor is it necessary to take any zero into consideration in the synthesis of the nasals 3). For it is doubtful whether the aural mechanism perceives the zero itself, and the zero also somewhat cancels the formant around it, in addition to the reduction caused by the low first formant. So the energies in the middle and high frequency ranges become rather weak and the perception of nasals depends mainly on the first formant. The discrimination of |m|, |n| and |ŋ| may be made by the difference of frequency spectra in the parts of "transition" to the vowels. Also the vowels |u|, |i|, similarly to the nasals, have their first formants at about 250–350 c/s. These facts make it difficult not only to recognize |m|, |n| but also to extract the nasal part rapidly and reliably.

In recognition it is most important to extract characteristic features, but it is also necessary to emphasize them and to suppress unnecessary parts. We are not yet able to extract nasal parts completely, but have tried to realize the above principle using circuits as simple as possible, and have made a recognizer consisting of three parts, which perform segmentation of nasal parts, discrimination between |m| and |n| for the nasal parts, and recognition of the following vowels. In the extraction of nasal parts, exclusion of the vowel |u| is difficult and so it is impossible to extract all nasal parts, but the circuit shows clearly the boundaries between nasal consonants and the following vowels, which are necessary in the discrimination of |m|, |n|. Yet the beginning and end parts of the vowel |i|, and especially the end of |u|, frequently show nasal information. In the discrimination of |m|, |n| the results reach more than 80% for three male voices.

§2. Segmentation of nasal consonants

The characteristics of nasal consonants are the following 4), 5).
(1) Large energies at about 200–300 c/s.
(2) The second, third and fourth poles lie at about 1000 c/s, 2000 c/s and 3000 c/s respectively, and there is another pole due to the existence of the zero; these levels are low but, as is the nature of voiced sounds, the waveforms in this region are periodic.
(3) The existence of a zero caused by closing the oral cavity. The frequencies are about 750–1250 c/s for |m|, 1450–2200 c/s for |n| and higher than 3000 c/s for |ŋ|, which are principally decided by the form of the mouth.

During pronunciation the form of the oral cavity changes, and therefore the zero point also varies. Owing to the existence of this zero, the frequency spectra higher than 800 c/s become complicated and unstable. The nasal poles, as well as the zero, cannot be extracted simply and precisely from the frequency spectra themselves without a method such as analysis by synthesis. Because of these facts the recognition is difficult on the acoustical level.

Now, the speech wave filtered by a 700 c/s HPF shows a relatively abrupt rise in amplitude; we take that point as the boundary between the nasal and the following vowel, though there are many cases in which it would be better to say that the vowel has begun about one period earlier. Then, using the above properties, we exclude the other phonemes in the following order.

(i) Discriminating the nasals from the vowels |e, a, o, u| and |w|: Comparing the frequency energies below 300 c/s with those of 500–1600 c/s, in the nasals the former component is larger than the latter.

(ii) Discriminating from |i, j|: Comparing the frequency energies of 500–1600 c/s with those of 2300–5000 c/s, in the nasals the former is larger.

Dec. 1964. Reports of the Faculty of Engineering, Yamanashi University, No. 15

(iii) Discriminating from voiceless consonants: By comparing the frequency components below 300 c/s with those above 700 c/s, voiceless consonants can be excluded easily. The exclusion of voiceless consonants can also be accomplished by circuit (v) below, but it is difficult to exclude all parts of vowels reliably by circuit (i) alone, except for the vowel |a|. In particular, the Japanese vowel |u| is very unstable in the second formant region: when it follows a nasal the second formant appears clearly, but at the end of words it is extremely weak. For this reason this circuit is used to exclude vowels, especially |u|, in addition to circuit (i). In the above circuits a variable-μ tube is used in the amplifiers so as to emphasize the difference in amplitude between nasals and vowels and to make the boundary clear. On the other hand, amplitudes above a certain level are clipped. Further, in order to make sure of the exclusion of the vowel parts, the output of 500–1600 c/s in circuit (i) or of the 700 c/s HPF in circuit (iii), when it has exceeded a certain level, is fed back to the integrator in the 300 c/s LPF circuit so as to discharge the stored charge down to the static voltage level. This operation is useful for the exclusion of |u|, to say nothing of |e, a, o|, and for emphasizing the boundary to the nasal part.

(iv) Discriminating from voiced stops, fricatives and flapped consonants: Though these consonants usually have buzz, the high-frequency noises included in the buzz are very weak, or not so periodic even if they are predominant. So, extracting the fundamental pitch frequency after filtering the speech wave by a 700–3000 c/s BPF, the part in which two or more pitch-frequency outputs continue regularly is considered a nasal part. In this circuit the gain of the amplifier is regeneratively controlled by the envelope of the original speech wave, and the high-level parts of vowels are clipped.

(v) Exclusion of low-level parts: The ending of a vowel, especially |u|, where the amplitude is low, sometimes shows nasal information, so a circuit which excludes the low-level parts has been added.

The above five conditions are put into an AND circuit, and if the output continues over 12 ms from the onset, the part is decided to be a nasal segment. With this method the beginning of the nasal part is largely missed owing to the exclusion of |u|, but the end of this part shows explicitly the boundary between the nasal and the vowel or |j|. As stated later, in the discrimination of |m|, |n| the frequency spectra just before and after the boundary are so important that the above method is sufficient for our purpose. The block diagram is shown in Fig. 1. Some phonemes other than nasals are often excluded by two or more circuits.

§3. Discrimination between |m| and |n|

The most remarkable difference between |m| and |n| is in the frequencies of the zeros, due to the dissimilarity of the positions closing the oral cavity. But the extraction of these zeros is not easy in real time. On the other hand, from the viewpoint of analysis and synthesis we know that the difference is in the part of the transition to the following vowel. By short-time autocorrelation functions the writer has obtained the mean values of the F2 loci (mean values of τm1 in μs) for several male and female monosyllables, as shown in Table 1 6).

Fig. 1 Block diagram of the circuit extracting nasal consonants.

Fig. 5 Block diagram of the circuit discriminating |m| from |n|.
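The segmentation rule of §2 (five conditions fed into an AND circuit, accepted only when the output persists for more than 12 ms, with the first 12 ms left out) can be sketched in software as follows. This is only an illustrative sketch and not the authors' circuit: the 1 ms frame period, the `Frame` fields, and the thresholds `LOW_LEVEL` and `MIN_PITCH_PULSES` are assumptions introduced here, and the direction of each comparison follows our reading of conditions (i) to (v).

```python
from dataclasses import dataclass
from typing import List, Tuple

FRAME_MS = 1            # assumed frame period (ms); not from the paper
ONSET_MS = 12           # onset left out, as stated in the text
LOW_LEVEL = 0.05        # hypothetical threshold for condition (v)
MIN_PITCH_PULSES = 2    # "two or more pitch-frequency outputs" in condition (iv)


@dataclass
class Frame:
    """Hypothetical per-frame measurements (linear levels)."""
    lp300: float         # output of the 300 c/s LPF
    bp500_1600: float    # output of the 500-1600 c/s BPF
    bp2300_5000: float   # output of the 2300-5000 c/s band
    hp700: float         # output of the 700 c/s HPF
    pitch_pulses: int    # regular pitch pulses seen in the 700-3000 c/s band
    level: float         # level of the original speech wave


def is_nasal_like(f: Frame) -> bool:
    """AND of the five conditions (i)-(v)."""
    return (
        f.lp300 > f.bp500_1600                   # (i)   not |e, a, o, u, w|
        and f.bp500_1600 > f.bp2300_5000         # (ii)  not |i, j|
        and f.lp300 > f.hp700                    # (iii) not voiceless
        and f.pitch_pulses >= MIN_PITCH_PULSES   # (iv)  periodic in 700-3000 c/s
        and f.level > LOW_LEVEL                  # (v)   not a low-level part
    )


def nasal_segments(frames: List[Frame]) -> List[Tuple[int, int]]:
    """Return (start_ms, end_ms) of runs lasting more than ONSET_MS,
    with the first ONSET_MS of each run left out."""
    segments = []
    run_start = None
    for i, f in enumerate(list(frames) + [None]):   # None flushes the last run
        ok = f is not None and is_nasal_like(f)
        if ok and run_start is None:
            run_start = i
        elif not ok and run_start is not None:
            if (i - run_start) * FRAME_MS > ONSET_MS:
                segments.append((run_start * FRAME_MS + ONSET_MS, i * FRAME_MS))
            run_start = None
    return segments
```

As in the paper, the end of each returned segment marks the boundary to the following vowel, while the reported start is later than the true onset of the nasal.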


Table 1 F2 loci of nasal monosyllables extracted by short-time autocorrelation functions. Each entry gives τm1 (μs), followed by the value converted into frequency (c/s) in [ ] and a range in ( ).

Syllable   Male                     Female
mi         180 [2780] (175–200)     181 [2760] (175–200)
ni         155 [3230] (150–175)     169 [2960] (150–188)
me         251 [1990] (213–268)     255 [1960] (200–275)
ne         236 [2120] (200–250)     213 [2350] (188–225)
ma         488 [1020] (450–550)     475 [1050] (350–550)
na         331 [1510] (300–375)     273 [1830] (250–325)
mo         567 [880]  (550–600)     534 [940]  (450–575)
no         373 [1340] (300–425)     308 [1620] (275–375)
mu         550 [910]  (400–625)     498 [1000] (325–[?]25)
nu         325 [1540] (300–350)     244 [2050] (225–262)
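The bracketed values in Table 1 are consistent with the relation f ≈ 1/(2τm1), where τm1 is the delay of the first minimum of the autocorrelation function. A small sketch of the conversion is given below; the function names are ours, and rounding to the nearest 10 c/s is an assumption about how the table was prepared.

```python
def locus_frequency_cps(tau_us: float) -> float:
    """Convert a first-minimum delay (microseconds) to frequency in c/s
    using f = 1 / (2 * tau)."""
    return 1.0 / (2.0 * tau_us * 1e-6)


def round_to_10(x: float) -> int:
    """Round to the nearest 10 c/s (assumed rounding)."""
    return int(round(x / 10.0)) * 10


# Male mean delays from Table 1, in microseconds.
for syllable, tau in [("mi", 180), ("ni", 155), ("ma", 488), ("na", 331)]:
    print(syllable, round_to_10(locus_frequency_cps(tau)))
```

The four printed values can be compared directly with the male column of Table 1.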

Fig. 2 Block diagram of the circuit discriminating |m| from |n| by short-time autocorrelation functions.

In Table 1, τm1 is the first minimum point (μs) of the correlation function, and 1/(2τm1) gives approximately the frequency of the F2 locus. The values in ( ) indicate the range of distribution of τm1. The value in [ ] indicates the mean F2 locus converted into frequency in c/s. In this measurement, because the third formant region is not excluded, the F2 loci are not extracted correctly for the following vowels |i, e|, but for the other vowels the results show proper values which differ considerably between |m| and |n|.

(a) Method using the F2 locus 7)

So, by the circuit construction shown in Fig. 2, we can distinguish |m| from |n| by comparing the magnitudes of the autocorrelation function before and after the proper delay time τ for each following vowel. The results of identification of |m|, |n| for monosyllables are shown in Table 2, where a 700–2200 c/s band-pass filter has been used in Fig. 2. In Tables 2 and 3, if either |m| or |n| is misidentified for the utterances of a speaker, the samples of that speaker are considered not to be correctly judged. So in the tables, for example, '9' means that both |m| and |n| of 9 speakers among 10 are discriminated and decided correctly. From the results in Table 2, |mi| and |ni| are not separated, but |me| and |ne| become separable by excluding the third formant region.
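Method (a) compares the magnitude of the short-time autocorrelation function at two delays τ1 and τ2 chosen for each following vowel. The following is a minimal software sketch of such a comparison. The sampling rate, the single-sinusoid test frames and the decision rule (the autocorrelation still falling between τ1 and τ2 is taken to mean that the first minimum lies at a larger delay, that is, a lower F2 locus as for |m|) are assumptions made for illustration, not details given in the paper.

```python
import math
from typing import List

FS = 40_000  # assumed sampling rate (samples/s), so that 25 us is one sample


def autocorr(x: List[float], lag: int) -> float:
    """Short-time autocorrelation of the frame x at the given lag."""
    return sum(x[i] * x[i + lag] for i in range(len(x) - lag))


def looks_like_m(x: List[float], tau1_us: float, tau2_us: float) -> bool:
    """True if the autocorrelation is still falling between tau1 and tau2,
    i.e. its first minimum lies at a larger delay (lower F2 locus)."""
    lag1 = round(tau1_us * 1e-6 * FS)
    lag2 = round(tau2_us * 1e-6 * FS)
    return autocorr(x, lag1) > autocorr(x, lag2)


def tone(freq_cps: float, dur_ms: float = 10.0) -> List[float]:
    """Synthetic stand-in for a band-passed frame: a single sinusoid."""
    n = int(FS * dur_ms / 1000)
    return [math.sin(2 * math.pi * freq_cps * i / FS) for i in range(n)]


# F2 loci of |ma| and |na| (male means in Table 1) tested at the
# 350 and 375 us delays listed for ma, na in Table 2.
print(looks_like_m(tone(1020), 350, 375))
print(looks_like_m(tone(1510), 350, 375))
```

A real frame contains several formants, which is why the paper band-limits the signal (700–2200 c/s) before the correlator; the sinusoid here only illustrates the geometry of the comparison.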

Table 2 Results of discrimination of |m|, |n| in monosyllables by short-time autocorrelation functions. τ1, τ2 show the delay times in μs at which the functions are extracted, and the number of correct decisions is the number of speakers for whom both |m| and |n| were correctly discriminated.

Syllables   Speakers   τ1·τ2 (μs)   Correct decisions
mu, nu      10         400·425      10
                       425·450      9
                       450·475      9
mo, no      10         400·425      9
                       425·450      10
                       450·475      10
ma, na      11         350·375      11
                       375·400      11
                       400·425      9
me, ne      10         275·300      10
mi, ni      14         300·325      4

Table 3 Results of discrimination of |m|, |n| in words by short-time autocorrelation functions for the samples uttered by 8 males. Refer to Table 2 for notations.

Syllables   τ1·τ2 (μs)             Correct decisions by preceding vowel
                                   i    a    o
mu, nu      400·425                8    8    1
            425·450                7    8    3
            450·475                7    8    4
            between 500 and 600              7
mo, no      400·425                8    8    4
            425·450                8    8    4
            450·475                8    8    5
            between 550 and 625              8
ma, na      375·400                8    8    8
            400·425                8    8    8
            425·450                8    8    8
me, ne      275·300                3    4    1

Next, for the nasals which exist in words consisting of 2 or 3 syllables and which have the preceding vowels |i, a, o|, the results of discriminating |m| from |n| are shown in Table 3. In this case |me| and |ne| are not separated, and the effect of the preceding vowel |o| on |mu|, |nu|, |mo| and |no| is very large, so that we must select the identifying points at large delay times.

From the above results, by this method of extracting the F2 locus at the beginning of vowels, when the following vowels are |a, o, u|, |m| and |n| are discriminated in words as well as in monosyllables, but when the following vowels are |i, e|, especially |i|, it becomes impossible, because in the case of the vowel |i|, as shown in Fig. 3, the second and third formants vary very rapidly. For |me| and |ne| in monosyllables the discrimination is almost possible but the difference is delicate. In words, the positions of the F2 loci not only change according to the preceding vowel but also overlap each other or, moreover, are reversed unexpectedly. Therefore by such a simple method it may be difficult to discriminate |me| from |ne|.

(b) Method using the frequency spectra near the end of nasal parts

Now, if we attend to the frequency spectra for about 10 ms at the end of nasal parts, that is, just in front of vowels, the spectra show (refer to Fig. 3) that the frequency of |m| continuing to the second formant of the following vowel (F2 locus) is lower than that of |n|, or that in the region continuing to the third formant of the vowel |n| generally has larger energy, or energy at a higher frequency, than |m|. This may be due to the fact that the zero of |n| is in the higher frequency region of about 1500–2200 c/s. So, extracting the frequencies shown in Table 4 by simple resonance circuits and comparing them with each other, we can distinguish |m| from |n| in monosyllables for every following vowel. When the following vowels are |e| and |a|, or |u| and |o|, the outputs obtained by adding up the respective comparator outputs are used. Such a method is effective in improving the recognition of |me|, |mu|. The extracted frequencies shown in Table 4 are selected at about the F2 loci of |m| and at about the fifth poles of |n|. But in |mu| and |mi| the F2 region sometimes does not appear so explicitly, so the fourth and fifth poles are used for the following vowels |u| and |i|.
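Method (b) extracts two frequency components with simple resonance circuits and compares them. A rough digital counterpart is sketched below, using the Goertzel algorithm in place of the resonance circuits. That substitution, the sampling rate and the synthetic test frames are assumptions; only the frequency pairs are taken from Table 4.

```python
import math
from typing import Dict, List, Tuple

FS = 16_000  # assumed sampling rate (samples/s)

# (m frequency, n frequency) in c/s for each following vowel, from Table 4.
TABLE4: Dict[str, Tuple[int, int]] = {
    "i": (2200, 3000), "e": (1400, 2420), "a": (1000, 2850),
    "o": (800, 2650), "u": (2000, 2420),
}


def goertzel_power(x: List[float], freq_cps: float) -> float:
    """Power of x near freq_cps (Goertzel algorithm), standing in for
    the output of one resonance circuit."""
    w = 2.0 * math.pi * freq_cps / FS
    coeff = 2.0 * math.cos(w)
    s1 = s2 = 0.0
    for sample in x:
        s0 = sample + coeff * s1 - s2
        s2, s1 = s1, s0
    return s1 * s1 + s2 * s2 - coeff * s1 * s2


def decide(frame: List[float], following_vowel: str) -> str:
    """Return 'm' if the m-column component exceeds the n-column one."""
    f_m, f_n = TABLE4[following_vowel]
    return "m" if goertzel_power(frame, f_m) > goertzel_power(frame, f_n) else "n"


def two_tone(f1: float, a1: float, f2: float, a2: float, n: int = 160) -> List[float]:
    """Synthetic 10 ms frame with two sinusoidal components."""
    return [a1 * math.sin(2 * math.pi * f1 * i / FS)
            + a2 * math.sin(2 * math.pi * f2 * i / FS) for i in range(n)]
```

For example, `decide(two_tone(1000, 1.0, 2850, 0.2), "a")` compares the 1000 c/s and 2850 c/s components of a frame dominated by the lower one. The 10 ms frame length matches the interval at the end of the nasal part that the text examines.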

Fig. 3 (Nos. 1–3) Frequency spectrum envelopes of nasal monosyllables before and after the boundaries between the nasals and the following vowels. ①, ②, ③ correspond respectively to the points marked in the waves above, which are obtained by filtering by a 700 c/s HPF and show the boundary.

Table 4 Frequencies for discriminating between |m| and |n|, in c/s. For |m| the frequency component of the m column can be made larger than that of the n column for each following vowel.

Following vowel   m      n
i                 2200   3000
e                 1400   2420
a                 1000   2850
o                 800    2650
u                 2000   2420

Table 5 Frequencies for discriminating between |m| and |n|, in c/s.

Following vowel   m            n
i                 2200, 2340   2780
e                 1400, 1100   2500, 2850
a                 970          1480, 1750
o                 870          1250, 1500
u                 930          1350, 1520

By this method, for the total of 150 monosyllables uttered by five students, the number of cases in which |m| is decided as |n| is 3 in total, one for each of three |m| syllables (among them |mi| and |mu|) of the same speaker, whose formants are high. There is no |n| which is decided as |m|.

In monosyllables the above method is very useful for the discrimination, but in words the situation changes slightly. Owing to the influence of the preceding sounds, the components at about 2500–3000 c/s of |m| are emphasized and often overlap those of |n| in each syllable except |mi, ni|. Therefore it is impossible to apply the above method to the discrimination in words.

(c) Method using the frequency spectra before and after the boundary between nasals and vowels

Then we have considered using the above two methods (a) and (b) together. That is, the method of (a) is used to discriminate between |m| and |n| for the following vowels |a, o, u|, and (b) for |i, e|. In the method of (a), the technique of comparing two frequency components is also used. The extracted frequencies are shown in Table 5. For the following vowel |i| the frequencies are nearly identical with those of (b), and for |a, o, u| the frequencies are at about the F2 loci of |m|, |n| as shown in Table 1 (refer also to Figs. 3 and 4). To avoid the influence of the movement of the F2 locus caused by the preceding vowel, the sum of the frequency components at two points is used.
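For the following vowels |a, o, u|, method (c) compares the |m| component with the components at two |n| frequencies (Table 5). The decision itself reduces to a single comparison, sketched below. The function and variable names are ours, the component levels are assumed to be measured elsewhere, and treating "the sum of the frequency components at two points" as a plain arithmetic sum on the |n| side is our reading of the text rather than something the paper states outright.

```python
from typing import Dict, Tuple

# Frequencies in c/s from Table 5: vowel -> (m frequency, (two n frequencies)).
TABLE5_AOU: Dict[str, Tuple[int, Tuple[int, int]]] = {
    "a": (970, (1480, 1750)),
    "o": (870, (1250, 1500)),
    "u": (930, (1350, 1520)),
}


def comparator_positive(levels: Dict[int, float], vowel: str) -> bool:
    """True (|m| information) if the m component is larger than the sum
    of the two n components; `levels` maps frequency (c/s) to level."""
    f_m, (f_n1, f_n2) = TABLE5_AOU[vowel]
    return levels[f_m] > levels[f_n1] + levels[f_n2]


def classify(levels: Dict[int, float], vowel: str) -> str:
    """Syllable label: 'm' + vowel if the comparator is positive, else 'n' + vowel."""
    return ("m" if comparator_positive(levels, vowel) else "n") + vowel
```

For instance, `classify({970: 1.0, 1480: 0.3, 1750: 0.2}, "a")` yields the label for a frame whose 970 c/s component dominates.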

Fig. 4 (Nos. 1–6) Frequency spectrum envelopes of nasals before the boundaries between the nasals and the following vowels for |mi, ni, me, ne| and after the boundaries for the others. They show the influence of the preceding vowels on the F2 loci.

But when the preceding vowel […], as shown in the results obtained by the autocorrelation functions. The frequency spectra in |me, ne| have wide variety and remarkable differences among speakers. So good results cannot be expected in the discrimination of |me, ne| without adjustment for each individual. In this trial, when two comparators compare respectively 1400 c/s with 2500 c/s and 1100 c/s with 2850 c/s as in (b), |me| is adjusted to indicate a positive output in either of these comparators. In contrast to the result for monosyllables, for three new speakers good results are obtained by the former for one and by the latter for the other two. But by the combination of both the discrimination is in any case difficult for the three, and it lowers the total rate of discrimination.

The circuit diagram is shown in Fig. 5. At the end of the nasal part two monostable multivibrators with durations of 1 ms and 9 ms are triggered successively. During the nasal part plus these 10 ms a relay is closed and the output filtered by the 700 c/s HPF is fed to the group of resonance circuits for the discrimination of |m|, |n|. When the following vowel is one of |a, o, u|, for example for |ma|, |na|, the component of 970 c/s is compared with those of 1480 and 1750 c/s during the 10 ms at the beginning of the vowel; if the former component is larger, the comparator output is adjusted to be positive and is applied to the Schmitt circuit. So the Schmitt circuit provides the information of |m|. For the other nasals too, the Schmitt circuits provide outputs only for |m|. This output is multiplied with the output of the above 9 ms monostable multivibrator (that is, excluding the first 1 ms), and at the trailing edge of this output a monostable multivibrator with duration Tm is triggered. This duration is selected so that Tm > Tv, where Tv is the time elapsed from the end of the nasal part to the time when the decision on the following vowel is made. By multiplying Tm with the information of the following vowel, |ma| is recognized. On the other hand, at the end of the nasal part a monostable multivibrator whose duration Tn is selected as Tn > Tm + 10 ms is triggered, and at the end of this output it is multiplied with the vowel information, and an |n| syllable is indicated if the |m| information has not preceded.

When the following vowel is |i| or |e|, the discrimination of |m|, |n| is carried out by comparing the outputs of resonance circuits in the nasal part. At the trailing edge of this output a monostable multivibrator of 5 ms duration is triggered and this is multiplied with the above monostable multivibrator of duration Tn. Next, at the leading edge of this output a monostable multivibrator of duration Tmo, selected to satisfy Tv < Tmo < Tn, is triggered. Thus the positive output of the comparator within 5 ms before the end of the nasal part represents the information of |m|. The output beyond that time is neglected, for in the middle of the nasal part |n| may represent the information of |m| and vice versa. The above monostable multivibrator of duration Tmo is multiplied with the information of the following vowel |i| or |e|, and the output represents |mi| or |me|. Without the |m| signal, |ni| or |ne| is recognized.

§4. Recognition of the following vowels

The durations of vowels in words are frequently shorter than 100 ms, and so a simple apparatus for recognition of the following vowels combined with nasals has been made by improving the previous method 9); it always recognizes the vowel at […] ms from the end of nasal parts.

§5. Result and Discussion

Because |m|, |n| in words usually follow vowels in Japanese, and for the purpose of examining the influence of the preceding vowels, the samples have been chosen so as to consist of a few syllables containing |i, e, a, o, u| as the preceding vowels. That is:

Syllable   Preceding vowel
           i             e             a        o        u
mi         [illegible]   kemiŋawa      kami     gomi     umi
ni         iniN          enikki        kani     oni      kuni
me         [illegible]   [illegible]   ame      kome     ume
ne         [illegible]   [illegible]   ane      koneko   une
ma         [illegible]   [illegible]   ama      koma     uma
na         [illegible]   [illegible]   ana      onaka    unaru
mo         imo           kemono        kamo     komori   kumo
no         inotʃi        tenohira      kanoko   ono      tsuno
mu         gimu          kemuri        amu      komuŋi   kumu
nu         inu           tenuŋui       tanuki   konuka   kunuŋi

These words were pronounced once by three male students as normally as possible, and the maximum level of speech was adjusted to be within the range of −1.5 to +1.5 on the VU meter attached to the magnetic tape recorder. The results are shown in Table 6.

Table 6 Results of recognition of |m|, |n|. Samples contain 10 Japanese nasal monosyllables and 50 words shown in the paper for each of three speakers. The per-syllable entries (mi, ni, me, ne, ma, na, mo, no, mu, nu) are illegible; only totals are given.

                                                            Speaker A   Speaker B   Speaker C
Numbers of nasal parts which were not extracted             6 in total, 5 of them from one speaker
Numbers of miss decisions between |m| and |n|               12          8           6
Numbers of other phonemes which showed nasal information    4           9           1

The most difficult problem in the extraction of the nasals is how to exclude the vowel |u|. The second formant of |u|, especially when unstressed at the end of words, is extremely weak and not stable. The levels of |u| following the nasals in the ranges of the 500–1600 c/s BPF or the 700 c/s HPF are generally more than twice as large as those of the preceding nasals, and in the 300 c/s LPF are the same as or rather lower than those of the nasals in individual cases. Hence it seems possible to distinguish the nasal from |u|, but as among many syllables the minimum level of |u| may overlap the maximum level of nasals, the perfect exclusion of the whole duration of such |u| becomes impossible by means of the above circuits while showing the correct information of all nasals. A more effective method from another viewpoint, less dependent on the manner of pronunciation, is therefore required. Next, at the beginning or the end of |i| there arise a few cases in which the third and fourth formants are not conspicuous and, in addition, the flapped consonant |r| cannot be excluded sufficiently.

The number of nasals which show no information is 6 in total, including 5 by one speaker. His voice level is somewhat low and cannot operate circuit (v), and the components in the 300 c/s LPF become weak while those at about 1000 c/s are comparatively large. On the other hand, as his vowel |u| hardly provides the information of the nasal, if we adjust to fit his pattern the extraction will be performed, but the |u| of the other speakers may then give the nasal information more often.

In the discrimination of |m|, |n|, the syllables |mi, ni| give good results owing to the stability of |i|. Among the |m|, |n| with erroneous discrimination, the nasal part ended earlier or later than the expected boundary. In this method, as the extraction is influenced by the pattern itself of the frequency spectra together with the position of the F2 locus, it is better to use only the position of the F2 locus, as in case (a), when different sorts of spectra exist. But in any method, so far as a high-grade mechanism with learning is not adopted, it is difficult to operate with no adjustment. Typically, in |me|, |ne|, with different patterns for the three speakers, it is possible to discriminate them by the combination of 1100 and 2850 c/s for two speakers and by the combination of 1400 and 2500 c/s for the other one, but the former combination misjudges for the latter speaker. In order to reduce the influence of the preceding vowels, the method combining two resonance circuits of the frequencies shown in Table 5 is a little effective, but changing the extracting frequencies according to the preceding vowels, especially |o|, is necessary for perfect discrimination.

§6. Conclusion

This paper has introduced a method that obtains fairly good results in the recognition of |m|, |n| in monosyllables and words pronounced by three male speakers in real time.

In the extraction of nasal parts, it is important to make clear the boundary to the vowel and to exclude the vowel |u|, which is difficult. In the discrimination of |m|, |n|, except for |mi|, |ni|, it becomes necessary to change the standard of the discrimination when the preceding vowel is |o|. The personal differences are very conspicuous in |me, ne|, so it is necessary to adjust for each speaker or to adopt high-level learning processes. It is desirable to discriminate |m| from |n| independently of the following vowel and within the duration of the nasal segment, but the frequency spectra in the middle and high ranges are varied and change owing to the existence of the zero. Because the zeros of |m|, |n| and |ŋ| are separated from each other, the rapid and accurate extraction of zeros or, from another point of view, a treatment on the anatomical level is awaited.

Generally, the most fundamental principle in pattern recognition is to extract the characteristic features. In speech this problem seems likely to be solved, but the differences among individuals are the most difficult problem for the mechanical recognition of speech. Though a recognizer for a particular person could be made, the treatment will become difficult as the number of speakers increases. The extraction of characteristics seems to be considerably satisfactory, but there are appreciably many cases in which we cannot grasp the essential matters, in addition to imperfection of the treatments. Together with applying methods such as analysis by synthesis, it is necessary to pay attention to co-articulation, the relative positions of the formants, and the main frequency of the consonant in the analysis of voices.

Though clarifying how the frequency spectra are treated in the aural mechanism is an essential problem, for mechanical recognition on the acoustical level it is more necessary to catch the gross features, emphasizing them while suppressing the others. In the extraction of the nasal part as well as in the discrimination of |m|, |n| we have tried to embody this point, but the result is not free from imperfection, because not every feature is extracted and only gross features are used under the restrictions imposed by real-time treatment.

In these samples |ŋ|, |N| are contained and they show sufficiently the nasal information, but sometimes |N| with low level is missed by circuit (v) in §2.

The writers sincerely thank Mr. Takao Horiguchi, Mr. Masahiro Inoue, Miss Masami Kato and the students of our university for their help.

References

1) J. Oizumi, E. Kubo: J. Acoust. Soc. Japan, 17, p. 3 (1954)
2) A. M. Liberman: J. Acoust. Soc. America, 29, p. 117 (1957) etc.
3) K. Nakata: J. Inst. El. Comm. Eng. Japan, 42, p. 507 (1959)
4) G. Fant: Acoustic Theory of Speech Production (Mouton & Co., 1960) Chap. 2.4
5) O. Fujimura: J. Acoust. Soc. America, 34, p. 1865 (1962)
6) M. Shigenaga: Record of Professional Group on Information Theory, Inst. El. Comm. Eng. Japan, July (1961); J. Inst. El. Comm. Eng. Japan, 45, p. 44 (1962); Reports of Faculty of Eng. Yamanashi Univ. No. 14, p. 53 (1963)
7) M. Shigenaga, T. Horiguchi: Records of Meeting of Acoust. Soc. Japan, May, p. 177 (1962)
8) M. Shigenaga, H. Ariizumi, M. Inoue: Records of Meeting of Acoust. Soc. Japan, May, p. 101 (1964); M. Shigenaga, H. Ariizumi: Record of Professional Group of Speech, June (1964)
9) K. Kato, H. Kiooka, H. Murakami: J. Acoust. Soc. Japan, 14, p. 300 (1958)
