Title
Performance evaluation of noisy shouted speech detection based on acoustic models with rahmonic and mel-frequency cepstrum coefficients
Author(s)
Fukumori, Takahiro; Nakayama, Masato; Nishiura, Takanobu; Nanjo, Hiroaki
Citation
IEICE Technical Report (信学技報) (2017), 116(477): 283-286
Issue Date
2017-03
URL
http://hdl.handle.net/2433/228957
Right
Copyright © 2017 by IEICE
Type
Journal Article
Textversion
publisher
THE INSTITUTE OF ELECTRONICS,
INFORMATION AND COMMUNICATION ENGINEERS
TECHNICAL REPORT OF IEICE.
[Poster Presentation] Performance evaluation of noisy shouted speech detection based on acoustic models with rahmonic and mel-frequency cepstrum coefficients
Takahiro FUKUMORI†, Masato NAKAYAMA†, Takanobu NISHIURA†, and Hiroaki NANJO††
† College of Information Science and Engineering, Ritsumeikan University, 1-1-1 Nojihigashi, Kusatsu, Shiga 525-8577, Japan
†† Academic Center for Computing and Media Studies, Kyoto University, Nihonmatsu-cho, Yoshida, Sakyo-ku, Kyoto 606-8501, Japan
E-mail: †{fukumori@fc, mnaka@fc, nishiura@is}.ritsumei.ac.jp, ††nanjo@media.kyoto-u.ac.jp
Abstract  This paper describes a method for detecting shouted speech in noisy environments using rahmonic and mel-frequency cepstrum coefficients (MFCCs). MFCCs form a cepstral representation that takes human auditory characteristics into account, and serve as vocal-tract features for identifying phonemes. Rahmonic is a subharmonic of the fundamental frequency and represents characteristics of human vocal-cord movement. In our previous work, we detected shouted speech with a Gaussian mixture model (GMM) built on MFCCs and rahmonic extracted from a large amount of normal and shouted speech. In this paper, we extend this acoustic model to hidden Markov models (HMMs) and deep neural networks (DNNs), and evaluate shouted speech detection performance in noisy environments. The evaluation experiments confirmed that the speech production mechanism (vocal-tract and vocal-cord characteristics) can be represented effectively with MFCCs and rahmonic. In addition, in most noisy environments, using a DNN as the acoustic model achieved higher shouted speech detection performance than the GMM and HMM.
Keywords  Shouted speech detection, Noisy environment, Rahmonic, Mel-frequency cepstrum coefficients
Performance evaluation of noisy shouted speech detection based on
acoustic model with rahmonic and mel-frequency cepstrum coefficients
Takahiro FUKUMORI†, Masato NAKAYAMA†, Takanobu NISHIURA†, and Hiroaki NANJO††
† College of Information Science and Engineering, Ritsumeikan University, 1-1-1 Nojihigashi, Kusatsu, Shiga 525-8577, Japan
†† Academic Center for Computing and Media Studies, Kyoto University, Nihonmatsu-cho, Yoshida, Sakyo-ku, Kyoto 606-8501, Japan
E-mail: †{fukumori@fc, mnaka@fc, nishiura@is}.ritsumei.ac.jp, ††nanjo@media.kyoto-u.ac.jp
Abstract  This paper describes a method based on new combined features with mel-frequency cepstrum coefficients (MFCCs) and rahmonic in order to robustly detect shouted speech in noisy environments. MFCCs collectively make up the mel-frequency cepstrum, and rahmonic is a subharmonic of the fundamental frequency in the cepstrum domain. In our previous method, Gaussian mixture models (GMMs) were constructed with the proposed features extracted from training data that includes a large number of normal and shouted speech samples. In this paper, evaluation experiments on noisy shouted speech detection were conducted using not only GMMs but also hidden Markov models (HMMs) and deep neural networks (DNNs). The results show that MFCCs and rahmonic were effective for representing the utterance mechanism, including both the vocal tract and the vocal cords. In addition, the DNN achieved higher performance in noisy environments than the GMM and HMM.
Key words Shouted speech detection, Noisy environment, Rahmonic, Mel-frequency cepstrum coefficients
1. Introduction
Aiming at safe and secure living, surveillance systems that detect abnormal events from image information captured by cameras have been studied [1], [2]. However, such systems have difficulty detecting abnormal events that occur in camera blind spots.
IEICE Technical Report EA2016-132, SIP2016-187, SP2016-127 (2017-03)
This article is a technical report without peer review, and its polished and/or extended version may be published elsewhere.
To address this problem, approaches that detect abnormal events not only from image information captured by cameras but also from sound information captured by microphones have recently been investigated [3]-[5]. In particular, incorporating sound information, which can capture situations in camera blind spots, into current surveillance systems is expected to improve abnormal-event detection performance dramatically.
As methods for detecting abnormal events from sound information, many techniques for detecting shouted speech, an atypical kind of speech, have been proposed [6]-[9]. However, these methods had the problem that their utterance-style evaluation strongly depends on the signal-to-noise ratio (SNR) of the environment. As an approach to this problem, a method using Gaussian mixture models (GMMs) built on mel-frequency cepstral coefficients (MFCCs) has been proposed [10], [11], and it has been reported to detect shouted speech robustly in noisy environments. The mel-frequency cepstrum mainly represents vocal-tract information within the speech production mechanism; analyzing shouted speech with additional acoustic features related to vocal-cord information can therefore be expected to further improve detection performance.
We previously showed that rahmonic, a subharmonic of the fundamental frequency, is effective for shouted speech detection, and proposed a method that detects shouted speech by combining it with conventional MFCCs [12]. In that method, shouted speech was detected with a GMM built on MFCCs and rahmonic extracted from a large amount of normal and shouted speech. In this paper, we extend the acoustic model used for detection from the GMM to hidden Markov models (HMMs) and deep neural networks (DNNs), and evaluate the shouted speech detection performance of each acoustic model in noisy environments.
2. Shouted speech detection with rahmonic and mel-frequency cepstrum coefficients
2.1 Acoustic features (MFCCs and rahmonic)
We previously proposed a shouted speech detection method using rahmonic and MFCCs [12]. The mel-frequency cepstrum is a cepstral representation that takes human auditory characteristics into account, and MFCCs are used in speech recognition as vocal-tract features for identifying phonemes [13]. Rahmonic, on the other hand, is a subharmonic of the fundamental frequency and represents characteristics related to human vocal-cord movement [14]. Our previous study [12] reported that these acoustic features differ between normal and shouted speech.
Figs. 1 and 2 show the log power spectra and cepstra of normal and shouted speech. Focusing first on the log power spectra, the harmonics of the shouted speech in Fig. 1(b) appear more strongly emphasized than those of the normal speech in Fig. 1(a). In the cepstra, the rahmonic, which did not appear prominently for the normal speech in Fig. 2(a), can be clearly observed for the shouted speech in Fig. 2(b). Since normal and shouted speech thus differ in both the frequency domain and the cepstrum domain, using MFCCs and rahmonic is expected to enable accurate shouted speech detection.

[Fig. 1 Log power spectra of (a) normal speech and (b) shouted speech]
[Fig. 2 Cepstra of (a) normal speech and (b) shouted speech, with F0 and rahmonic peaks marked]
[Fig. 3 Overview of the shouted speech detection algorithm: normal- and shouted-speech acoustic models (GMM, HMM, or DNN) are trained from extracted features (MFCCs, ΔMFCCs, rahmonic), and observed speech is classified as normal or shouted speech]
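The cepstral peaks discussed above can be illustrated with a short numpy sketch (not the authors' implementation; the frame length, synthetic signal, and quefrency search range are assumptions for illustration). It computes the real cepstrum of a synthetic harmonic frame and picks the F0 peak; in shouted speech, the rahmonic would appear as an additional peak near twice the F0 quefrency.

```python
import numpy as np

fs = 16000          # sampling rate [Hz], as in the experimental conditions
f0 = 200.0          # fundamental frequency of the synthetic frame [Hz]
n = 2048            # frame length in samples (illustrative choice)

# Synthetic voiced frame: a sum of harmonics of f0, Hann-windowed.
t = np.arange(n) / fs
frame = sum(np.cos(2 * np.pi * k * f0 * t) for k in range(1, 21))
frame *= np.hanning(n)

# Real cepstrum: inverse FFT of the log magnitude spectrum.
spectrum = np.abs(np.fft.rfft(frame)) + 1e-12
cepstrum = np.fft.irfft(np.log(spectrum))

# Pick the F0 peak in a plausible quefrency range (2.5-12.5 ms).
lo, hi = int(0.0025 * fs), int(0.0125 * fs)
q_f0 = lo + int(np.argmax(cepstrum[lo:hi]))
print(f"F0 peak at {1000 * q_f0 / fs:.2f} ms -> {fs / q_f0:.0f} Hz")

# The rahmonic of shouted speech would show up as a second peak near
# twice this quefrency (a subharmonic of F0 in the frequency domain).
rahmonic_amp = cepstrum[2 * q_f0]
```

For this 200 Hz synthetic frame, the F0 peak falls near a quefrency of 5 ms (80 samples at 16 kHz); real shouted speech would additionally raise the amplitude read at twice that quefrency.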
2.2 Detection algorithm
Fig. 3 shows the shouted speech detection procedure. First, acoustic models are constructed using rahmonic and MFCCs extracted from normal and shouted speech recorded in advance. Next, rahmonic and MFCCs are extracted from observed speech recorded in the actual evaluation environment, and the observed speech is classified as either normal or shouted speech using these acoustic features and the trained acoustic models.
The conventional method uses a GMM as the acoustic model. In this paper, we extend the acoustic model used for shouted speech detection from the conventional Gaussian mixture model (GMM) to the hidden Markov model (HMM) and the deep neural network (DNN), and evaluate the influence of each acoustic model on shouted speech detection. The GMM models shouted speech using the average acoustic features of the observed speech. The HMM is an acoustic model that can represent the temporal change of acoustic features; since shouted speech differs from normal speech in utterance duration and the temporal variation of energy [15], taking the temporal structure of the acoustic features into account with an HMM can be expected to improve detection performance. The DNN is a type of neural network with a deep structure. In particular, by mapping acoustic features (in this paper, MFCCs and rahmonic) to its input and utterance styles (normal and shouted speech) to its output, the DNN can be used as an acoustic model for shouted speech detection. Moreover, in the process of weighting the input acoustic features and propagating them to the output, the DNN is expected to selectively extract features effective for shouted speech detection, independently of the evaluation environment.

Table 1 Experimental conditions
  Training data:     Female speakers: 400 samples / Male speakers: 400 samples
  Testing data:      Female speakers: 100 samples / Male speakers: 100 samples
  Sampling:          16 kHz / 16 bit
  Acoustic features: 12th-order MFCCs, 12th-order ΔMFCCs, 1st-order Rahmonic
  Acoustic models:   1. GMM, 2. HMM (3 states), 3. DNN
  Noise:             White noise, Speech babble [18]
  SNR:               0, 10, 20, ∞ dB
  Frame length:      25 ms (Hamming window)
  Frame shift:       10 ms
3. Evaluation experiments
3.1 Experimental conditions
In these experiments, gender-dependent multi-condition acoustic models (GMM, 3-state HMM, and DNN) were constructed using training speech made by adding noise to clean speech (400 utterances each for female and male speakers) at four SNRs (∞, 20, 10, and 0 dB). For the GMM and HMM, eight numbers of mixture components (1, 2, 4, 8, 16, 32, 64, and 128) were evaluated. HTK [16] was used to build the GMMs and HMMs, and Kaldi [17] was used to build the DNNs. For the DNN, three numbers of spliced input frames were examined: 1 frame (the current frame only), 7 frames (including the 3 preceding and 3 following frames), and 11 frames (including the 5 preceding and 5 following frames); the network had 3 hidden layers with 20 units per layer. Utterance-style identification assumed a speaker-open test, using speech from speakers different from those used to train the acoustic models. Three sets of acoustic features were compared: MFCCs alone, rahmonic alone, and MFCCs combined with rahmonic. White noise and speech babble noise from NOISEX-92 [18] were used. As the evaluation measure, we used the proportion of speech samples whose utterance style was correctly identified among all normal and shouted speech (identification accuracy) [%]. Considering the small amount of evaluation speech available for this experiment, five-fold cross-validation was performed.
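The multi-condition training data described above mixes noise into clean speech at a target SNR. A common way to compute the noise gain is sketched below (an illustration of the standard power-ratio formula, not the authors' exact tooling; the signals are toy examples):

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Scale `noise` so that 10*log10(P_speech / P_noise) == snr_db, then add it."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + gain * noise

rng = np.random.default_rng(1)
speech = np.sin(2 * np.pi * 200 * np.arange(16000) / 16000)  # 1 s toy "speech"
noise = rng.standard_normal(16000)                            # white noise

noisy = add_noise_at_snr(speech, noise, snr_db=10)

# Check the achieved SNR of the added noise component.
scaled = noisy - speech
achieved = 10 * np.log10(np.mean(speech ** 2) / np.mean(scaled ** 2))
```

The same scaling applies for each of the four conditions (∞, 20, 10, 0 dB); the ∞ dB condition simply leaves the clean speech unmodified.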
3.2 Experimental results
Figs. 4 and 5 show the average identification accuracy for each number of mixture components when the GMM and HMM were used as the acoustic model. With the GMM (Fig. 4), the highest identification performance was achieved when the acoustic model was built with 128 mixture components using MFCCs and rahmonic; with the HMM (Fig. 5), it was achieved when the acoustic model was built with 64 mixture components using MFCCs and rahmonic. This confirms that combining MFCCs and rahmonic represents the characteristics of shouted speech effectively. Comparing the acoustic models, the best accuracy of the GMM exceeded that of the HMM. This indicates that the HMM could not accurately represent the temporal change of the acoustic features, and the concrete temporal changes of the features of normal and shouted speech need to be analyzed in future work.

[Fig. 4 Average identification accuracy vs. number of mixture components (acoustic model: GMM)]
[Fig. 5 Average identification accuracy vs. number of mixture components (acoustic model: HMM)]
[Fig. 6 Average identification accuracy for each acoustic model]

Next, Tables 2 and 3 show the utterance-style identification accuracy for female and male speakers with the DNN, and Fig. 6 shows the average identification accuracy for each acoustic model. In the tables, "M", "R", and "M+R" denote the results with MFCCs alone, rahmonic alone, and both features combined, respectively. Focusing on the identification accuracy with the DNN, accuracy of 90% or higher was achieved in all environments regardless of the SNR of the noise. In particular, when the DNN was built with MFCCs and rahmonic combined, the accuracy was the highest in all environments except white noise at SNR = 0 dB. These results suggest that the speech production mechanism (vocal-tract and vocal-cord characteristics) is represented efficiently by MFCCs and rahmonic, respectively.

Table 2 Identification accuracy for female speakers [%] (acoustic model: DNN)
  Number of      SNR=∞ dB         SNR=20 dB        SNR=10 dB        SNR=0 dB
  input frames   M    R    M+R    M    R    M+R    M    R    M+R    M    R    M+R
  1 frame        92.3 69.5 90.5   94.4 59.4 93.7   92.5 68.4 91.6   90.8 87.4 92.2
  7 frames       95.3 68.0 95.6   96.4 65.9 96.9   95.3 75.7 95.7   94.2 93.4 95.7
  11 frames      95.1 68.6 95.7   96.5 66.6 96.8   95.5 75.7 95.8   96.4 94.9 95.5
  *M: MFCCs, R: Rahmonic, M+R: MFCCs and Rahmonic

Table 3 Identification accuracy for male speakers [%] (acoustic model: DNN)
  Number of      SNR=∞ dB         SNR=20 dB        SNR=10 dB        SNR=0 dB
  input frames   M    R    M+R    M    R    M+R    M    R    M+R    M    R    M+R
  1 frame        84.9 73.7 93.4   90.5 70.4 94.8   90.3 73.8 94.3   82.7 61.7 87.1
  7 frames       89.4 80.7 95.1   93.8 79.1 96.1   93.3 80.9 95.7   87.5 54.9 92.2
  11 frames      91.2 81.3 94.5   94.5 80.1 96.0   94.1 80.7 96.1   88.6 53.4 91.5
  *M: MFCCs, R: Rahmonic, M+R: MFCCs and Rahmonic

Focusing on the results for each acoustic model, the average identification accuracies with the GMM and HMM were all below 90%, whereas the average accuracy of the DNN built with MFCCs and rahmonic combined reached 95% or higher. This is presumably because, compared with the GMM and HMM, the DNN could selectively extract features effective for shouted speech detection without being affected by the noise. From the above, we confirmed that a DNN built on MFCCs and rahmonic is effective for shouted speech detection.
4. Conclusion
In this paper, for shouted speech detection with rahmonic and MFCCs, we extended the acoustic model used to identify utterance styles from the conventional GMM to HMMs and DNNs, and evaluated how the difference in acoustic models affects shouted speech detection performance. The experimental results showed that combining rahmonic and MFCCs achieved high shouted speech detection performance with every acoustic model. Furthermore, using a DNN, which can selectively extract features effective for shouted speech detection, achieved detection performance of 90% or higher regardless of the noise type and SNR, and improved the detection performance over the conventional GMM and HMM. In future work, assuming real environments, we plan to evaluate shouted speech detection in environments where not only noise but also reverberation is present.
Acknowledgment  This work was partly supported by KAKENHI Grant Number 16K16094.
References
[1] W. Yao-Dong, T. Takeshi, and I. Idaku, "HFR-video-based machinery surveillance for high-speed periodic operations," Journal of System Design and Dynamics, vol. 5, no. 6, pp. 1310-1325, 2011.
[2] W. Huang, T. K. Chiew, H. Li, T. S. Kok, and J. Biswas, "Scream detection for home applications," 5th IEEE Conference on Industrial Electronics and Applications, pp. 2115-2120, 2010.
[3] M. Cowling, "Comparison of techniques for environmental sound recognition," Pattern Recognition Letters, vol. 24, no. 15, pp. 2895-2907, 2003.
[4] K. M. Kim, J. W. Jung, S. Y. Chun, and K. S. Park, "Acoustic intruder detection system for home security," IEEE Transactions on Consumer Electronics, vol. 51, no. 1, pp. 130-138, 2005.
[5] K. Hayashida, J. Ogawa, M. Nakayama, T. Nishiura, and Y. Yamashita, "Multi-stage identification for abnormal/warning sounds detection based on maximum likelihood classification," ICA 2013, Paper ID: 1pSPb4, 2013.
[6] J. L. Rouas, J. Louradour, and S. Ambellouis, "Audio events detection in public transport vehicle," IEEE Intelligent Transportation Systems Conference, pp. 733-738, 2006.
[7] P. K. Atrey, N. C. Maddage, and M. S. Kankanhalli, "Audio based event detection for multimedia surveillance," ICASSP 2006, pp. 813-816, 2006.
[8] S. Ntalampiras, I. Potamitis, and N. Fakotakis, "An adaptive framework for acoustic monitoring of potential hazards," EURASIP Journal on Audio, Speech, and Music Processing, 2009.
[9] G. Valenzise, L. Gerosa, M. Tagliasacchi, F. Antonacci, and A. Sarti, "Scream and gunshot detection and localization for audio-surveillance systems," IEEE Conference on Advanced Video and Signal Based Surveillance, pp. 21-26, 2007.
[10] J. Pohjalainen, P. Alku, and T. Kinnunen, "Shout detection in noise," ICASSP 2011, pp. 4968-4971, 2011.
[11] W. Huang, T. K. Chiew, H. Li, T. S. Kok, and J. Biswas, "Scream detection for home applications," Industrial Electronics and Applications 2010, pp. 2115-2120, 2010.
[12] T. Fukumori, M. Nakayama, T. Nishiura, and H. Nanjo, "A study on shouted speech detection with rahmonic and mel-frequency cepstrum coefficients" (in Japanese), Autumn Meeting of the Acoustical Society of Japan, pp. 169-170, 2013.
[13] J. Benesty, M. M. Sondhi, and Y. Huang, "Springer handbook of speech processing," Springer, 2008.
[14] A. M. Noll, "Cepstrum pitch determination," Journal of the Acoustical Society of America, vol. 41, no. 2, pp. 293-309, 1967.
[15] C. Zhang and J. H. L. Hansen, "Analysis and classification of speech mode: whispered through shouted," INTERSPEECH 2007, pp. 2289-2292, 2007.
[16] HTK Software Toolkit, http://htk.eng.cam.ac.uk/
[17] Kaldi, http://kaldi-asr.org/doc/index.html
[18] A. Varga and H. J. M. Steeneken, "Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems," Speech Communication, vol. 12, no. 3, pp. 247-251, 1993.