THE INSTITUTE OF ELECTRONICS, INFORMATION AND COMMUNICATION ENGINEERS
TECHNICAL REPORT OF IEICE

Title: Performance evaluation of noisy shouted speech detection based on acoustic model with rahmonic and mel-frequency cepstrum coefficients
Author(s): Fukumori, Takahiro; Nakayama, Masato; Nishiura, Takanobu; Nanjo, Hiroaki
Citation: IEICE Technical Report (信学技報), 116(477): 283-286 (2017)
Issue Date: 2017-03
URL: http://hdl.handle.net/2433/228957
Right: Copyright © 2017 by IEICE
Type: Journal Article
Textversion: publisher

[Poster Presentation]

Performance evaluation of noisy shouted speech detection based on acoustic model with rahmonic and mel-frequency cepstrum coefficients

Takahiro FUKUMORI†, Masato NAKAYAMA†, Takanobu NISHIURA†, and Hiroaki NANJO††

† College of Information Science and Engineering, Ritsumeikan University, 1-1-1 Nojihigashi, Kusatsu, Shiga 525-8577, Japan
†† Academic Center for Computing and Media Studies, Kyoto University, Nihonmatsu-cho, Yoshida, Sakyo-ku, Kyoto 606-8501, Japan
E-mail: †{fukumori@fc, mnaka@fc, nishiura@is}.ritsumei.ac.jp, ††nanjo@media.kyoto-u.ac.jp

Abstract  This paper describes a method based on new combined features with mel-frequency cepstrum coefficients (MFCCs) and rahmonic in order to robustly detect shouted speech in noisy environments. MFCCs collectively make up the mel-frequency cepstrum, and the rahmonic is a subharmonic of the fundamental frequency in the cepstrum domain. In our previous method, a Gaussian mixture model (GMM) was constructed with the proposed features extracted from training data that includes a large number of normal and shouted speech samples. In this paper, evaluation experiments on noisy shouted speech detection were conducted using not only the GMM but also hidden Markov models (HMMs) and deep neural networks (DNNs). The results show that MFCCs and the rahmonic are effective for representing the utterance mechanism, including both the vocal tract and the vocal cords. In addition, the DNN achieved higher performance in noisy environments than the GMM and HMM.

Key words  Shouted speech detection, Noisy environment, Rahmonic, Mel-frequency cepstrum coefficients

This article is a technical report without peer review, and its polished and/or extended version may be published elsewhere.

1. Introduction

Aiming at safe and secure daily life, security systems that detect abnormal events from images captured by surveillance cameras have been studied [1], [2]. Such systems, however, have difficulty detecting abnormal events that occur in the blind spots of the cameras. To address this problem, approaches that detect abnormal events from audio captured with microphones, in addition to camera images, have recently attracted attention [3]-[5]. In particular, since audio can capture what happens in camera blind spots, incorporating it into existing security systems can be expected to improve abnormal event detection performance dramatically.

As a way to detect abnormal events from audio, many methods have been proposed for detecting shouted speech, an unusual kind of speech [6]-[9]. These methods, however, depend heavily on the utterance content and on the SNR of the evaluation environment. One approach to this problem uses a Gaussian mixture model (GMM) built on the mel-frequency cepstrum (mel-frequency cepstral coefficients: MFCCs) [10], [11], and has been reported to detect shouted speech robustly in noisy environments. The mel-cepstrum mainly represents vocal tract information within the speech production mechanism; detection performance can therefore be expected to improve further by analyzing shouted speech with additional features related to vocal cord information.

We previously showed that the rahmonic, a subharmonic component of the fundamental frequency, is effective for shouted speech detection, and proposed a method that detects shouted speech using it in combination with the conventional mel-cepstrum [12]. In that method, shouted speech was detected with a GMM built from mel-cepstra and rahmonics extracted from a large number of normal and shouted speech samples. In this paper, we extend the acoustic model used for detection from the GMM to a hidden Markov model (HMM) and a deep neural network (DNN), and evaluate the shouted speech detection performance of each acoustic model in noisy environments.

2. Shouted speech detection using rahmonic and mel-cepstrum

2.1 Speech features (mel-cepstrum and rahmonic)

We have previously proposed a shouted speech detection method using the rahmonic and the mel-cepstrum [12]. The mel-cepstrum consists of cepstral coefficients that take human auditory characteristics into account, and is used in speech recognition as a vocal tract feature for identifying phonemes [13]. The rahmonic, on the other hand, is a subharmonic component of the fundamental frequency and represents characteristics related to human vocal fold motion [14]. Our previous study [12] reported that both features differ between normal and shouted speech.

Figures 1 and 2 show log power spectra and cepstra of normal and shouted speech. Looking first at the log power spectra, the harmonic components of the shouted speech in Fig. 1(b) are emphasized compared with the normal speech in Fig. 1(a). Likewise, in the cepstra, the rahmonic that is barely visible for the normal speech in Fig. 2(a) can be clearly observed for the shouted speech in Fig. 2(b). Since normal and shouted speech thus differ in both the frequency domain and the cepstrum domain, using the mel-cepstrum and the rahmonic can be expected to enable highly accurate shouted speech detection.

Fig. 1  Log power spectra of (a) normal speech and (b) shouted speech (frequency [kHz] vs. power [dB])
Fig. 2  Cepstra of (a) normal speech and (b) shouted speech (quefrency [ms] vs. amplitude; F0 and rahmonic peaks marked)
Fig. 3  Overview of the shouted speech detection algorithm: normal and shouted speech models (GMM, HMM, or DNN) are built from MFCCs, ΔMFCCs, and rahmonic extracted from recorded speech, and observed speech is classified with the trained models
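The cepstral picture of Section 2.1, an F0 peak with a rahmonic at roughly twice its quefrency, can be sketched numerically. The following is a minimal illustration rather than the authors' implementation; the function names, the F0 search range, and the use of plain NumPy are assumptions:

```python
import numpy as np

def real_cepstrum(frame):
    """Real cepstrum: inverse FFT of the log magnitude spectrum."""
    spectrum = np.fft.rfft(frame * np.hamming(len(frame)))
    return np.fft.irfft(np.log(np.abs(spectrum) + 1e-12))

def f0_and_rahmonic(frame, fs=16000, f0_range=(80.0, 400.0)):
    """Locate the cepstral F0 peak within a plausible F0 range, then read
    the amplitude at twice its quefrency, where the rahmonic (a
    subharmonic of F0) appears."""
    cep = real_cepstrum(frame)
    lo = int(fs / f0_range[1])            # shortest F0 period, in samples
    hi = int(fs / f0_range[0])            # longest F0 period, in samples
    q0 = lo + int(np.argmax(cep[lo:hi]))  # quefrency bin of the F0 peak
    q_rah = 2 * q0                        # rahmonic quefrency (~2x F0 period)
    rah = cep[q_rah] if q_rah < len(cep) else 0.0
    return q0, cep[q0], rah
```

On shouted speech, the amplitude read at twice the F0 quefrency would be expected to stand out clearly, mirroring Fig. 2(b).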

2.2 Detection algorithm

Figure 3 shows the shouted speech detection procedure. First, acoustic models are built from the rahmonics and mel-cepstra extracted from pre-recorded normal and shouted speech. Next, the rahmonic and mel-cepstrum are extracted from speech observed in the actual evaluation environment, and the observed speech is classified as either normal or shouted speech using these features and the trained acoustic models.

The conventional method uses a GMM as the acoustic model. In this paper, we extend the acoustic model for shouted speech detection from the conventional Gaussian mixture model (GMM) to a hidden Markov model (HMM) and a deep neural network (DNN), and evaluate the influence of each acoustic model on detection performance. The GMM models shouted speech with the average speech features of the observed speech. The HMM is an acoustic model that can represent the temporal variation of speech features; since shouted speech differs from normal speech in utterance duration and in the temporal variation of energy [15], taking the temporal structure of the features into account with an HMM can be expected to improve shouted speech detection. Finally, the DNN is a neural network with a deep layer structure. By associating the input layer with the acoustic features (here, the mel-cepstrum and rahmonic) and the output layer with the utterance styles (normal and shouted speech), the DNN can be used as an acoustic model for shouted speech detection. Moreover, since the network weights the input acoustic features as they propagate toward the output layer, it can be expected to preferentially extract features effective for shouted speech detection, independently of the evaluation environment.

Table 1  Experimental conditions

  Training data:     Female speaker: 400 samples / Male speaker: 400 samples
  Testing data:      Female speaker: 100 samples / Male speaker: 100 samples
  Sampling:          16 kHz / 16 bit
  Acoustic features: MFCCs (12 orders), ΔMFCCs (12 orders), rahmonic (1 order)
  Acoustic models:   1. GMM  2. HMM (3 states)  3. DNN
  Noise:             White noise, speech babble [18]
  SNR:               0, 10, 20, ∞ dB
  Frame length:      25 ms (Hamming window)
  Frame shift:       10 ms

3. Evaluation experiments

3.1 Experimental conditions

In these experiments, gender-dependent multi-condition acoustic models (GMM, 3-state HMM, and DNN) were built from training speech made by adding noise to clean speech (400 utterances each from female and male speakers) at four SNRs (∞, 20, 10, and 0 dB). For the GMM and HMM, eight mixture numbers (1, 2, 4, 8, 16, 32, 64, and 128) were evaluated. HTK [16] was used to build the GMM and HMM, and Kaldi [17] to build the DNN. For the DNN, three numbers of spliced input frames were used for the acoustic features: 1 frame (the current frame only), 7 frames (including the 3 preceding and 3 following frames), and 11 frames (including the 5 preceding and 5 following frames); the network had 3 hidden layers with 20 units per layer. For the identification of utterance style, a speaker-open test was assumed: the test speech came from speakers different from those used to train the acoustic models. Three feature sets were compared: the mel-cepstrum alone, the rahmonic alone, and the mel-cepstrum and rahmonic combined. White noise and speech babble from NOISEX-92 [18] were used as the noise. The evaluation measure was the identification rate [%], i.e. the proportion of all normal and shouted speech samples whose utterance style was correctly identified. Since only a small amount of evaluation speech was available, 5-fold cross-validation was performed.
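The DNN's spliced input (1, 7, or 11 frames) stacks each frame with its neighbouring context frames. A small sketch of such splicing follows, padding edge frames by repetition (a common convention; the paper does not state how edges were handled):

```python
import numpy as np

def splice_frames(feats: np.ndarray, context: int) -> np.ndarray:
    """Stack each feature frame with its +/- `context` neighbours.
    `feats` has shape (n_frames, n_dims); the result has shape
    (n_frames, (2*context + 1) * n_dims). Edge frames are padded by
    repeating the first/last frame."""
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    n = feats.shape[0]
    window = 2 * context + 1
    return np.stack([padded[i:i + window].reshape(-1) for i in range(n)])
```

With `context=3` each row carries the 7-frame input used in the experiments; `context=5` gives the 11-frame input.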

3.2 Experimental results

Figures 4 and 5 show the average identification rates for each mixture number with the GMM and the HMM as the acoustic model. With the GMM (Fig. 4), the highest identification performance was obtained when the acoustic model was built with 128 mixtures using the mel-cepstrum and rahmonic together; with the HMM (Fig. 5), it was obtained with 64 mixtures using the same combined features. This confirms that combining the mel-cepstrum and the rahmonic represents the characteristics of shouted speech effectively. Comparing the two acoustic models, the best identification rate of the GMM exceeded that of the HMM. This suggests that the HMM could not accurately represent the temporal variation of the speech features; the concrete temporal variation of the features of normal and shouted speech needs to be analyzed in future work.

Fig. 4  Average identification rate for each mixture number (acoustic model: GMM)
Fig. 5  Average identification rate for each mixture number (acoustic model: HMM)
Fig. 6  Average identification rate for each acoustic model

Next, Tables 2 and 3 show the identification rates of the utterance style with the DNN for female and male speakers, and Fig. 6 shows the average identification rate for each acoustic model. Bold figures in the tables indicate the highest identification rate in each environment, and "M", "R", and "M+R" denote the results with the mel-cepstrum, the rahmonic, and both features combined, respectively.

Table 2  Identification rates for female speakers [%] (acoustic model: DNN)

  Number of      SNR=∞ dB          SNR=20 dB         SNR=10 dB         SNR=0 dB
  input frames   M     R     M+R   M     R     M+R   M     R     M+R   M     R     M+R
  1 frame        92.3  69.5  90.5  94.4  59.4  93.7  92.5  68.4  91.6  90.8  87.4  92.2
  7 frames       95.3  68.0  95.6  96.4  65.9  96.9  95.3  75.7  95.7  94.2  93.4  95.7
  11 frames      95.1  68.6  95.7  96.5  66.6  96.8  95.5  75.7  95.8  96.4  94.9  95.5
  *M: MFCCs, R: rahmonic, M+R: MFCCs and rahmonic

Table 3  Identification rates for male speakers [%] (acoustic model: DNN)

  Number of      SNR=∞ dB          SNR=20 dB         SNR=10 dB         SNR=0 dB
  input frames   M     R     M+R   M     R     M+R   M     R     M+R   M     R     M+R
  1 frame        84.9  73.7  93.4  90.5  70.4  94.8  90.3  73.8  94.3  82.7  61.7  87.1
  7 frames       89.4  80.7  95.1  93.8  79.1  96.1  93.3  80.9  95.7  87.5  54.9  92.2
  11 frames      91.2  81.3  94.5  94.5  80.1  96.0  94.1  80.7  96.1  88.6  53.4  91.5
  *M: MFCCs, R: rahmonic, M+R: MFCCs and rahmonic

Looking first at the DNN identification rates, 90% or higher was achieved in every environment regardless of the noise SNR. In particular, when the DNN was built with the mel-cepstrum and rahmonic combined, the identification rate was the highest in every environment except white noise at SNR = 0 dB. This result also suggests that the speech production mechanism, i.e. the vocal tract and vocal fold characteristics, is represented efficiently by the mel-cepstrum and the rahmonic, respectively. Turning to the per-model results, the average identification rates with the GMM and HMM all fell below 90%, whereas the DNN built with the combined mel-cepstrum and rahmonic achieved an average identification rate of 95% or higher. This is presumably because the DNN, compared with the GMM and HMM, could preferentially extract features effective for shouted speech detection without being affected by the noise. From the above, we confirmed that a DNN built on the mel-cepstrum and rahmonic is effective for shouted speech detection.

4. Conclusion

In this paper, for shouted speech detection using the rahmonic and the mel-cepstrum, we extended the acoustic model used to identify the utterance style from the conventional GMM to an HMM and a DNN, and evaluated how the choice of acoustic model affects shouted speech detection performance. The experimental results showed that, whichever acoustic model was used, combining the rahmonic and the mel-cepstrum achieved high shouted speech detection performance. Furthermore, the DNN, which can preferentially extract features effective for shouted speech detection, achieved a detection performance of 90% or higher regardless of the noise type and SNR, and improved on the conventional GMM and HMM. In future work, assuming real environments, we plan to evaluate shouted speech detection in environments that contain not only noise but also reverberation.

Acknowledgment  This work was partly supported by a JSPS KAKENHI grant (16K16094).

References

[1] W. Yao-Dong, T. Takeshi, and I. Idaku, "HFR-video-based machinery surveillance for high-speed periodic operations," Journal of System Design and Dynamics, vol. 5, no. 6, pp. 1310-1325, 2011.
[2] W. Huang, T. K. Chiew, H. Li, T. S. Kok, and J. Biswas, "Scream detection for home applications," 5th IEEE Conference on Industrial Electronics and Applications, pp. 2115-2120, 2010.
[3] M. Cowling, "Comparison of techniques for environmental sound recognition," Pattern Recognition Letters, vol. 24, no. 15, pp. 2895-2907, 2003.
[4] K. M. Kim, J. W. Jung, S. Y. Chun, and K. S. Park, "Acoustic intruder detection system for home security," IEEE Transactions on Consumer Electronics, vol. 51, no. 1, pp. 130-138, 2005.
[5] K. Hayashida, J. Ogawa, M. Nakayama, T. Nishiura, and Y. Yamashita, "Multi-stage identification for abnormal/warning sounds detection based on maximum likelihood classification," ICA 2013, Paper ID: 1pSPb4, 2013.
[6] J. L. Rouas, J. Louradour, and S. Ambellouis, "Audio events detection in public transport vehicle," IEEE Intelligent Transportation Systems Conference, pp. 733-738, 2006.
[7] P. K. Atrey, N. C. Maddage, and M. S. Kankanhalli, "Audio based event detection for multimedia surveillance," ICASSP 2006, pp. 813-816, 2006.
[8] S. Ntalampiras, I. Potamitis, and N. Fakotakis, "An adaptive framework for acoustic monitoring of potential hazards," EURASIP Journal on Audio, Speech, and Music Processing, 2009.
[9] G. Valenzise, L. Gerosa, M. Tagliasacchi, F. Antonacci, and A. Sarti, "Scream and gunshot detection and localization for audio-surveillance systems," IEEE Conference on Advanced Video and Signal Based Surveillance, pp. 21-26, 2007.
[10] J. Pohjalainen, P. Alku, and T. Kinnunen, "Shout detection in noise," ICASSP 2011, pp. 4968-4971, 2011.
[11] W. Huang, T. K. Chiew, H. Li, T. S. Kok, and J. Biswas, "Scream detection for home applications," Industrial Electronics and Applications 2010, pp. 2115-2120, 2010.
[12] ֟໺ ௚ਓ, T. Fukumori, M. Nakayama, T. Nishiura, and H. Nanjo, "Study of shouted speech detection using rahmonic and mel-cepstrum," Proc. Autumn Meeting of the Acoustical Society of Japan, pp. 169-170, 2013 (in Japanese).
[13] J. Benesty, M. M. Sondhi, and Y. Huang, Springer Handbook of Speech Processing, Springer, 2008.
[14] A. M. Noll, "Cepstrum pitch determination," Journal of the Acoustical Society of America, vol. 41, no. 2, pp. 293-309, 1967.
[15] C. Zhang and J. H. L. Hansen, "Analysis and classification of speech mode: whispered through shouted," INTERSPEECH 2007, pp. 2289-2292, 2007.
[16] HTK Speech Recognition Toolkit, http://htk.eng.cam.ac.uk/
[17] Kaldi, http://kaldi-asr.org/doc/index.html
[18] A. Varga and H. J. M. Steeneken, "Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems," Speech Communication, vol. 12, no. 3, pp. 247-251, 1993.
