Objective Evaluation Results

5.5 Objective Evaluation Experiment

5.5.1 Results of the Objective Evaluation Experiment

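Figures 15 and 16, panels (a) and (b), show the relationship between the EER and the mean PESQ and STOI scores for the i-vector and x-vector systems, respectively. Because the i-vector and x-vector systems show the same tendency in the correlation between the EER and the objective measures, the discussion below focuses on Fig. 15. Fig. 15(a) and (b) show that the EER is only weakly correlated with the PESQ and STOI scores. The bandwidth extension method with the highest naturalness is (E)Up in terms of PESQ and (F)Shift in terms of STOI. However, the purpose of this experiment is to improve speaker verification accuracy, not naturalness, so this point is not discussed further. Next, Fig. 15(c) and Fig. 16(c) show the relationship between the EER and RMS-LSD, which measures the distance from the original speech. A lower RMS-LSD means that the processed speech is closer to the original speech, and (H)N-BWE yields low values for both the EER and RMS-LSD. In summary, although the objective scores are not strongly correlated with the EER, N-BWE achieves both a low EER and a low RMS-LSD and is therefore a well-performing method.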
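As an illustration of the RMS-LSD measure discussed above, the following is a minimal sketch in Python. It assumes one common definition, namely the root-mean-square difference between the log-magnitude spectra (in dB) of each frame, averaged over frames. The frame length, hop size, Hann window, and the function names `stft_magnitude` and `rms_lsd` are illustrative assumptions and are not taken from this thesis.

```python
import numpy as np


def stft_magnitude(x, frame_len=512, hop=256):
    """Magnitude spectra of Hann-windowed frames, one row per frame."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack(
        [x[i * hop: i * hop + frame_len] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1))


def rms_lsd(reference, processed, frame_len=512, hop=256, eps=1e-10):
    """Frame-averaged RMS log-spectral distance in dB (assumed definition)."""
    # Compare only the overlapping part of the two signals.
    n = min(len(reference), len(processed))
    ref_db = 20.0 * np.log10(stft_magnitude(reference[:n], frame_len, hop) + eps)
    proc_db = 20.0 * np.log10(stft_magnitude(processed[:n], frame_len, hop) + eps)
    # RMS over frequency bins within each frame, then mean over frames.
    per_frame = np.sqrt(np.mean((ref_db - proc_db) ** 2, axis=1))
    return float(np.mean(per_frame))


# Synthetic stand-ins for the original and the processed speech.
rng = np.random.default_rng(0)
original = rng.standard_normal(16000)
degraded = original + 0.1 * rng.standard_normal(16000)
print("RMS-LSD [dB]:", rms_lsd(original, degraded))
```

A value of zero means the two spectra are identical, which matches the reading above that a lower RMS-LSD indicates speech closer to the original.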

5. Experiments on the English Database

[Figure 15: scatter plots of EER (%) versus (a) PESQ, (b) STOI, and (c) RMS-LSD for (E)Up, (F)Shift, (G)Lpas, and (H)N-BWE on the development and evaluation sets]

Fig. 15: Relationships between objective results and their EERs (i-vector)


[Figure 16: scatter plots of EER (%) versus (a) PESQ, (b) STOI, and (c) RMS-LSD for (E)Up, (F)Shift, (G)Lpas, and (H)N-BWE on the development and evaluation sets]

Fig. 16: Relationships between objective results and their EERs (x-vector)

6 Conclusion

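This thesis aimed to evaluate the effect of N-BWE on speaker verification systems based on i-vector/PLDA and x-vector/PLDA. Resolving a sampling-frequency mismatch between the enrollment and verification stages by downsampling the enrollment speech and retraining the system is undesirable because it is very costly. This thesis therefore applied bandwidth extension only to the band-limited speech at the verification stage and investigated how N-BWE and other bandwidth extension methods affect i-vector/PLDA and x-vector/PLDA speaker verification systems. The experimental results confirmed that speech processed with N-BWE yields a better EER in speaker verification than upsampled speech or speech processed with the other bandwidth extension methods. The x-vector system showed the same tendency as the i-vector system, which demonstrates that N-BWE is also effective for x-vector-based speaker verification.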
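For readers unfamiliar with the metric, the EER referred to above is the operating point at which the false acceptance rate equals the false rejection rate. The following is a minimal Python sketch of that computation from trial scores. It assumes that a higher score means "more likely the same speaker"; the function name `compute_eer` and the example scores are hypothetical and are not results from this thesis.

```python
import numpy as np


def compute_eer(target_scores, nontarget_scores):
    """Return the EER as a fraction in [0, 1].

    target_scores:    scores of same-speaker trials
    nontarget_scores: scores of different-speaker trials
    """
    target = np.asarray(target_scores, dtype=float)
    nontarget = np.asarray(nontarget_scores, dtype=float)
    # Every observed score is tried as a decision threshold.
    thresholds = np.sort(np.concatenate([target, nontarget]))
    # False rejection: a target trial scores below the threshold.
    frr = np.array([np.mean(target < t) for t in thresholds])
    # False acceptance: a non-target trial scores at or above the threshold.
    far = np.array([np.mean(nontarget >= t) for t in thresholds])
    # Pick the threshold where the two error rates are closest.
    idx = int(np.argmin(np.abs(far - frr)))
    return float((far[idx] + frr[idx]) / 2.0)


# Hypothetical scores, for illustration only.
example_target = [0.9, 0.8, 0.7, 0.2]
example_nontarget = [0.1, 0.3, 0.4, 0.6]
print("EER (%):", 100.0 * compute_eer(example_target, example_nontarget))
```

The threshold sweep here is quadratic in the number of trials, which keeps the sketch short. Toolkits used for large trial lists typically sort the scores once instead.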

7 Acknowledgments

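This research was carried out in the Information and Communication Systems Course, Faculty of System Design, Tokyo Metropolitan University, with the cooperation and guidance of many people. First, I thank my supervisors, Professor Hitoshi Kiya and Assistant Professor Sayaka Shiota, for their guidance and advice on all aspects of this research, including its writing, progress, and presentation. Assistant Professor Sayaka Shiota in particular guided me not only in this research but also in many other respects, such as preparing conference manuscripts and presentation materials. I express my deep gratitude to them. I am also deeply grateful to Professor Nobutaka Ono and Professor Yasufumi Takama for their valuable advice and guidance through the review of this thesis. I thank the senior members and peers of the laboratory who supported me during my studies. Finally, I close these acknowledgments with deep gratitude to my parents, who watched over my student life and patiently supported me.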


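Publications

[1] Ryota Kaminishi, Sayaka Shiota, and Hitoshi Kiya, "A study on a non-linear bandwidth extension method and filter design for i-vector-based speaker verification," IEICE Technical Committee on Speech, vol. 117, no. 189, pp. 29-32, August 30, 2017 (in Japanese).

[2] Ryota Kaminishi, Sayaka Shiota, and Hitoshi Kiya, "A non-linear bandwidth extension method considering aliasing for i-vector-based speaker verification and its evaluation with telecommunication speech," Acoustical Society of Japan Spring Meeting, no. 2-Q-2, pp. 135-136, March 14, 2018 (in Japanese).

[3] Ryota Kaminishi, Sayaka Shiota, and Hitoshi Kiya, "Evaluation of a non-linear bandwidth extension method in i-vector/PLDA-based speaker verification," IPSJ SIG on Spoken Language Processing, vol. 2018-125, no. 14, December 10, 2018 (in Japanese).

[4] Ryota Kaminishi, Sayaka Shiota, and Hitoshi Kiya, "Evaluation of a non-linear bandwidth extension method in x-vector-based speaker verification," IEICE Technical Committee on Speech, March 15, 2019 (in Japanese).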
