Vol.2015-SLP-105 No.3 2015/2/27. IPSJ SIG Technical Report.

English-Read-by-Japanese Speech Synthesis Preserving Speaker Individuality Based on Partial Correction of Prosody and Phonetic Sounds, and Effects of English Proficiency Level on Its Performance

Yuji Oshima1, Shinnosuke Takamichi1, Tomoki Toda1, Sakriani Sakti1, Graham Neubig1, Satoshi Nakamura1

Abstract: Cross-lingual speech synthesis between Japanese and English based on voice conversion or HMM-based speech synthesis tends to produce speech with weaker speaker individuality than intra-lingual synthesis. To address this issue, we have proposed a method that uses English Read by Japanese (ERJ) speech to preserve speaker individuality, together with a prosody correction method that improves naturalness by correcting the prosodic errors of ERJ speech. However, the method has not been evaluated by native English listeners, the effect of each speaker's English proficiency level on its performance has not been investigated, and the incorrect phonetic sounds of ERJ speech, another cause of low naturalness, have not been addressed. In this paper, we investigate how the listener's native language and the speaker's English proficiency affect prosody correction by applying the method to multiple speakers with various proficiency levels, and we propose a phonetic correction method based on spectrum swapping for unvoiced consonants. The experimental results demonstrate that (1) the naturalness improvement from power correction is especially clear in evaluation by native English listeners; (2) prosody correction improves the naturalness of ERJ synthetic speech regardless of English proficiency level; and (3) the proposed phonetic correction method is also effective for further improving naturalness.

Keywords: English-Read-by-Japanese (ERJ), HMM-based speech synthesis, prosody correction, phonetic correction, speaker individuality.

ⓒ 2015 Information Processing Society of Japan.

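1. Introduction

Cross-lingual speech synthesis reflects the individuality of a speaker of one language in synthetic speech of another language. Because speaker individuality helps listeners identify the source of information, the technique promotes smooth communication. In Japan in particular, demand for synthesis between Japanese and English is high, and applications to speech translation systems, dubbing of foreign films, and CALL systems [1] are expected.

Methods that convert the speech of a native English speaker toward a target speaker, using highly natural data such as bilingual speech or Japanese speech, have been widely studied [5], [6], [7] within statistical voice conversion [2] and speaker adaptation [4] for speech synthesis based on hidden Markov models (HMMs) [3]. These methods can synthesize English speech with comparatively high naturalness, but they tend to degrade speaker individuality compared with intra-lingual synthesis [5].

To address this, we have proposed a method that builds the model from English Read by Japanese (ERJ) speech [8] and corrects the prosodic errors of ERJ speech, improving naturalness while strongly reflecting speaker individuality [9]. However, that method was evaluated only by native Japanese listeners and only for a small number of speakers, so the effects of the listener's native language and of the speaker's English proficiency have not been investigated. In addition, phonetic errors, another cause of the low naturalness of ERJ speech, were not considered, which limits the achievable improvement.

In this paper, we investigate how the listener's native language and the speaker's English proficiency affect prosody correction, and we propose a phonetic correction method based on spectrum swapping for unvoiced consonants. The experimental evaluation shows that (1) the naturalness improvement from power correction is especially clear in evaluation by native English listeners, (2) prosody correction improves naturalness regardless of English proficiency, and (3) phonetic correction is also effective for improving naturalness.

2. Adaptation techniques in HMM-based speech synthesis

Fig. 1 gives an overview of model adaptation in HMM-based speech synthesis. HMM-based speech synthesis models the spectral parameters, excitation parameters, and state durations of speech in a unified HMM framework [10]. The output probability distribution of class c obtained by context clustering is

\[ b_c(\boldsymbol{o}_t) = \mathcal{N}(\boldsymbol{o}_t; \boldsymbol{\mu}_c, \boldsymbol{\Sigma}_c) \tag{1} \]

where \( \boldsymbol{o}_t = [\boldsymbol{c}_t^\top, \Delta\boldsymbol{c}_t^\top, \Delta\Delta\boldsymbol{c}_t^\top]^\top \) is the joint vector of the static feature \( \boldsymbol{c}_t \) at time t and its first- and second-order dynamic features \( \Delta\boldsymbol{c}_t \) and \( \Delta\Delta\boldsymbol{c}_t \), and \( \mathcal{N}(\cdot; \boldsymbol{\mu}_c, \boldsymbol{\Sigma}_c) \) is a normal distribution with mean \( \boldsymbol{\mu}_c \) and covariance matrix \( \boldsymbol{\Sigma}_c \).

With model adaptation, the HMM of a target speaker can be built from the HMM of another speaker. Given a pre-trained source model and adaptation data from the target speaker, the parameters of the source model are transformed to obtain a model adapted to the target speaker. The adapted mean vector \( \hat{\boldsymbol{\mu}}_c \) and covariance matrix \( \hat{\boldsymbol{\Sigma}}_c \) are computed as

\[ \hat{\boldsymbol{\mu}}_c = \boldsymbol{A}\boldsymbol{\mu}_c + \boldsymbol{b} \tag{2} \]

\[ \hat{\boldsymbol{\Sigma}}_c = \boldsymbol{A}\boldsymbol{\Sigma}_c\boldsymbol{A}^\top \tag{3} \]

where the adaptation matrix \( \boldsymbol{A} \) and the bias vector \( \boldsymbol{b} \) are regression parameters estimated for each regression class, to which multiple distributions belong. Because the spectral parameters, excitation parameters, and state durations are all modeled with normal distributions, adaptation is applied to all of them, so prosodic features are adapted together with segmental features.

At synthesis time, a sentence HMM is built from the context obtained by analyzing the input text. State durations are determined by maximizing the likelihood of the duration model, parameters are then generated by maximizing the HMM likelihood under the explicit constraint between static and dynamic features [11], and the speech waveform is produced by a vocoder.

[Fig. 1 Model adaptation in HMM-based speech synthesis. The source speaker's model (spectral, excitation, and duration models) is adapted to the target speaker's model with adaptation data, using Eqs. (2) and (3).]

3. Prosody and phonetic correction for ERJ speech synthesis

3.1 Prosody correction based on model adaptation

Fig. 2 gives an overview of the prosody correction method based on model adaptation. First, a speaker-dependent HMM of a native English speaker is trained on that speaker's English speech. The speech parameters used as observations are log power, spectral envelope parameters, and excitation parameters; an output probability distribution is obtained for each parameter, along with state duration distributions. Next, to build an HMM for English synthesis that reflects the individuality of the target native Japanese speaker, the native English speaker's HMM is adapted with the target speaker's ERJ speech. Duration and power are regarded as the factors that degrade the naturalness of ERJ speech, so only the model parameters other than state duration and log power are adapted. The resulting ERJ HMM keeps the prosody of the native English speaker. This adaptation makes it possible to synthesize ERJ speech with improved naturalness while preserving the individuality of the target native Japanese speaker as far as possible [9].

1 Graduate School of Information Science, Nara Institute of Science and Technology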
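The affine transform of Eqs. (2) and (3), combined with the selective adaptation of Section 3.1, can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes one Gaussian per stream, treats log power as its own stream for simplicity, and uses hypothetical names (`adapt_gaussian`, `adapt_model`, the stream keys, and the toy numbers). In practice the transform is estimated per regression class from adaptation data.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def adapt_gaussian(mean, cov, A, b):
    """Eq. (2): mean' = A mean + b.  Eq. (3): cov' = A cov A^T."""
    new_mean = [sum(a * m for a, m in zip(row, mean)) + bias
                for row, bias in zip(A, b)]
    A_transposed = [list(col) for col in zip(*A)]
    new_cov = matmul(matmul(A, cov), A_transposed)
    return new_mean, new_cov


def adapt_model(native_model, transforms, frozen=("duration", "log_power")):
    """Adapt every stream except the frozen prosody streams.

    Frozen streams keep the native English speaker's parameters, which is
    the idea behind the "Dur.+Pow." condition (stream names are hypothetical).
    """
    adapted = {}
    for stream, (mean, cov) in native_model.items():
        if stream in frozen:
            adapted[stream] = (list(mean), [list(row) for row in cov])
        else:
            A, b = transforms[stream]
            adapted[stream] = adapt_gaussian(mean, cov, A, b)
    return adapted


# Toy native-speaker model: stream name -> (mean vector, covariance matrix).
native_model = {
    "duration": ([5.0], [[1.0]]),
    "log_power": ([-2.0], [[0.5]]),
    "mel_cepstrum": ([1.0, 2.0], [[1.0, 0.0], [0.0, 1.0]]),
}
# Hypothetical transform toward the ERJ speaker (normally estimated from data).
transforms = {"mel_cepstrum": ([[2.0, 0.0], [0.0, 2.0]], [0.5, -0.5])}

adapted = adapt_model(native_model, transforms)
print(adapted["mel_cepstrum"])  # spectral stream moves toward the target speaker
print(adapted["duration"])      # duration stream keeps the native parameters
```

Passing `frozen=("duration",)` would correspond to the "Dur." condition, and `frozen=()` to "Adapt", where all parameters are adapted.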

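Table 1 Synthetic speech samples used for evaluation.

| Method | Training data | Adaptation data | Prosody correction | Phonetic correction |
|---|---|---|---|---|
| ERJ | ERJ speech | – | None | None |
| HMM+VC (2) | Native English speech | – | – | – |
| Adapt | Native English speech | ERJ speech | None | None |
| Dur. | Native English speech | ERJ speech | State duration | None |
| Dur.+Pow. | Native English speech | ERJ speech | State duration, log power | None |
| Dur.+Pow.+UVC | Native English speech | ERJ speech | State duration, log power | Unvoiced consonant spectrum |
| Native | Native English speech | – | – | – |

[Fig. 2 An overview of the prosody correction method based on model adaptation technique. Native HSMMs trained on native English speech are adapted with ERJ speech for the spectrum and excitation streams only, while power and duration are kept, giving ERJ HSMMs with modified prosody.]

3.2 Phonetic correction based on unvoiced consonant spectrum swapping

Fig. 3 shows the procedure of the phonetic correction method based on unvoiced consonant spectrum swapping. The proposed method corrects the phonetic sounds of ERJ speech by partially using the spectral parameters of the native English speaker. Pitch and vowels strongly affect the perception of speaker individuality [12], whereas unvoiced consonants are expected to be less speaker-dependent. Swapping the unvoiced consonants of ERJ speech is therefore expected to improve naturalness while preserving speaker individuality.

First, speech parameters are generated from both the native English speaker's HMM and the prosody-corrected ERJ HMM. Because the two HMMs share the same duration model, the generated parameter sequences are aligned in time. Next, within the spectral parameter sequence of the native Japanese speaker, only the frames corresponding to unvoiced consonants are replaced with the native English speaker's spectral parameters. To avoid the quality degradation caused by a mismatch between the swapped spectrum and the original voiced/unvoiced information, a frame of an unvoiced consonant is not swapped if the native English speaker's F0 is voiced in that frame.

[Fig. 3 An overview of the phoneme correction method based on spectrum swapping of the unvoiced consonants. Spectral parameters generated from the native HSMMs replace those of the ERJ HSMMs with modified prosody in unvoiced consonant frames; the result is combined with the ERJ excitation parameters to synthesize phoneme-modified speech of the ERJ speaker.]

4. Experimental evaluation

4.1 Experimental conditions

The training data are the 593 sentences of set A of the CMU ARCTIC speech database [13], uttered by one male and one female native English speaker. The evaluation data are the 50 sentences of set B. The sampling frequency of the training, evaluation, and adaptation data is 16 kHz. Speech parameters are extracted with STRAIGHT analysis [14]. The spectral features are log power and the 1st to 24th mel-cepstral coefficients; the excitation features are log F0 and average aperiodicity in five frequency bands. The frame shift is 5 ms. The observation vector consists of these parameters with their first- and second-order dynamic features, and five-state left-to-right HSMMs [15] are trained. Log power and the mel-cepstral coefficients are trained in the same stream. Model adaptation uses CSMAPLR+MAP [16], with a block-diagonal regression matrix whose blocks correspond to the static, first-order dynamic, and second-order dynamic features. The source model for adaptation is the HMM trained on the native English speaker of the same gender as the adaptation speaker.

To evaluate the proposed correction methods, subjective evaluations of speaker individuality, naturalness, and intelligibility are carried out on speech synthesized by the methods in Table 1.

4.2 Effect of prosody correction based on model adaptation

4.2.1 Effect of the listener's native language on prosody correction

The target speakers are two male native Japanese speakers in their twenties. One is a graduate student with no experience of studying abroad who received standard English education in Japan ("Monolingual"). The other is a university student who studied in Australia for one year and has high English proficiency ("Bilingual"). The 593 sentences of set A of the ARCTIC database, uttered by these two speakers, are used as adaptation data.

Speaker individuality is evaluated with a five-point DMOS (Degradation Mean Opinion Score) test in which the reference is analysis-synthesized Japanese speech of the target native Japanese speaker.

2 Based on the conventional method [5], except that one-to-one voice conversion using ERJ speech is used instead of one-to-many conversion: GMM-based statistical voice conversion is applied to the speech parameters output by the native English speaker's speaker-dependent HSMM.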
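The frame-wise swap rule of Section 3.2 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the phoneme inventory `UNVOICED_CONSONANTS`, the convention that an F0 value of 0 marks an unvoiced frame, and the function name `swap_unvoiced_spectra` are all hypothetical, and spectral frames are shown as one-element lists for brevity.

```python
# Assumed set of unvoiced consonant labels (ARPAbet-like, for illustration only).
UNVOICED_CONSONANTS = {"p", "t", "k", "f", "th", "s", "sh", "ch", "hh"}


def swap_unvoiced_spectra(erj_spec, native_spec, frame_phonemes, native_f0):
    """Replace ERJ spectral frames with native frames in unvoiced consonants.

    All four sequences must be time-aligned, which holds when both models
    share the same duration model. A frame is swapped only when it belongs
    to an unvoiced consonant AND the native speaker's F0 is unvoiced there
    (assumption: F0 == 0.0 encodes an unvoiced frame).
    """
    if not (len(erj_spec) == len(native_spec) == len(frame_phonemes) == len(native_f0)):
        raise ValueError("all sequences must have the same number of frames")

    corrected = []
    for erj_frame, native_frame, phoneme, f0 in zip(
            erj_spec, native_spec, frame_phonemes, native_f0):
        is_unvoiced_consonant = phoneme in UNVOICED_CONSONANTS
        native_is_voiced = f0 > 0.0
        if is_unvoiced_consonant and not native_is_voiced:
            corrected.append(list(native_frame))  # take the native spectrum
        else:
            corrected.append(list(erj_frame))     # keep the ERJ spectrum
    return corrected


# Toy example: four frames labelled /s/, /s/, /ae/, /t/.
erj_spec = [[1.0], [1.1], [1.2], [1.3]]
native_spec = [[9.0], [9.1], [9.2], [9.3]]
frame_phonemes = ["s", "s", "ae", "t"]
native_f0 = [0.0, 120.0, 110.0, 0.0]  # second /s/ frame is voiced in the native speech

corrected = swap_unvoiced_spectra(erj_spec, native_spec, frame_phonemes, native_f0)
print(corrected)
```

In the toy example the second /s/ frame is left untouched because the native F0 is voiced there, which mirrors the guard against voiced/unvoiced mismatch described in Section 3.2.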

The methods compared in the DMOS (Degradation Mean Opinion Score) test of speaker individuality are the five methods "ERJ", "HMM+VC", "Adapt", "Dur.", and "Dur.+Pow.". Naturalness is evaluated with a five-point MOS (Mean Opinion Score) test on the naturalness of the English speech, comparing the six methods "ERJ", "HMM+VC", "Adapt", "Dur.", "Dur.+Pow.", and "Native". Each test uses a stimulus set prepared for each target speaker and is carried out by six native Japanese listeners and six native English listeners.

Figs. 4 and 5 show the results for speaker individuality and naturalness, respectively (3).

[Fig. 4 Results of subjective evaluation of individuality for prosody correction method. Panels: (a) evaluated by Japanese speakers, (b) evaluated by English speakers. Scores from 1 to 5 with 95% confidence intervals for ERJ, HMM+VC, Adapt, Dur., and Dur.+Pow., grouped by target speaker (Bilingual, Monolingual).]

[Fig. 5 Results of subjective evaluation of naturalness for prosody correction method. Panels: (a) evaluated by Japanese speakers, (b) evaluated by English speakers. Scores from 1 to 5 with 95% confidence intervals for ERJ, HMM+VC, Adapt, Dur., Dur.+Pow., and Native, grouped by target speaker (Bilingual, Monolingual).]

We first look at the effect of the listener's native language on the methods without correction ("ERJ" and "Adapt"). Comparing Fig. 4 (a) and (b), the individuality scores are similar for listeners with different native languages. Comparing Fig. 5 (a) and (b), in contrast, the naturalness scores given by native English listeners tend to be much lower than those given by native Japanese listeners. Next, for the methods that reflect the native English speaker's power ("HMM+VC" and "Dur.+Pow."), the naturalness scores in Fig. 5 rise relative to the other methods when the listeners are native English speakers rather than native Japanese speakers. These results are thought to arise because native English listeners are more sensitive than native Japanese listeners to the rhythm and stress of English speech.

In the naturalness evaluation by both listener groups, "Dur." and "Dur.+Pow." obtain higher scores than the other methods. In the individuality evaluation, "Dur." and "Dur.+Pow." preserve individuality as well as "ERJ". These results confirm the effectiveness of the proposed prosody correction method.

In summary, the evaluation results differ between native Japanese and native English listeners, with native English listeners being more sensitive to the prosody of English speech, and the proposed duration and power correction can synthesize English speech that native English listeners find natural while preserving the individuality of the native Japanese speaker.

4.2.2 Effect of the speaker's English proficiency on prosody correction

The adaptation data are 60 TIMIT [17] sentences uttered by four speakers in total (male and female) who have the highest ("High") or lowest ("Low") English proficiency scores in the database of English read by Japanese students [8]. The proficiency score used here is the average of the ratings for the multiple criteria defined in the database (phoneme production, rhythm, and so on). The evaluation follows Section 4.2.1, except that the reference in the individuality test is analysis-synthesized ERJ speech of the target native Japanese speaker. The individuality test compares the three methods "HMM+VC", "Adapt", and "Dur.+Pow."; the naturalness test compares the four methods "HMM+VC", "Adapt", "Dur.+Pow.", and "Native". Each test uses a stimulus set containing the speech of all native Japanese speakers and is carried out by six native English listeners.

Fig. 7 shows the individuality and naturalness results for prosody correction, aggregated by English proficiency level ("High" and "Low"). In the individuality results, "HMM+VC", which uses GMM-based voice conversion, tends to degrade individuality considerably for "Low" compared with "Adapt", which adapts all model parameters. The same tendency is seen for "High", although the degradation is smaller. In contrast, "Dur.+Pow.", which corrects duration and power, preserves individuality as well as "Adapt" regardless of English proficiency.

In the naturalness results, "Adapt" causes a large degradation for "Low" compared with "HMM+VC". "Dur.+Pow." avoids this degradation through prosody correction and achieves naturalness comparable to "HMM+VC" regardless of English proficiency. These results confirm that the proposed method works robustly regardless of English proficiency.

3 Fig. 4 (a) and Fig. 5 (a) are reproduced from [9].
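The MOS and DMOS figures report mean scores with 95% confidence intervals. The sketch below shows one common way to compute such a value from raw listener ratings. It assumes a normal approximation with the constant 1.96, which the paper does not specify, and the ratings and the function name `mean_with_ci95` are made up for illustration.

```python
import math
import statistics


def mean_with_ci95(scores):
    """Return (mean, half_width) of a 95% confidence interval.

    Uses the normal approximation mean +/- 1.96 * s / sqrt(n); this is an
    assumption, since a t-distribution could equally be used for small n.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores")
    mean = statistics.mean(scores)
    standard_error = statistics.stdev(scores) / math.sqrt(n)
    return mean, 1.96 * standard_error


# Hypothetical five-point ratings given to one method by six listeners.
ratings = [4, 3, 5, 4, 4, 3]
mos, half_width = mean_with_ci95(ratings)
print(f"MOS = {mos:.2f} +/- {half_width:.2f}")
```

Two methods whose intervals do not overlap can be read as clearly different in the bar charts, which is how statements such as "comparable to" and "much lower than" in this section are usually judged.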

Duration and power correction thus makes it possible to synthesize natural English speech while preserving the individuality of the ERJ speaker, and the correction is especially effective for speakers with low English proficiency.

[Fig. 6 Example of spectrograms of synthetic speech samples for a word fragment "consonants". Panels show HMM+VC, Dur.+Pow., Dur.+Pow.+UVC, and Native; horizontal axis: phonemes of the fragment; vertical axis: frequency, 0 to 8 kHz.]

[Fig. 7 Results calculated in each English proficiency level (prosody correction method). Panels: (a) Individuality, (b) Naturalness. Scores from 1 to 5 with 95% confidence intervals for HMM+VC, Adapt, Dur.+Pow., and (naturalness only) Native, grouped by High and Low.]

4.3 Effect of phonetic correction based on unvoiced consonant spectrum swapping

The adaptation data are 60 sentences of set A of the ARCTIC database uttered by the "Monolingual" and "Bilingual" speakers of Section 4.2.1, and 60 TIMIT sentences uttered by the four speakers belonging to "High" or "Low" in Section 4.2.2. Speaker individuality is evaluated with a preference test (XAB test) in which the reference is analysis-synthesized ERJ speech of the native Japanese speaker; the two methods compared are "Dur.+Pow." and "Dur.+Pow.+UVC". Naturalness is evaluated with a preference test (AB test) on the naturalness of the English speech; the three methods compared are "Dur.+Pow.", "Dur.+Pow.+UVC", and "Native". Each test uses a stimulus set containing the speech of all native Japanese speakers and is carried out by six native English listeners. The results are calculated for each English proficiency level, with "Monolingual" and "Bilingual" assigned to "Low" and "High", respectively.

Fig. 6 shows example spectrograms for each method. Compared with "Native", the shape of the spectral envelope of "Dur.+Pow." differs considerably in unvoiced consonant segments such as /s/, especially in the high-frequency region. This difference causes unnatural sounds when power correction is applied. With "Dur.+Pow.+UVC", the spectral envelope becomes similar to that of "Native", which mitigates the adverse effect of power correction, so an improvement in naturalness can be expected.

[Fig. 8 Results calculated in each English proficiency level (phoneme correction method). Panels: (a) Individuality, (b) Naturalness. Preference scores from 0 to 1 with 95% confidence intervals for Dur.+Pow., Dur.+Pow.+UVC, and (naturalness only) Native, grouped by High and Low.]

Fig. 8 shows the individuality and naturalness results for phonetic correction, aggregated by English proficiency level ("High" and "Low"). For "Low", "Dur.+Pow.+UVC" improves naturalness over "Dur.+Pow." while preserving individuality to the same degree. For "High", "Dur.+Pow.+UVC" retains naturalness and individuality comparable to "Dur.+Pow.". A t-test between "Dur.+Pow." and "Dur.+Pow.+UVC" found a significant difference only in naturalness for "Low" (p < .01).

These results show that, like prosody correction, the proposed phonetic correction is effective for improving naturalness, particularly for speakers with low English proficiency.

4.4 Evaluation of intelligibility

A dictation test on intelligibility is carried out for the proposed method. The evaluation data are 50 SUS [18] sentences, and the three methods compared are "HMM+VC", "Dur.+Pow.+UVC", and "Native". The test uses a stimulus set containing the speech of all native Japanese speakers and is carried out by six native English listeners. The results are calculated for each English proficiency level, with "Monolingual" and "Bilingual" assigned to "Low" and "High", respectively.
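Dictation tests of this kind are commonly scored by aligning each listener transcript with the reference sentence and counting word errors. The sketch below assumes the conventional definitions, word correct rate = (N - S - D) / N and word accuracy = (N - S - D - I) / N, with N reference words, S substitutions, D deletions, and I insertions; the paper does not state its scoring formulas, and the function names and example sentence are hypothetical.

```python
def align_counts(ref, hyp):
    """Return (substitutions, deletions, insertions) of a minimum-edit alignment.

    Each table cell holds (total_edits, S, D, I) for ref[:i] versus hyp[:j].
    """
    rows, cols = len(ref) + 1, len(hyp) + 1
    table = [[None] * cols for _ in range(rows)]
    table[0][0] = (0, 0, 0, 0)
    for i in range(1, rows):
        table[i][0] = (i, 0, i, 0)  # only deletions
    for j in range(1, cols):
        table[0][j] = (j, 0, 0, j)  # only insertions
    for i in range(1, rows):
        for j in range(1, cols):
            edits, s, d, ins = table[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (edits, s, d, ins)          # match
            else:
                diagonal = (edits + 1, s + 1, d, ins)  # substitution
            edits, s, d, ins = table[i - 1][j]
            deletion = (edits + 1, s, d + 1, ins)
            edits, s, d, ins = table[i][j - 1]
            insertion = (edits + 1, s, d, ins + 1)
            table[i][j] = min(diagonal, deletion, insertion)
    return table[-1][-1][1:]


def word_scores(ref, hyp):
    """Return (word correct rate, word accuracy) in percent."""
    n = len(ref)
    s, d, ins = align_counts(ref, hyp)
    return 100.0 * (n - s - d) / n, 100.0 * (n - s - d - ins) / n


# Hypothetical SUS-style reference and one listener transcript.
reference = "the table walked through the blue truth".split()
transcript = "the cable walked through blue truth again".split()

counts = align_counts(reference, transcript)
correct_rate, accuracy = word_scores(reference, transcript)
print(counts, round(correct_rate, 2), round(accuracy, 2))
```

Word accuracy is never higher than word correct rate because it also penalizes inserted words, so the two measures separate listeners who miss words from listeners who add spurious ones.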

[Fig. 9 Results of dictation test on intelligibility calculated in each English proficiency level. Vertical axis: recognition rate [%], 65 to 85; bars: word correct rate and word accuracy for HMM+VC, Dur.+Pow.+UVC, and Native, grouped by High and Low.]

Fig. 9 shows the dictation test results aggregated by English proficiency level ("High" and "Low"). For "Low", "Dur.+Pow.+UVC" improves intelligibility over "HMM+VC". This is thought to be because the phonetic correction of "Dur.+Pow.+UVC" restores the intelligibility of unvoiced consonant phonemes. For "High", "Dur.+Pow.+UVC" achieves intelligibility comparable to "HMM+VC" and higher than for "Low". Compared with "Native", "Dur.+Pow.+UVC" limits the degradation in word accuracy to about 5% for "High" and about 8% for "Low".

5. Conclusion

With the aim of improving naturalness while preserving speaker individuality in ERJ speech synthesis, this paper investigated how the listener's native language and the speaker's English proficiency affect the prosody correction method based on model adaptation, and proposed a phonetic correction method based on consonant spectrum correction. The experimental evaluation showed that (1) the naturalness improvement from power correction is especially clear in evaluation by native English listeners, (2) prosody correction improves naturalness regardless of English proficiency, and (3) phonetic correction is also effective for improving naturalness. Future work includes investigating correction methods optimized for the phonetic errors of each target speaker.

Acknowledgments: Part of this work was supported by the commissioned research "Research and Development of an Asian Medical Exchange Support System Based on a Knowledge and Language Grid" of the National Institute of Information and Communications Technology (NICT) and by JSPS KAKENHI Grant Number 26280060.

References

[1] Takamichi, S., Oshima, Y., Toda, T., Neubig, G., Sakti, S. and Nakamura, S.: A study of English learning support using speech synthesis techniques for English read by Japanese (in Japanese), Japanese Society for Information and Systems in Education, Vol. 29, No. 5, pp. 111–116 (2015).
[2] Toda, T., Black, A. W. and Tokuda, K.: Voice Conversion Based on Maximum-Likelihood Estimation of Spectral Parameter Trajectory, IEEE Trans. ASLP, Vol. 15, No. 8, pp. 2222–2235 (2007).
[3] Tokuda, K., Nankaku, Y., Toda, T., Zen, H., Yamagishi, J. and Oura, K.: Speech synthesis based on hidden Markov models, Proc. IEEE, Vol. 101, No. 5, pp. 1234–1252 (2013).
[4] Yamagishi, J. and Kobayashi, T.: Average-voice-based speech synthesis using HSMM-based speaker adaptation and adaptive training, IEICE Trans. Inf. and Syst., Vol. 90, No. 2, pp. 533–543 (2007).
[5] Hattori, N., Toda, T., Kawai, H., Saruwatari, H. and Shikano, K.: Speaker-adaptive speech synthesis based on eigenvoice conversion and language-dependent prosodic conversion in speech-to-speech translation, Proc. INTERSPEECH, pp. 2769–2772 (2011).
[6] Liang, H., Qian, Y., Soong, F. K. and Liu, G.: A cross-language state mapping approach to bilingual (Mandarin-English) TTS, Proc. ICASSP, pp. 4641–4644 (2008).
[7] Qian, Y., Xu, J. and Soong, F. K.: A frame mapping based HMM approach to cross-lingual voice transformation, Proc. ICASSP, pp. 5120–5123 (2011).
[8] Minematsu, N., Tomiyama, Y., Yoshimoto, K., Shimizu, K., Nakagawa, S., Dantsuji, M. and Makino, S.: Development of English Speech Database Read by Japanese to Support CALL Research, Proc. ICA, Vol. 1, pp. 557–560 (2004).
[9] Oshima, Y., Takamichi, S., Toda, T., Neubig, G., Sakti, S. and Nakamura, S.: Prosody correction preserving speaker individuality in HMM-based English-read-by-Japanese speech synthesis (in Japanese), IEICE Technical Report, Vol. 114, No. 365, pp. 63–68 (2014).
[10] Yoshimura, T., Tokuda, K., Masuko, T., Kobayashi, T. and Kitamura, T.: Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis (in Japanese), IEICE Trans., Vol. J83-D2, No. 11, pp. 2099–2107 (2000).
[11] Tokuda, K., Yoshimura, T., Masuko, T., Kobayashi, T. and Kitamura, T.: Speech Parameter Generation Algorithms for HMM-based Speech Synthesis, Proc. ICASSP, Vol. 3, pp. 1315–1318 (2000).
[12] Kitamura, T. and Akagi, M.: Speaker Individualities in Speech Spectral Envelopes, Proc. ICSLP, Vol. 3, pp. 1183–1186 (1994).
[13] Kominek, J. and Black, A. W.: CMU ARCTIC databases for speech synthesis, CMU Language Technologies Institute, Technical report, CMU-LTI-03-177 (2003).
[14] Kawahara, H., Masuda-Katsuse, I. and de Cheveigné, A.: Restructuring Speech Representations Using a Pitch-adaptive Time-frequency Smoothing and an Instantaneous-frequency-based F0 Extraction: Possible Role of a Repetitive Structure in Sounds, Speech Commun., Vol. 27, No. 3-4, pp. 187–207 (1999).
[15] Zen, H., Tokuda, K., Masuko, T., Kobayashi, T. and Kitamura, T.: Hidden Semi-Markov Model Based Speech Synthesis System, IEICE Trans. Inf. and Syst., E90-D, Vol. 90, No. 5, pp. 825–834 (2007).
[16] Yamagishi, J., Nose, T., Zen, H., Ling, Z.-H., Toda, T., Tokuda, K., King, S. and Renals, S.: Robust Speaker-Adaptive HMM-Based Text-to-Speech Synthesis, IEEE Trans. ASLP, Vol. 17, No. 6, pp. 1208–1230 (2009).
[17] Garofolo, J. S., Lamel, L. F., Fisher, W. M., Fiscus, J. G. and Pallett, D. S.: DARPA TIMIT acoustic-phonetic continous speech corpus, Technical report, NISTIR 4930, NIST, Gaithersburg, MD (1993).
[18] Benoît, C., Grice, M. and Hazan, V.: The SUS test: A method for the assessment of text-to-speech synthesis intelligibility using Semantically Unpredictable Sentences, Speech Communication, Vol. 18, No. 4, pp. 381–392 (1996).
