A Proof of Concept for Improving Speech Recognition Accuracy by Utilizing the Constraints of Communication-field Mechanisms
Improving Speech Recognition by Utilizing Communication-field Mechanism Constraints
Masui Hirofumi, Nakabayashi Kazuki, Taniguchi Tadahiro
College of Information Science and Engineering, Ritsumeikan University
Communication-field mechanism design includes rules and incentives to indirectly control the communication of a group of people, e.g., discussion, debate, meeting, and consultation, by introducing constraints to the communication process. Similarly, we hypothesize that such constraints are beneficial for the application of speech recognition technologies based on artificial intelligence. In this paper, we evaluate this hypothesis by using an automatic speech recognition system with Dealing Rights to Speak (DRS) as a proof of concept. Our experimental results show that the introduction of DRS can effectively improve the performance of the speech recognition system.
1. Introduction
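Mechanism design of communication fields is an approach that aims to improve communication fields through the institutional design of rules and incentives [Taniguchi 11, Taniguchi 19]. A communication-field mechanism changes the speaking behavior of the participants in that field through its constraints. In other words, these constraints give a kind of structure to communication behavior such as participants' speaking time and utterance content. Because utterances become structured, the performance of artificial intelligence technologies such as speech recognition and language processing may improve.

This paper aims to provide a proof of concept for this scenario, in which the constraints of a communication-field mechanism are used to improve the accuracy of artificial intelligence technology. Taking Dealing Rights to Speak (DRS) and speech recognition technology as an example, we empirically verify the validity of the scenario through an experiment with human participants.

2. Research Background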
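Dealing Rights to Speak (DRS) is a mechanism proposed by Koga et al. for improving discussions; it brings participants' speaking times closer to equal while leaving them free to speak more when necessary (see [Koga 14] for details). In DRS, cards called rights to speak are distributed to the participants, and ownership of speaking time is controlled as each participant exchanges and uses the cards in an autonomous, decentralized manner. This is reported to reduce the burden on the moderator, promote the equalization of speaking time, increase utterances that express intentions and reasons, and raise the efficiency of the discussion as a whole [Koga 14]. The aspect we focus on in this study is that, as a consequence of the mechanism, each participant's speech is delimited at fixed time intervals and, in principle, utterances no longer overlap as the discussion proceeds. As a side effect, automatic speech recognition and the automatic creation of meeting minutes are expected to become easier.

2.1 Speech Recognition Technology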
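Speech recognition technology is one of the major technological achievements of artificial intelligence. Speech recognition from spontaneous speech is in practical use in smart speakers, smartphones, and similar devices. However, it is said to remain difficult to obtain sufficient accuracy when applying it to everyday multi-party conversation, beginning with the automatic creation of minutes for meetings and casual discussions.

A typical problem that arises when a speech recognizer is applied to multi-party conversation is simultaneous speech: when several participants speak at the same time, their voices overlap and recognition becomes difficult. Speaker separation technology is a possible solution∗1, but unlike speech recognition technology, it has not yet reached the point where anyone can easily use it in an off-the-shelf product.

In this paper, to present a case in which mechanism design of communication fields contributes to the use of artificial intelligence technology, we show that introducing DRS into discussions suppresses simultaneous speech and thereby contributes to the use of a speech recognizer for creating minutes, without any technical improvement to the speech recognizer itself.

Contact: Masui Hirofumi, College of Information Science and Engineering, Ritsumeikan University, [email protected]

3. Experiment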
3.1 Purpose of the Experiment
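In this experiment, we apply an off-the-shelf speech recognizer to transcribe discussions conducted with DRS and discussions conducted without any particular constraints, and clarify the effect of introducing DRS. In DRS, the right to speak is managed by the constraints of the mechanism, so there are fewer occasions on which several participants speak at once. Because of this property, using DRS is expected to suppress simultaneous speech naturally and, as a result, to improve speech recognition accuracy. We conducted the experiment to demonstrate this empirically.

3.2 Experimental Conditions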
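This experiment compares discussions conducted with DRS against discussions conducted as free discussion. To manage the rights to speak in DRS, we used a smartphone application developed in previous work [Masui 19]∗2. In free discussion, participants were told only the topic and were asked to discuss it freely.

Each conversation was recorded with a microphone placed at the center of the participants∗3. We call this the "omnidirectional recording". In addition, for comparison of the experimental results, we attached a pin microphone to each participant to imitate a recording with proper speaker separation, and collected audio data in which only each speaker's own speech is recorded. We call this the "pin-microphone recording".

∗1 For example, [Nakadai 10].
∗2 DRS app: https://apps.apple.com/us/app/発話権取引アプリ/id14492300801
The 34th Annual Conference of the Japanese Society for Artificial Intelligence, 2020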
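Each recording is transcribed in two ways: (1) automatically, by speech recognition using the Google Cloud Speech API∗4, and (2) manually. We regard (1) as the transcript produced by a standard speech recognizer and (2) as the ground truth.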
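For the quantitative evaluation, all recognition results were converted to hiragana, and evaluation was based on the character error rate (CER) computed on the hiragana text.

Points requiring care in the experiment are the selection of participants, the selection of topics, and order effects caused by participants becoming accustomed to the experiment. We therefore chose the participants, topics, and session order with these in mind. The detailed experimental conditions will be reported in the oral presentation.

The overall procedure of the experiment is as follows.

1. Four participants hold 6-minute discussions under DRS and under free discussion, and the discussions are recorded. We prepare four groups, each of which does this four times. In the end, 16 recordings are obtained for each of DRS and free discussion.
2. When recording, a voice recorder that can capture everyone's conversation is placed at the center of the table to obtain the omnidirectional recording. At the same time, each participant wears a directional pin microphone to obtain the pin-microphone recording, which we regard as speaker-separated data.
3. The audio data are collected and transcribed both by the Google Cloud Speech API and manually.
4. Taking the manual transcripts as the ground truth, we compare the Google Cloud Speech API transcripts with the manual transcripts to assess the speech recognition accuracy of the transcription.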
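Step 4 compares the automatic and manual transcripts character by character. The text does not spell out the CER formula, so the following is a minimal sketch assuming the common definition: the Levenshtein edit distance between the two strings divided by the length of the reference. The function names `edit_distance` and `cer` are illustrative, and both inputs are assumed to have been converted to hiragana beforehand.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance (substitutions, insertions, deletions) between two strings."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion from ref
                            curr[j - 1] + 1,      # insertion into ref
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate of a hypothesis against a non-empty reference transcript."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

For example, `cer("こんにちは", "こんばんは")` gives 0.4, since two of the five reference characters are substituted.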
3.3 Discussion of Experimental Results
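First, we examine the main experimental result: whether introducing DRS improved speech recognition accuracy. Comparing the mean CER of the DRS condition and the free-discussion condition in the omnidirectional recordings, the CER of the DRS condition was significantly lower. This supports the hypothesis that introducing DRS improves the accuracy of the speech recognizer.

Next, we examine the degree of this improvement. If the effect of introducing DRS were comparable to that of speaker separation, the DRS condition should show recognition accuracy comparable to the conditions recorded with pin microphones. We therefore compare the DRS condition in the omnidirectional recordings with the two pin-microphone conditions. As a result, the CER of the pin-microphone conditions was significantly lower. This shows that although DRS improves speech recognition accuracy, the improvement does not reach that of nearly perfect speaker separation. On the other hand, there was no significant difference between the two pin-microphone conditions. Since the effect of DRS disappears under conditions in which speaker separation has been performed, this suggests that the improvement in speech recognition accuracy brought about by DRS is related to speaker separation.

From the above, we confirmed that using DRS can improve speech recognition accuracy. Detailed results and their discussion will be reported in the oral presentation.

∗3 For recording, we used the built-in microphone of a HUAWEI nova lite 2 smartphone with a voice recorder app.
∗4 For the Google Speech API, we used the default settings of the version available as of November 2019, when the study was conducted. The version used does not include the speaker diarization function. Cloud Speech-to-Text - Google Cloud: https://cloud.google.com/speech-to-text/?hl=ja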
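The comparisons above rest on differences in mean CER between conditions. The text does not state which statistical test was used; as one possible illustration, the sketch below runs a two-sided permutation test on two lists of per-recording CER values. The function name `permutation_test` and its parameters are hypothetical, and this is not necessarily the procedure used in the experiment.

```python
import random
from statistics import mean


def permutation_test(a, b, n_resamples=10000, seed=0):
    """Approximate two-sided p-value for a difference in means between samples a and b."""
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        # Randomly reassign the pooled values to two groups of the original sizes.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_resamples + 1)
```

Here `a` and `b` would be the CER values of the recordings in the two conditions being compared, for instance the 16 recordings per condition described in the procedure. A permutation test is chosen for the sketch only because it makes few distributional assumptions; it treats the two samples as independent.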
4. Conclusion
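In this paper, as a proof of concept for the scenario in which the constraints of a communication-field mechanism are used to improve the accuracy of artificial intelligence technology, we took DRS and speech recognition technology as an example and empirically verified the validity of the scenario. Using off-the-shelf microphones and an off-the-shelf speech recognizer, we recorded and recognized discussions under free-discussion and DRS conditions and compared them quantitatively. As a result, we found that speech recognition accuracy improves when DRS is used, compared with free discussion. However, the CER was lower still when pin microphones were used, so the recognition accuracy with DRS remains lower than that of pin-microphone recording, which assumes perfect speaker separation.

While some mechanism designs of communication fields are made possible by artificial intelligence technologies such as speech recognition and natural language processing, there are also, as examined in this paper, artificial intelligence technologies whose accuracy can be improved by the constraints of a communication-field mechanism. As argued in [Taniguchi 19], we believe that future research and development should proceed with this kind of co-evolutionary relationship in mind.

Acknowledgments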
This research was supported by the JST-Mirai Program, Grant Number JPMJMI17C7.

References
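[Nakadai 10] Nakadai, K., Takahashi, T., Okuno, H. G., Nakajima, H., Hasegawa, Y., and Tsujino, H.: Design and Implementation of Robot Audition System 'HARK' - Open Source Software for Listening to Three Simultaneous Speakers, Advanced Robotics, Vol. 24, No. 5-6, pp. 739–761 (2010)
[Masui 19] Masui, H., Oshima, T., and Taniguchi, T.: Analysis of Discussions Using a Mobile Application for Dealing Rights to Speak, Proceedings of the Annual Conference of JSAI, Vol. JSAI2019, pp. 2F4OS5b03–2F4OS5b03 (2019) (in Japanese)
[Koga 14] Koga, H. and Taniguchi, T.: Dealing Rights to Speak: Mechanism Design of Time Allocation in Discussions, Journal of Japan Industrial Management Association, Vol. 65, No. 3, pp. 144–156 (2014) (in Japanese)
[Taniguchi 11] Taniguchi, T. and Sudo, H.: Mechanism Design of Communication Fields: Bibliobattle and Dealing Rights to Speak as Case Studies, Systems, Control and Information, Vol. 55, No. 8, pp. 339–344 (2011) (in Japanese)
[Taniguchi 19] Taniguchi, T.: Construction and Development of a Systems Theory toward Mechanism Design of Communication Fields, Transactions of the Institute of Systems, Control and Information Engineers, Vol. 32, No. 12, pp. 417–428 (2019), invited paper (in Japanese)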