
Academic year: 2021


ZERO-SHOT CODE-SWITCHING ASR AND TTS WITH MULTILINGUAL MACHINE SPEECH CHAIN

Sahoko Nakayama 1,2, Andros Tjandra 1, Sakriani Sakti 1,2, Satoshi Nakamura 1,2

1 Nara Institute of Science and Technology, Japan

2 RIKEN, Center for Advanced Intelligence Project AIP, Japan

{nakayama.sahoko.nq1, andros.tjandra.ai6, ssakti, s-nakamura}@is.naist.jp

1. Introduction

Code-switching (CS): switching between languages within a conversation

CS challenge for ASR & TTS: the systems need to handle multilingual input

Existing approaches

• Developed for either ASR or TTS alone

• Focused on a single language pair

• Trained in a supervised fashion

Goal

• Handle the CS tasks of both ASR and TTS on multiple language pairs

• Train with semi-supervised learning

Zero-shot CS: handling an unknown CS language pair without directly learning it

5. Conclusion

• Introduced zero-shot CS ASR & TTS

• Proposed a multilingual machine speech chain with LID

• Improved performance on multilingual CS

• Also performed well on unknown CS without directly learning it

2. Proposed Method

Proposed Machine Speech Chain [Tjandra et al., 2017, 2018, 2019]: a closed-loop architecture that enables ASR & TTS to assist each other

Method: embedding language identification (LID) into the machine speech chain
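The closed loop can be illustrated with a small data-flow sketch. This is not the authors' implementation: `toy_asr_lid`, `toy_tts`, `toy_spkrec`, and the codebook are hypothetical stubs, speech is reduced to a list of numbers, and no parameters are trained. It only shows how unpaired speech and unpaired text can each yield a loss when ASR+LID and TTS feed each other, with ASR+LID emitting (character, language) pairs.

```python
from typing import List, Tuple

Token = Tuple[str, str]   # (character, language tag), mirroring y = ["chr", "lng"]
Speech = List[float]      # toy stand-in for acoustic features

# Hypothetical codebook: each (character, language) pair maps to one "acoustic" value.
CODEBOOK = {("h", "en"): 0.1, ("i", "en"): 0.2, ("k", "ja"): 0.6, ("o", "ja"): 0.7}
INVERSE = {value: token for token, value in CODEBOOK.items()}


def toy_tts(text: List[Token], z: float) -> Speech:
    """Stub TTS: look up each token and shift it by a scalar speaker embedding z."""
    return [CODEBOOK[token] + z for token in text]


def toy_spkrec(speech: Speech) -> float:
    """Stub SPKREC: always returns the same speaker embedding."""
    return 0.0


def toy_asr_lid(speech: Speech) -> List[Token]:
    """Stub ASR+LID: nearest-codebook decoding into (character, language) pairs."""
    return [INVERSE[min(INVERSE, key=lambda v: abs(v - frame))] for frame in speech]


def speech_only_loss(x: Speech) -> float:
    """Unpaired speech: ASR+LID transcribes x, TTS rebuilds it, loss compares speech."""
    y_hat = toy_asr_lid(x)
    z = toy_spkrec(x)
    x_hat = toy_tts(y_hat, z)
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def text_only_loss(y: List[Token], z: float = 0.0) -> float:
    """Unpaired text: TTS synthesises y, ASR+LID reads it back, loss compares text."""
    x_hat = toy_tts(y, z)
    y_hat = toy_asr_lid(x_hat)
    return sum(a != b for a, b in zip(y, y_hat)) / len(y)


cs_speech = [0.12, 0.58, 0.71]        # noisy code-switching "utterance"
cs_text = [("h", "en"), ("k", "ja")]  # code-switching text with language tags
print(speech_only_loss(cs_speech))
print(text_only_loss(cs_text))
```

In a real system both losses would be backpropagated into neural ASR and TTS models; here they are only computed to make the two directions of the loop concrete.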

[Figure: Speech chain architectures. (a) Basic speech chain: ASR and TTS. (b) Multi-speaker speech chain with speaker recognition (SPKREC). (c) Proposed multilingual speech chain: ASR+LID, TTS, and SPKREC. Diagram labels: x, y = "chr", y = ["chr", "lng"], z.]

[Figure: Training Process for Multilingual Speech Chain. (a) Supervised learning on monolingual data: En, Ja, Zh text and En, Ja, Zh speech (monolingual train & test set). (b) Unsupervised learning on code-switching data: En-Ja and Ja-Zh speech only, and En-Ja and Ja-Zh text only (CS train & test set). Zero-shot: Zh-En CS text and CS speech. Diagram labels: ASR+LID, TTS, SPKREC, x, y = ["chr", "lng"], z, Loss.]

3. ASR Evaluation

Model: Encoder-decoder attention ASR

Data: Monolingual BTEC; CS created from BTEC

I. Proposed model on zero-shot CS (Known language)

• Improved ASR in the multilingual CS test

• Performed well on an unknown CS test

                                  Monolingual (CER)       Code-switching (CER)
Model                             Ja      En      Zh      EnJaCS  JaZhCS  ZhEnCS
Baseline: Supervised training on monolingual data only
Ja25k+En25k+Zh25k (paired)        8.85%   8.48%   5.11%   14.06%  16.91%  16.04%
Proposed Machine Speech chain: Semi-supervised training on two CS data
+EnJaCS10k+JaZhCS10k (unpaired)   9.18%   12.71%  5.93%   11.56%  8.31%   10.52%
+EnJaCS10k+ZhEnCS10k (unpaired)   8.93%   12.34%  5.67%   11.18%  9.21%   9.71%
+ZhEnCS10k+JaZhCS10k (unpaired)   8.91%   14.45%  6.08%   11.85%  10.40%  11.29%
Topline: Supervised training on CS data
+EnJaCS10k+JaZhCS10k (paired)     10.18%  12.32%  7.93%   8.94%   6.70%   8.09%
+EnJaCS10k+ZhEnCS10k (paired)     11.04%  10.91%  7.48%   10.81%  7.26%   8.07%
+ZhEnCS10k+JaZhCS10k (paired)     10.98%  11.57%  7.22%   10.34%  7.72%   7.98%
+EnJaCS10k+JaZhCS10k+ZhEnCS10k    10.48%  10.43%  6.88%   8.68%   6.98%   8.05%

Legend: Language used as pair; Language used as unpair; Zero-shot CS

II. Proposed model on zero-shot CS (Unknown language)

[Figure: FrZh test; axis label (CER), ticks from 0% to 40%. Baseline [Ja25k+En25k+Zh25k (paired)]; Proposed [+EnJa10k+JaZh10k+EnFr5k (unpaired)]; Topline [+EnJa10k+JaZh10k+EnFr5k (paired)].]

• Also improved on zero-shot CS with unknown Fr language

III. Proposed model on zero-shot CS (CS natural speech)

• Improved even in the case of using CS natural speech (see paper for more details)

4. TTS Evaluation

Setup

Model: Tacotron TTS, DeepSpeaker SPKREC

[Figure: AB preference subjective evaluation on ZhEnCS, JaZhCS, and EnJaCS; axis from 0% to 100%; categories: Without LngEmb, Same, Proposed (With LngEmb).]

• Maintained TTS quality better

• Especially on the switch position between two languages
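The ASR results are reported as character error rate (CER). The poster does not spell out the computation, so the sketch below assumes the usual definition: character-level edit distance divided by the reference length. The helper names `edit_distance`, `cer`, and `relative_reduction` are illustrative only. The last lines apply the arithmetic to two values from the ASR table: ZhEnCS CER of 16.04% for the baseline and 10.52% for the model trained with unpaired EnJaCS and JaZhCS data.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline: float, proposed: float) -> float:
    """Fraction of the baseline error removed by the proposed system."""
    return (baseline - proposed) / baseline


# One substituted character in a four-character reference.
print(cer("abcd", "abxd"))

# ZhEnCS column: baseline 16.04% versus proposed 10.52% (values from the table).
print(round(relative_reduction(16.04, 10.52), 3))
```

On these two table entries the relative CER reduction works out to roughly 34%, which puts the "performed well on an unknown CS test" claim in concrete terms for a language pair not seen as CS data during training.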
