Incremental Machine Speech Chain Towards Enabling Listening while Speaking in Real-time

Academic year: 2021

(1)

Incremental Machine Speech Chain

Towards Enabling Listening while Speaking in Real-time

INTERSPEECH 2020

(2)

Outline

I. Introduction

II. Incremental Machine Speech Chain

III. Experiments

IV. Conclusion

(3)

I. Introduction


(4)

Background

ASR and TTS

• Spoken language technologies:

o Automatic speech recognition (ASR)

o Text-to-speech synthesis (TTS)

• Crucial for human-machine interaction

• Remarkable performance, but requires a lot of speech-text paired data

[Figure: ASR and TTS systems. ASR maps speech to the text “Hello”; TTS maps the text “Hello” to speech.]

(5)

Background

Machine Speech Chain

[Tjandra et al., 2017]

• Semi-supervised ASR and TTS training via a closed feedback loop

• ASR/TTS: standard attention-based seq2seq network

• 2 training phases:

1) ASR/TTS supervised independent training

2) ASR/TTS unsupervised joint training with feedback loop

• Full-utterance-based ASR and TTS → high delay

[Figure: Unrolled processes in the loop. ASR→TTS (speech only): the ASR output is passed to TTS and Loss_TTS is computed. TTS→ASR (text only): the TTS output is passed to ASR and Loss_ASR is computed. Each direction waits for the full ASR time plus the full TTS time.]
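The two unrolled directions of the basic machine speech chain can be written compactly as follows. This is only a notational sketch: the hat symbols for predicted quantities and the function-style names ASR(·) and TTS(·) are assumptions introduced here for illustration, not notation taken from the slides.

```latex
\begin{align}
\text{ASR}\rightarrow\text{TTS (speech only):}\quad
  & \hat{y} = \mathrm{ASR}(x), \quad \hat{x} = \mathrm{TTS}(\hat{y}), \quad
    \mathrm{Loss}_{TTS}(x, \hat{x}) \\
\text{TTS}\rightarrow\text{ASR (text only):}\quad
  & \hat{x} = \mathrm{TTS}(y), \quad \hat{y} = \mathrm{ASR}(\hat{x}), \quad
    \mathrm{Loss}_{ASR}(y, \hat{y})
\end{align}
```

In both directions the second model can only start after the first has consumed the full utterance, which is where the high delay comes from.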

(6)

Background

Human Speech Chain

Human speech chain [Denes, 1993]

• Feedback loop between speech production and hearing systems

• Real-time process → immediate adaptation

• Feedback delay causes a disturbance during speaking

Challenge in mimicking the human speech chain in a machine

• Speech generation or recognition, and feedback generation, based on incomplete sequence information with minimum delay

Proposal: Incremental Machine Speech Chain

(7)

II. Incremental Machine Speech Chain


(8)

Incremental Machine Speech Chain

Proposal

• Closed short-term feedback loop between incremental ASR (ISR) and incremental TTS (ITTS)

Objective

• Reduce feedback delay within machine speech chain training

• Improve ISR and ITTS learning quality

• Enable immediate feedback generation during inference

• Move a step closer to ASR and TTS that can adapt to real-time environments in an unsupervised manner, similar to humans

[Figure: Unrolled processes in the machine speech chain loop. Basic framework: ASR→TTS and TTS→ASR, each with a full-utterance delay. Incremental framework (proposed): ISR→ITTS and ITTS→ISR repeated segment by segment, each with a shorter delay.]

(9)

Incremental Machine Speech Chain

Components

Incremental ASR (ISR): low-delay ASR

• Hidden Markov model ASR

• End-to-end ISR with attention-based seq2seq model

o Neural transducer [Jaitly et al., 2016]

o Attention-transfer ISR [Novitasari et al., 2019]

Incremental TTS (ITTS): low-delay TTS

• Hidden Markov model TTS

• End-to-end ITTS with attention-based seq2seq model

o Neural ITTS [Yanagita et al., 2019]

o ITTS based on prefix-to-prefix framework [Ma et al., 2019]

• Performance limitation due to short-input-based processing

• Previous work: independent development of ISR and ITTS

[Figure: ISR and ITTS processing “good morning ...” segment by segment (“good”, “morning”, ...), with a short delay per segment.]

(10)

Incremental Machine Speech Chain

Training Mechanism

2 training phases:

1. ISR and ITTS supervised-independent training

2. ISR and ITTS joint training via short-term feedback loop

(11)

Incremental Machine Speech Chain Training

1. ISR and ITTS Independent Training

Incremental: predict a complete output sequence in N steps.

For each step n:

1. Encode a segment of input from the input window

2. Decode and predict a segment of output

3. Shift the input window

ISR and ITTS are trained by attention transfer from a standard non-incremental ASR [Novitasari et al., 2019], so the same alignment is used for ISR and ITTS.

[Figure: ISR and ITTS (each with encoder, attention, and decoder) at steps n = 1 and n = 2, for full speech X and full text Y = “a b c d e f”. ISR: input speech X_n (x1,…,x8 at n = 1; x9,…,x16 at n = 2) → output text Y_n (“a b c </m>” at n = 1; “d e </m>” at n = 2). ITTS: input text Y_n (“<m> a b c </m>” at n = 1; “<m> d e </m>” at n = 2) → output speech X_n (x1,…,x8 at n = 1; x9,…,x16 at n = 2). The alignment information (speech frame block ID, text token ID) comes from the attention alignment of a standard ASR.]
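The three per-step operations of incremental processing (encode a window, decode a segment, shift the window) can be sketched as a plain loop. This is a minimal illustration, not the authors' implementation: `incremental_decode`, `toy_model`, and the non-overlapping shift by exactly one window are assumptions made here, and the end-of-segment token `</m>` is not modeled.

```python
from typing import Callable, List, Sequence


def incremental_decode(
    inputs: Sequence[int],
    window: int,
    process_segment: Callable[[Sequence[int]], List[str]],
) -> List[str]:
    """Predict the full output in N steps, one input window per step."""
    outputs: List[str] = []
    start = 0
    while start < len(inputs):
        segment = inputs[start:start + window]    # 1. take a segment from the input window
        outputs.extend(process_segment(segment))  # 2. decode a segment of output
        start += window                           # 3. shift the input window (assumed: no overlap)
    return outputs


def toy_model(segment: Sequence[int]) -> List[str]:
    # Hypothetical stand-in for encoder + attention + decoder:
    # emits one token per segment, naming the frames it saw.
    return [f"tok[{segment[0]}..{segment[-1]}]"]


frames = list(range(1, 17))  # x1 .. x16, as in the two-step figure
print(incremental_decode(frames, window=8, process_segment=toy_model))
# prints ['tok[1..8]', 'tok[9..16]']
```

The same loop shape applies to ITTS with text segments as input and speech frames as output.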

(12)

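Incremental Machine Speech Chain Training

2. ISR and ITTS Joint Training

Short-term feedback loop between the components

• Segment-based output passing

Unrolled processes

a. ISR-to-ITTS

For each step n, ISR predicts $\hat{Y}_n$ from $X_n$, and then ITTS predicts $\hat{X}_n$ from the ISR output $\hat{Y}_n$.

[Figure: a. ISR-to-ITTS, unrolled over the full speech $X$. Step n = 1: ISR maps $x_{n=1}$ to $\hat{y}_{n=1}$ = “a b c”, ITTS maps $\hat{y}_{n=1}$ to $\hat{x}_{n=1}$, and the loss is $\mathrm{Loss}^{TTS}_{n=1}(x_{n=1}, \hat{x}_{n=1})$. Step n = 2: ISR maps $x_{n=2}$ to $\hat{y}_{n=2}$ = “d e”, ITTS maps $\hat{y}_{n=2}$ to $\hat{x}_{n=2}$, and the loss is $\mathrm{Loss}^{TTS}_{n=2}(x_{n=2}, \hat{x}_{n=2})$.]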

(13)

Incremental Machine Speech Chain Training

2. ISR and ITTS Joint Training (continued)

b. ITTS-to-ISR

For each step n, ITTS predicts $\hat{X}_n$ from $Y_n$, and then ISR predicts $\hat{Y}_n$ from the ITTS output $\hat{X}_n$.

[Figure: b. ITTS-to-ISR, unrolled over the full text $Y$ = “a b c d e f”. Step n = 1: ITTS maps $y_{n=1}$ = “a b c” to $\hat{x}_{n=1}$, ISR maps $\hat{x}_{n=1}$ to $\hat{y}_{n=1}$ = “a b c”, and the loss is $\mathrm{Loss}^{ASR}_{n=1}(y_{n=1}, \hat{y}_{n=1})$. Step n = 2: ITTS maps $y_{n=2}$ = “d e” to $\hat{x}_{n=2}$, ISR maps $\hat{x}_{n=2}$ to $\hat{y}_{n=2}$ = “d e”, and the loss is $\mathrm{Loss}^{ASR}_{n=2}(y_{n=2}, \hat{y}_{n=2})$.]
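The segment-wise feedback loop in both directions can be sketched with toy stand-ins for the two models. Everything named here is hypothetical: `toy_isr`, `toy_itts`, the squared-L2 speech loss, and the character-mismatch text loss are illustrative assumptions, and no parameter update is shown. The point is only that a loss becomes available after every segment rather than after the full utterance.

```python
from typing import Callable, List

Speech = List[float]


def squared_l2(ref: Speech, hyp: Speech) -> float:
    # Assumed speech loss: sum of squared differences over aligned frames.
    return sum((r - h) ** 2 for r, h in zip(ref, hyp))


def char_mismatch(ref: str, hyp: str) -> int:
    # Crude assumed text loss: position-wise mismatches plus length difference.
    return sum(r != h for r, h in zip(ref, hyp)) + abs(len(ref) - len(hyp))


def isr_to_itts(speech_segments: List[Speech],
                isr: Callable[[Speech], str],
                itts: Callable[[str], Speech]) -> List[float]:
    losses = []
    for x_n in speech_segments:          # one short segment per step n
        y_hat = isr(x_n)                 # ISR predicts text from speech segment
        x_hat = itts(y_hat)              # ITTS reconstructs speech from ISR output
        losses.append(squared_l2(x_n, x_hat))   # per-step TTS loss
    return losses


def itts_to_isr(text_segments: List[str],
                isr: Callable[[Speech], str],
                itts: Callable[[str], Speech]) -> List[int]:
    losses = []
    for y_n in text_segments:
        x_hat = itts(y_n)                # ITTS predicts speech from text segment
        y_hat = isr(x_hat)               # ISR reconstructs text from ITTS output
        losses.append(char_mismatch(y_n, y_hat))  # per-step ASR loss
    return losses


def toy_itts(text: str) -> Speech:
    # Hypothetical synthesizer: one "frame" per character, slightly off target.
    return [ord(c) + 0.25 for c in text]


def toy_isr(speech: Speech) -> str:
    # Hypothetical recognizer: rounds each frame back to a character code.
    return "".join(chr(int(round(v))) for v in speech)


print(isr_to_itts([[97.0, 98.0, 99.0], [100.0, 101.0]], toy_isr, toy_itts))
# prints [0.1875, 0.125]
print(itts_to_isr(["abc", "de"], toy_isr, toy_itts))
# prints [0, 0]
```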

(14)

Incremental Machine Speech Chain

Learning Approach

Exploration of 2 learning approaches:

A) Semi-supervised incremental machine speech chain

1) ISR/ITTS independent training: supervised

2) ISR/ITTS joint training: unsupervised (unlabeled data)

B) Supervised incremental machine speech chain

1) ISR/ITTS independent training: supervised

2) ISR/ITTS joint training: supervised (labeled data)

[Figure: Unrolled process examples in joint training, shown for ISR-to-ITTS (ITTS-to-ISR follows a similar mechanism). (A) Unsupervised, greedy: ISR maps $x_n$ to $\hat{y}_n$ = “a b c”, and ITTS maps it to $\hat{x}_n$. (B) Supervised, teacher forcing: the correct text “a b c” is used between ISR and ITTS.]
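The slide contrasts greedy decoding with teacher forcing but does not spell out how each plugs into the chain, so the sketch below only illustrates the generic difference between the two in an autoregressive decoder. The decoder `toy_step` and its lookup table are hypothetical, chosen so that one early mistake is visible.

```python
from typing import Callable, List


def toy_step(prev_token: str) -> str:
    # Hypothetical one-step decoder: predicts the next token from the previous
    # input token. It wrongly predicts "x" after "a" (the reference has "b").
    table = {"<s>": "a", "a": "x", "x": "y", "b": "c"}
    return table.get(prev_token, "?")


def decode_greedy(step: Callable[[str], str], n_steps: int) -> List[str]:
    # Greedy: each prediction is fed back as the next decoder input,
    # so an early error propagates to later steps.
    prev, out = "<s>", []
    for _ in range(n_steps):
        prev = step(prev)
        out.append(prev)
    return out


def decode_teacher_forcing(step: Callable[[str], str], reference: str) -> List[str]:
    # Teacher forcing: the decoder input at each step is the correct previous
    # token from the reference, regardless of what was predicted.
    inputs = ["<s>"] + list(reference[:-1])
    return [step(tok) for tok in inputs]


print(decode_greedy(toy_step, 3))               # prints ['a', 'x', 'y']
print(decode_teacher_forcing(toy_step, "abc"))  # prints ['a', 'x', 'c']
```

With the reference “abc”, greedy decoding never recovers after the wrong “x”, while teacher forcing limits the damage to that single step. Teacher forcing needs the correct text, which is why it corresponds to the labeled-data setting (B).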

(15)

III. Experiments


(16)

Experiments

Dataset

Wall Street Journal CSR Corpus [Paul and Baker, 1992]

Language : English

Training sets:

o SI-84 : 16 hours of speech, 83 speakers

o SI-200 : 66 hours of speech, 200 speakers

o SI-284 : SI-84 + SI-200

Dev. set : dev93

Eval. set : eval92

Text units: character-level

Speech features: 80-dimensional log Mel spectrogram (window: 50 msec, shift: 12.5 msec)

(17)

Experiments

Model Configuration

* Same architecture for standard (non-incremental) and incremental models

ASR

• Input: speech features (x1, x2, …, xS); output: transcription (y1, y2, …, yT)

• Encoder: FNN, 3 BiLSTM layers, with hierarchical sub-sampling of 8 feature frames into 1 encoder state

• Decoder: character embedding, LSTM, attention

• Input per step:

o ISR: 0.84 sec

o Std. ASR: full utterance (avg. 7.88 sec)

TTS

• Tacotron 2 [Wang et al., 2017] structure with speaker embedding [Tjandra et al., 2018]

• Input: transcription (y0, y1, …, yT-1); output: speech features, several frames per decoder step (x1,…,x4; x5,…,x8; …; xT-3,…,xT)

• Encoder: character embedding, 3 Conv layers, BiLSTM

• Decoder: 2 Pre-Net layers, 2 LSTM layers, linear projection, attention, speaker embedding

• Mel-to-linear CBHG

• Input per step:

o ITTS: avg. 30 chars

o Std. TTS: full sentence (avg. 103 chars)
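As a quick check on how the feature settings relate to the ISR input size, the frame count for a 0.84 sec block can be worked out from the 50 msec window and 12.5 msec shift. The 16 kHz sampling rate and the absence of padding are assumptions made here, not values stated on the slides.

```python
SAMPLE_RATE = 16_000      # assumption: sampling rate is not stated on the slides
WINDOW_SEC = 0.050        # analysis window, from the dataset slide
SHIFT_SEC = 0.0125        # frame shift, from the dataset slide
FRAMES_PER_STATE = 8      # hierarchical sub-sampling: 8 frames -> 1 encoder state


def num_frames(duration_sec: float) -> int:
    """Number of full analysis windows that fit, assuming no padding."""
    n_samples = round(duration_sec * SAMPLE_RATE)
    win = round(WINDOW_SEC * SAMPLE_RATE)
    hop = round(SHIFT_SEC * SAMPLE_RATE)
    if n_samples < win:
        return 0
    return 1 + (n_samples - win) // hop


def span_sec(n_frames: int) -> float:
    """Audio duration covered by n_frames consecutive frames."""
    return WINDOW_SEC + (n_frames - 1) * SHIFT_SEC


isr_frames = num_frames(0.84)
print(isr_frames, isr_frames // FRAMES_PER_STATE)  # prints 64 8
print(num_frames(7.88) // FRAMES_PER_STATE)        # prints 78
```

Under these assumptions one ISR step sees 64 frames, that is 8 encoder states, and 64 frames span 50 + 63 × 12.5 = 837.5 msec, which is consistent with the 0.84 sec figure. An average full utterance of 7.88 sec corresponds to roughly 78 encoder states.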

(18)

ASR (CER %) and TTS (log Mel-spectrogram L2 loss, (L2-norm)^2) performances

Delay: standard ASR 7.88 sec, incremental ASR (ISR) 0.84 sec; standard TTS 103 chars, incremental TTS (ITTS) 30 chars.

| Data | ASR standard, nat-sp | ASR standard, syn-sp | ASR incremental, nat-sp | ASR incremental, syn-sp | TTS standard, nat-txt | TTS standard, rec-txt | TTS incremental, nat-txt | TTS incremental, rec-txt |
|---|---|---|---|---|---|---|---|---|
| Independent Training | | | | | | | | |
| Indep-trn SI-84 | 17.33 | 27.03 | 17.81 | 44.54 | 0.99 | 1.02 | 1.04 | 3.62 |
| Indep-trn SI-284 | 7.16 | 9.60 | 7.97 | 19.99 | 0.75 | 0.77 | 0.84 | 1.31 |
| Machine Speech Chain | | | | | | | | |
| Indep-trn (SI-84) + chain-trn-greedy (SI-200) | 11.21 | 11.52 | 14.23 | 32.43 | 0.80 | 0.82 | 0.86 | 1.35 |
| Indep-trn (SI-84) + chain-trn-teachforce (SI-200) | 7.27 | 6.30 | 9.43 | 12.78 | 0.77 | 0.80 | 0.79 | 1.26 |

Input type:

o nat-sp: natural speech; syn-sp: synthetic speech produced by TTS/ITTS from natural text

o nat-txt: natural text; rec-txt: recognized text produced by ASR/ISR from natural speech

Baseline: ISR and ITTS, indep-trn SI-84

Topline: standard systems, indep-trn SI-284

Proposed: incremental machine speech chain

Result

Incremental machine speech chain

o Improved ISR and ITTS

o Shorter delay with performance close to the standard system
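For readers unfamiliar with the ASR metric in the table, character error rate is the character-level edit distance between reference and hypothesis, divided by the reference length. The sketch below uses the standard Levenshtein recurrence; the function names are illustrative, and the scoring tool actually used for the reported numbers is not stated on the slides.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum substitutions, deletions, and insertions."""
    prev = list(range(len(hyp) + 1))          # distance from empty ref prefix
    for i, r in enumerate(ref, 1):
        curr = [i]                            # distance to empty hyp prefix
        for j, h in enumerate(hyp, 1):
            curr.append(min(
                prev[j] + 1,                  # deletion of r
                curr[j - 1] + 1,              # insertion of h
                prev[j - 1] + (r != h),       # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer_percent(ref: str, hyp: str) -> float:
    """Character error rate in percent, relative to the reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


print(cer_percent("hello", "hallo"))  # prints 20.0
```

Because insertions count as errors, CER can exceed 100 % when the hypothesis is much longer than the reference.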

(19)

IV. Conclusion


(20)

Conclusion

Incremental machine speech chain

• Short-term feedback loop for ISR/ITTS development by mimicking the human speech chain

o Reduced the delay while keeping performance close to the basic framework

o Improved ISR and ITTS (natural/synthetic input)

o Synthetic input processing: demonstration of real-time feedback generation

(21)

Thank you
