saito18icassp tts poster


2. CONVENTIONAL ALGORITHMS



GAN-based TTS w/ vocoders [4]



(1) Update discriminative models $D(\cdot)$.



(2) Update acoustic models $G(\cdot)$.

[References]
[1] Goodfellow et al., Proc. NIPS, 2014.
[2] Takaki et al., Proc. INTERSPEECH, 2017.
[3] Griffin et al., IEEE Trans. on ASSP, 1984.
[4] Saito et al., IEEE/ACM Trans. on ASLP, 2018.
[5] Sonobe et al., arXiv:1711.00354, 2017.

3. PROPOSED ALGORITHMS

4. EXPERIMENTAL EVALUATION

• Experimental conditions

TEXT-TO-SPEECH SYNTHESIS USING STFT SPECTRA BASED ON LOW-/MULTI-RESOLUTION GENERATIVE ADVERSARIAL NETWORKS

Yuki Saito, Shinnosuke Takamichi, and Hiroshi Saruwatari

The University of Tokyo, Japan

©Yuki Saito, 18/04/2018

ICASSP 2018 SP-P6.6



Motivation

Amplitude spectra: high-dimensional features including spectral & excitation characteristics.

GAN-based TTS [4]: effective for spectral envelopes (i.e., mel-cepstral coefficients).

 

Dataset: 4,007 utterances from a Japanese female speaker (subset of the JSUT corpus [5]; 3,808/199 for training/evaluation)
STFT analysis: frame length 400, shift length 80, FFT length 1024, analysis window Hamming
Average pooling: zero padding 6, pooling width {14, 30, 70}, stride: half the pooling width
Adversarial loss weights: 1.0
Dims. of input / amplitude spectra / low-res. spectra: 444 (linguistic feats., durations, log F0, U/V) / 513 / {74, 34, 14} (the low-res. dim. changes with the pooling width)
DNN architectures: feed-forward (see our paper)
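The dimensions in the conditions above can be cross-checked with a few lines of arithmetic. This is only a sketch: the helper names are hypothetical, and it assumes that the zero padding of 6 bins is applied on each side of the frequency axis and that the stride is half the pooling width, which is the reading that reproduces the listed low-res. dimensions.

```python
# Sanity check of the dimensions listed in the experimental conditions.
# Assumptions (not spelled out on the poster): 6 zero bins are padded on
# EACH side, and the pooling stride is half the pooling width.

def stft_bins(fft_length: int) -> int:
    """Number of non-redundant frequency bins of a real-input FFT."""
    return fft_length // 2 + 1


def pooled_dim(n_bins: int, pad: int, width: int, stride: int) -> int:
    """Output length of 1-D pooling along frequency with symmetric padding."""
    return (n_bins + 2 * pad - width) // stride + 1


n_bins = stft_bins(1024)
low_res_dims = [pooled_dim(n_bins, pad=6, width=w, stride=w // 2)
                for w in (14, 30, 70)]
print(n_bins, low_res_dims)  # 513 [74, 34, 14]
```

Under these assumptions the FFT length of 1024 gives the 513-dimensional amplitude spectra, and the three pooling widths give 74, 34, and 14 dimensions.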



DNN-based TTS w/ STFT spectra [2]

(1) Generate raw STFT amplitude spectra.
(2) Reconstruct phase spectra by using Griffin and Lim's method [3].

Pros.: avoiding a vocoding process
Cons.: over-smoothing of amplitude spectra
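Step (2) can be illustrated with a minimal NumPy sketch of Griffin and Lim's iterative phase reconstruction, using the STFT settings from the experimental conditions (frame length 400, shift 80, FFT length 1024, Hamming window). The function names, the random phase initialization, the iteration count, and the toy sine input are assumptions for illustration; this is not the authors' implementation.

```python
import numpy as np

FRAME, HOP, NFFT = 400, 80, 1024   # STFT settings listed on the poster
WINDOW = np.hamming(FRAME)


def stft(x):
    """Complex STFT of a 1-D signal, shape (n_frames, NFFT // 2 + 1)."""
    n_frames = 1 + (len(x) - FRAME) // HOP
    frames = np.stack([x[i * HOP:i * HOP + FRAME] * WINDOW
                       for i in range(n_frames)])
    return np.fft.rfft(frames, n=NFFT, axis=1)


def istft(spec):
    """Least-squares inverse STFT: windowed overlap-add, normalized by window^2."""
    n_frames = spec.shape[0]
    length = FRAME + HOP * (n_frames - 1)
    frames = np.fft.irfft(spec, n=NFFT, axis=1)[:, :FRAME] * WINDOW
    signal = np.zeros(length)
    norm = np.zeros(length)
    for i in range(n_frames):
        signal[i * HOP:i * HOP + FRAME] += frames[i]
        norm[i * HOP:i * HOP + FRAME] += WINDOW ** 2
    return signal / np.maximum(norm, 1e-8)


def griffin_lim(magnitude, n_iter=50, seed=0):
    """Estimate a waveform whose STFT magnitude approximates `magnitude`."""
    rng = np.random.default_rng(seed)
    # Assumption: start from uniformly random phase.
    phase = np.exp(2j * np.pi * rng.random(magnitude.shape))
    for _ in range(n_iter):
        signal = istft(magnitude * phase)            # back to the time domain
        phase = np.exp(1j * np.angle(stft(signal)))  # keep only the new phase
    return istft(magnitude * phase)


n = np.arange(FRAME + HOP * 45)        # 4000 samples, an exact number of frames
x = np.sin(2 * np.pi * 0.05 * n)       # toy signal, not speech
magnitude = np.abs(stft(x))            # stands in for generated amplitude spectra
y = griffin_lim(magnitude, n_iter=20)
```

In a TTS setting, `magnitude` would be the amplitude spectra produced by the acoustic models rather than those of a known signal.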

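[Figure: GAN-based TTS w/ vocoders. Acoustic models $G(\cdot)$ map linguistic feats. to generated speech params. $\hat{\boldsymbol{y}}$; discriminative models $D(\cdot)$ distinguish natural speech params. $\boldsymbol{y}$ (1: natural) from generated ones (0: generated).]

$$L_{\mathrm{MSE}}(\boldsymbol{y}, \hat{\boldsymbol{y}}) = \frac{1}{T} (\hat{\boldsymbol{y}} - \boldsymbol{y})^{\top} (\hat{\boldsymbol{y}} - \boldsymbol{y})$$

$$L_{D}(\boldsymbol{y}, \hat{\boldsymbol{y}}) = \underbrace{-\frac{1}{T} \sum_{t=1}^{T} \log D(\boldsymbol{y}_t)}_{\text{loss for natural params.}} \; \underbrace{- \frac{1}{T} \sum_{t=1}^{T} \log \left( 1 - D(\hat{\boldsymbol{y}}_t) \right)}_{\text{loss for generated params.}}$$

$$L_{\mathrm{ADV}}(\hat{\boldsymbol{y}}) = -\frac{1}{T} \sum_{t=1}^{T} \log D(\hat{\boldsymbol{y}}_t) \quad \text{(target 1: natural)}$$

$$L_{G}(\boldsymbol{y}, \hat{\boldsymbol{y}}) = L_{\mathrm{MSE}}(\boldsymbol{y}, \hat{\boldsymbol{y}}) + \omega_{D} \frac{E_{L_{\mathrm{MSE}}}}{E_{L_{\mathrm{ADV}}}} L_{\mathrm{ADV}}(\hat{\boldsymbol{y}})$$

Adversarial term: minimizing approx. JS-divergence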
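The losses of the conventional GAN-based training can be sketched in NumPy as follows. The function names are hypothetical, `d_nat` and `d_gen` stand for per-frame discriminator outputs in (0, 1) for natural and generated parameters, and the small `eps` inside the logarithms is a numerical safeguard added here, not part of the poster. The scale factor is taken as the ratio of the expectation values of the MSE and adversarial losses, passed in as plain numbers.

```python
import numpy as np


def mse_loss(y, y_hat):
    """L_MSE = (1/T) (y_hat - y)^T (y_hat - y) for arrays of shape (T, dim)."""
    T = y.shape[0]
    diff = (y_hat - y).ravel()
    return float(diff @ diff) / T


def discriminator_loss(d_nat, d_gen, eps=1e-12):
    """L_D: cross-entropy with target 1 for natural and 0 for generated frames."""
    natural_term = -np.mean(np.log(d_nat + eps))
    generated_term = -np.mean(np.log(1.0 - d_gen + eps))
    return float(natural_term + generated_term)


def adversarial_loss(d_gen, eps=1e-12):
    """L_ADV: generated frames should be classified as natural (target 1)."""
    return float(-np.mean(np.log(d_gen + eps)))


def generator_loss(l_mse, l_adv, e_mse, e_adv, omega_d=1.0):
    """L_G = L_MSE + omega_D * (E_MSE / E_ADV) * L_ADV."""
    return l_mse + omega_d * (e_mse / e_adv) * l_adv


# Toy values: 2 frames of 3-dim. params. and an undecided discriminator.
y = np.zeros((2, 3))
y_hat = np.ones((2, 3))
d_nat = np.array([0.5, 0.5])
d_gen = np.array([0.5, 0.5])

l_mse = mse_loss(y, y_hat)                 # 6 / 2 = 3.0
l_d = discriminator_loss(d_nat, d_gen)     # 2 * ln 2
l_adv = adversarial_loss(d_gen)            # ln 2
l_g = generator_loss(l_mse, l_adv, e_mse=l_mse, e_adv=l_adv)
```

With the scale factor computed from the same values as the losses, the adversarial term contributes as much as the MSE term, which is the balancing role of the ratio.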



Assumption: low-res. spectra ≒ spectral envelopes

[Figure: natural and generated amplitude spectra (e.g., 513 dim.) each pass through average pooling; the distribution difference between the pooled spectra is minimized.]



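GAN-based TTS w/ low-res. STFT spectra

[Figure: acoustic models $G(\cdot)$ map linguistic feats. & log F0 to generated spectral amplitudes; natural and generated spectral amplitudes pass through average pooling and are fed to low-res. discriminative models $D^{(\mathrm{Low})}(\cdot)$ (1: natural); $L_{\mathrm{MSE}}$ is computed between the natural and generated spectral amplitudes.]

$$L_{G}^{(\mathrm{Low})}(\boldsymbol{y}, \hat{\boldsymbol{y}}) = L_{\mathrm{MSE}}(\boldsymbol{y}, \hat{\boldsymbol{y}}) + \omega_{D}^{(\mathrm{Low})} \frac{E_{L_{\mathrm{MSE}}}}{E_{L_{\mathrm{ADV}}^{(\mathrm{Low})}}} L_{\mathrm{ADV}}^{(\mathrm{Low})}(\hat{\boldsymbol{y}})$$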
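The average pooling that produces the low-res. spectra can be sketched in NumPy as below. The function name is hypothetical, and the sketch assumes symmetric zero padding of 6 bins per side and a stride of half the pooling width, which reproduces the low-res. dimensions {74, 34, 14} given in the experimental conditions.

```python
import numpy as np


def average_pool_freq(spectra, width, stride, pad=6):
    """Average-pool amplitude spectra of shape (T, n_bins) along frequency.

    Assumption: `pad` zero bins are added on each side before pooling.
    """
    padded = np.pad(spectra, ((0, 0), (pad, pad)))  # constant zero padding
    n_out = (padded.shape[1] - width) // stride + 1
    pooled = [padded[:, k * stride:k * stride + width].mean(axis=1)
              for k in range(n_out)]
    return np.stack(pooled, axis=1)


spectra = np.ones((5, 513))                 # 5 frames of toy 513-dim. spectra
low_res = {w: average_pool_freq(spectra, width=w, stride=w // 2)
           for w in (14, 30, 70)}
print({w: v.shape for w, v in low_res.items()})
```

The same pooling would be applied to natural and generated spectra before they reach the low-res. discriminative models, so the discriminator only sees a smoothed, envelope-like view of each frame.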



GAN-based TTS w/ multi-res. STFT spectra

 

Approximately emulates the filter bank extraction



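Subjective evaluation (preference AB tests)

MSE: minimizing $L_{\mathrm{MSE}}(\boldsymbol{y}, \hat{\boldsymbol{y}})$ [2]
ADV: minimizing $L_{G}(\boldsymbol{y}, \hat{\boldsymbol{y}})$ [4]
ADV-Low: minimizing $L_{G}^{(\mathrm{Low})}(\boldsymbol{y}, \hat{\boldsymbol{y}})$ (proposed)
ADV-Multi: minimizing $L_{G}^{(\mathrm{Multi})}(\boldsymbol{y}, \hat{\boldsymbol{y}})$ (proposed)

$$L_{G}^{(\mathrm{Multi})}(\boldsymbol{y}, \hat{\boldsymbol{y}}) = L_{\mathrm{MSE}}(\boldsymbol{y}, \hat{\boldsymbol{y}}) + \omega_{D} \frac{E_{L_{\mathrm{MSE}}}}{E_{L_{\mathrm{ADV}}}} L_{\mathrm{ADV}}(\hat{\boldsymbol{y}}) + \omega_{D}^{(\mathrm{Low})} \frac{E_{L_{\mathrm{MSE}}}}{E_{L_{\mathrm{ADV}}^{(\mathrm{Low})}}} L_{\mathrm{ADV}}^{(\mathrm{Low})}(\hat{\boldsymbol{y}})$$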
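The multi-resolution generator loss adds one scaled adversarial term per discriminator to the MSE loss. A minimal sketch, assuming the loss values and their expectation values are already available as plain numbers; the function name and the toy values are hypothetical, and the weights of 1.0 follow the experimental conditions.

```python
def multi_res_generator_loss(l_mse, e_mse, adv_terms):
    """MSE loss plus one scaled adversarial term per discriminator.

    adv_terms: iterable of (weight, l_adv, e_adv) tuples, e.g. one for the
    original-resolution discriminator and one for the low-res. discriminator.
    """
    total = l_mse
    for weight, l_adv, e_adv in adv_terms:
        total += weight * (e_mse / e_adv) * l_adv
    return total


# Toy values, not results from the poster.
loss = multi_res_generator_loss(
    l_mse=2.0,
    e_mse=2.0,
    adv_terms=[
        (1.0, 0.5, 0.5),    # original-resolution adversarial term
        (1.0, 0.25, 0.5),   # low-res. adversarial term
    ],
)
print(loss)  # 5.0
```

Passing a single tuple reduces this to the single-discriminator case, so the same helper covers ADV, ADV-Low, and ADV-Multi depending on which terms are supplied.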



Results:

(1) ADV & ADV-Multi didn't improve speech quality.
⇒ Because of the difficulty of minimizing distribution differences at the original resolution

(2) ADV-Low improved speech quality.
⇒ Effect of spectral-envelope compensation

[Figure: Examples of 513-dimensional amplitude spectra. Panels: Natural, MSE, ADV, ADV-Low (proposed), ADV-Multi (proposed).]

[Figure: ADV-Low (three pooling-width settings) vs. MSE.]

1. SYNOPSIS

• In Text-To-Speech (TTS) synthesis w/ Short-Term Fourier Transform (STFT) spectra:

•