
2. CONVENTIONAL ALGORITHMS

GAN-based TTS w/ vocoders [4]

(1) Update discriminative models $D(\cdot)$.

(2) Update acoustic models $G(\cdot)$.

[References]

[1] Goodfellow et al., Proc. NIPS, 2014.
[2] Takaki et al., Proc. INTERSPEECH, 2017.
[3] Griffin et al., IEEE Trans. on ASSP, 1984.
[4] Saito et al., IEEE/ACM Trans. on ASLP, 2018.
[5] Sonobe et al., arXiv:1711.00354, 2017.

3. PROPOSED ALGORITHMS

4. EXPERIMENTAL EVALUATION

Experimental conditions

TEXT-TO-SPEECH SYNTHESIS USING STFT SPECTRA BASED ON LOW-/MULTI-RESOLUTION GENERATIVE ADVERSARIAL NETWORKS

Yuki Saito, Shinnosuke Takamichi, and Hiroshi Saruwatari

The University of Tokyo, Japan

©Yuki Saito, 18/04/2018

ICASSP 2018 SP-P6.6

Motivation

Amplitude spectra: high-dimensional features including spectral & excitation characteristics.

GAN-based TTS [4]: effective for spectral envelopes (i.e., mel-cepstral coefficients).

Dataset: 4,007 utterances from a Japanese female speaker (subset of the JSUT [5] corpus; 3,808/199 for training/evaluation)

STFT analysis: frame length 400, shift length 80, FFT length 1024, analysis window Hamming

Average pooling: zero padding 6, pooling width {14, 30, 70}, stride half the pooling width

Adversarial-loss weights $\omega_D^{(\mathrm{O})}$ and $\omega_D^{(\mathrm{L})}$: 1.0

Dims. of $\boldsymbol{x}$ / $\boldsymbol{y}$ / $\phi(\boldsymbol{y})$: 444 (linguistic feats., durations, log F0, U/V) / 513 / {74, 34, 14} (the dim. of $\phi(\boldsymbol{y})$ changes with the pooling width)

DNN architectures: feed-forward (see our paper)
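The listed dimensionalities can be cross-checked with a little arithmetic. The sketch below is an illustration, not the authors' code: it assumes the usual 1-D pooling output-length formula, that the zero padding of 6 is applied on each side, and that the stride is half the pooling width. Under those assumptions an FFT length of 1024 gives 513 bins, and the three pooling widths give the dimensions listed above. The function names are hypothetical.

```python
def stft_bins(fft_length: int) -> int:
    """Number of non-redundant frequency bins of a real-input FFT."""
    return fft_length // 2 + 1


def pooled_dim(n_bins: int, width: int, stride: int, pad: int) -> int:
    """Output length of 1-D pooling with `pad` zeros added on each side."""
    return (n_bins + 2 * pad - width) // stride + 1


N_BINS = stft_bins(1024)

# Assumption: stride = width // 2 and padding of 6 on each side.
POOLED_DIMS = [pooled_dim(N_BINS, w, w // 2, pad=6) for w in (14, 30, 70)]

print(N_BINS, POOLED_DIMS)  # 513 [74, 34, 14]
```

With a stride equal to the full pooling width the same formula does not reproduce {74, 34, 14}, which is why the half-width stride is assumed here.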

DNN-based TTS w/ STFT spectra [2]

(1) Generate raw STFT amplitude spectra.

(2) Reconstruct phase spectra by using Griffin and Lim's method [3].

Pros.: avoiding a vocoding process

Cons.: over-smoothing of amplitude spectra
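For reference, the phase reconstruction in step (2) is commonly written as the following iteration. This is the textbook form of Griffin and Lim's method rather than anything specific to this poster; the symbols $A$ (generated amplitude spectrogram), $X^{(i)}$ (complex spectrogram estimate), and $x^{(i)}$ (waveform estimate) are notation introduced here for illustration.

```latex
% A        : amplitude spectrogram generated by the acoustic models (kept fixed)
% X^{(i)}  : complex spectrogram estimate at iteration i
% x^{(i)}  : waveform estimate at iteration i
\begin{align}
X^{(0)}   &= A \odot e^{j\theta^{(0)}}, \qquad \theta^{(0)} \text{ chosen arbitrarily (e.g., random)} \\
x^{(i)}   &= \mathrm{ISTFT}\bigl(X^{(i)}\bigr) \\
X^{(i+1)} &= A \odot \exp\bigl(j \angle\, \mathrm{STFT}(x^{(i)})\bigr)
\end{align}
```

Each pass keeps the generated amplitudes and replaces only the phase with that of a spectrogram that some waveform can actually produce, so no vocoder is involved.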

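[Figure: linguistic feats. are fed to the acoustic models $G(\cdot)$, which output generated speech params.; the discriminative models $D(\cdot)$ receive natural and generated speech params. and output 1: natural or 0: generated.]

$L_{\mathrm{MSE}}(\boldsymbol{y}, \hat{\boldsymbol{y}}) = \frac{1}{T} (\hat{\boldsymbol{y}} - \boldsymbol{y})^{\top} (\hat{\boldsymbol{y}} - \boldsymbol{y})$

$L_{\mathrm{ADV}}^{(\mathrm{GAN})}(\hat{\boldsymbol{y}}) = -\frac{1}{T} \sum_{t=1}^{T} \log D(\hat{\boldsymbol{y}}_t)$ (generated params. are labeled 1: natural)

$L_D(\boldsymbol{y}, \hat{\boldsymbol{y}}) = -\frac{1}{T} \sum_{t=1}^{T} \log D(\boldsymbol{y}_t) - \frac{1}{T} \sum_{t=1}^{T} \log \bigl(1 - D(\hat{\boldsymbol{y}}_t)\bigr)$

The first term of $L_D$ is the loss for natural params.; the second is the loss for generated params.

$L_G(\boldsymbol{y}, \hat{\boldsymbol{y}}) = L_{\mathrm{MSE}}(\boldsymbol{y}, \hat{\boldsymbol{y}}) + \omega_D \frac{E_{L_{\mathrm{MSE}}}}{E_{L_{\mathrm{ADV}}}} L_{\mathrm{ADV}}^{(\mathrm{GAN})}(\hat{\boldsymbol{y}})$

Minimizing $L_{\mathrm{ADV}}^{(\mathrm{GAN})}$: minimizing the approx. JS-divergence.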
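The conventional GAN-based losses can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: frames are lists of floats, the discriminator outputs are given directly as per-frame probabilities of "natural", and the scale factors `e_mse` and `e_adv` (standing in for the expectation values of the two losses) are passed in as plain numbers. All function and parameter names are hypothetical.

```python
import math


def mse_loss(y, y_hat):
    """Squared error between natural and generated frames, averaged over frames."""
    total = sum((b - a) ** 2 for f, g in zip(y, y_hat) for a, b in zip(f, g))
    return total / len(y)


def discriminator_loss(d_natural, d_generated):
    """Cross-entropy: natural frames labeled 1, generated frames labeled 0."""
    nat = -sum(math.log(p) for p in d_natural) / len(d_natural)
    gen = -sum(math.log(1.0 - p) for p in d_generated) / len(d_generated)
    return nat + gen


def adversarial_loss(d_generated):
    """Loss that pushes the discriminator to call generated frames natural."""
    return -sum(math.log(p) for p in d_generated) / len(d_generated)


def generator_loss(y, y_hat, d_generated, omega_d=1.0, e_mse=1.0, e_adv=1.0):
    """MSE plus the adversarial term, scaled by omega_d * e_mse / e_adv."""
    scale = omega_d * e_mse / e_adv
    return mse_loss(y, y_hat) + scale * adversarial_loss(d_generated)


# Toy values: two 2-dimensional frames and made-up discriminator outputs.
y = [[0.0, 0.0], [1.0, 1.0]]
y_hat = [[1.0, 2.0], [1.0, 1.0]]
print(mse_loss(y, y_hat))                    # 2.5
print(generator_loss(y, y_hat, [0.5, 0.5]))  # 2.5 + log(2)
```

In training, step (1) would minimize `discriminator_loss` with respect to the discriminative models and step (2) would minimize `generator_loss` with respect to the acoustic models, alternating between the two.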

Assumption: low-res. spectra ≈ spectral envelopes

[Figure: natural and generated amplitude spectra (e.g., 513 dim.) each pass through average pooling, and the distribution difference between the pooled spectra is minimized.]
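Average pooling along the frequency axis is what turns a full-resolution amplitude spectrum into a low-resolution one. The sketch below shows it for a single frame in plain Python. It is an illustration only: the function name is hypothetical, and the zero padding on each side and the half-width stride used in the example call are assumptions chosen to match the dimensions quoted in the experimental conditions.

```python
def average_pool(frame, width, stride, pad=0):
    """1-D average pooling over one spectral frame, zero-padded on each side."""
    padded = [0.0] * pad + list(frame) + [0.0] * pad
    n_out = (len(padded) - width) // stride + 1
    return [
        sum(padded[i * stride : i * stride + width]) / width
        for i in range(n_out)
    ]


# Small example: non-overlapping windows of width 2.
print(average_pool([1.0, 2.0, 3.0, 4.0], width=2, stride=2))  # [1.5, 3.5]

# A flat 513-bin frame pooled with width 14, stride 7, padding 6.
low_res = average_pool([1.0] * 513, width=14, stride=7, pad=6)
print(len(low_res))  # 74
```

Each output value averages a band of neighboring bins, so fine harmonic structure is smoothed out while the coarse spectral shape is kept, which is the intuition behind treating low-res. spectra as spectral envelopes.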

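GAN-based TTS w/ low-res. STFT spectra

[Figure: linguistic feats. & log F0 are fed to the acoustic models $G(\cdot)$, which output generated spectral amplitudes; natural and generated spectral amplitudes each pass through average pooling $\phi(\cdot)$ before the low-res. discriminative models (1: natural).]

$L_G^{(\mathrm{Low})}(\boldsymbol{y}, \hat{\boldsymbol{y}}) = L_{\mathrm{MSE}}(\boldsymbol{y}, \hat{\boldsymbol{y}}) + \omega_D^{(\mathrm{L})} \frac{E_{L_{\mathrm{MSE}}}}{E_{L_{\mathrm{ADV}}^{(\mathrm{L})}}} L_{\mathrm{ADV}}^{(\mathrm{GAN})}(\phi(\hat{\boldsymbol{y}}))$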

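GAN-based TTS w/ multi-res. STFT spectra

Average pooling approximately emulates the filter bank extraction.

$L_G^{(\mathrm{Multi})}(\boldsymbol{y}, \hat{\boldsymbol{y}}) = L_{\mathrm{MSE}}(\boldsymbol{y}, \hat{\boldsymbol{y}}) + \omega_D^{(\mathrm{O})} \frac{E_{L_{\mathrm{MSE}}}}{E_{L_{\mathrm{ADV}}^{(\mathrm{O})}}} L_{\mathrm{ADV}}^{(\mathrm{GAN})}(\hat{\boldsymbol{y}}) + \omega_D^{(\mathrm{L})} \frac{E_{L_{\mathrm{MSE}}}}{E_{L_{\mathrm{ADV}}^{(\mathrm{L})}}} L_{\mathrm{ADV}}^{(\mathrm{GAN})}(\phi(\hat{\boldsymbol{y}}))$

Subjective evaluation (preference AB tests)

MSE: minimizing $L_{\mathrm{MSE}}(\boldsymbol{y}, \hat{\boldsymbol{y}})$ [2]
ADV: minimizing $L_G(\boldsymbol{y}, \hat{\boldsymbol{y}})$ [4]
ADV-Low: minimizing $L_G^{(\mathrm{Low})}(\boldsymbol{y}, \hat{\boldsymbol{y}})$ (proposed)
ADV-Multi: minimizing $L_G^{(\mathrm{Multi})}(\boldsymbol{y}, \hat{\boldsymbol{y}})$ (proposed)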
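One way to read the multi-resolution objective is as the MSE plus two adversarial terms, one from a discriminator on the original-resolution spectra and one from a discriminator on the average-pooled spectra. The sketch below follows that reading in plain Python. It is an assumption-laden illustration, not the authors' code: the discriminators are stand-in callables returning a probability of "natural" per frame, the scale factors are plain numbers, and every name is hypothetical. Setting `w_orig=0.0` leaves only the low-resolution term, which corresponds to the low-resolution variant.

```python
import math


def average_pool(frame, width, stride, pad=0):
    """1-D average pooling over one spectral frame, zero-padded on each side."""
    padded = [0.0] * pad + list(frame) + [0.0] * pad
    n_out = (len(padded) - width) // stride + 1
    return [sum(padded[i * stride : i * stride + width]) / width
            for i in range(n_out)]


def mse_loss(y, y_hat):
    """Squared error between natural and generated frames, averaged over frames."""
    total = sum((b - a) ** 2 for f, g in zip(y, y_hat) for a, b in zip(f, g))
    return total / len(y)


def adversarial_loss(discriminator, frames):
    """Negative mean log-probability that the frames are judged natural."""
    return -sum(math.log(discriminator(f)) for f in frames) / len(frames)


def multi_res_generator_loss(y, y_hat, d_orig, d_low, pool,
                             w_orig=1.0, w_low=1.0,
                             e_mse=1.0, e_adv_orig=1.0, e_adv_low=1.0):
    """MSE + original-resolution and low-resolution adversarial terms."""
    loss = mse_loss(y, y_hat)
    loss += w_orig * e_mse / e_adv_orig * adversarial_loss(d_orig, y_hat)
    pooled = [pool(f) for f in y_hat]
    loss += w_low * e_mse / e_adv_low * adversarial_loss(d_low, pooled)
    return loss


# Toy setup: an undecided stand-in discriminator and width-2 pooling.
def toy_d(frame):
    return 0.5


def pool(frame):
    return average_pool(frame, width=2, stride=2)


y = [[1.0, 2.0, 3.0, 4.0]]
multi = multi_res_generator_loss(y, y, toy_d, toy_d, pool)
low_only = multi_res_generator_loss(y, y, toy_d, toy_d, pool, w_orig=0.0)
print(multi, low_only)  # 2*log(2) and log(2) for this toy setup
```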

Results:

(1) ADV & ADV-Multi didn't improve speech quality, because minimizing distribution differences in the original resolution is difficult.

(2) ADV-Low improved speech quality, an effect of compensating the spectral envelopes.

[Figure: examples of 513-dimensional amplitude spectra. Panels: Natural, MSE, ADV, ADV-Low (proposed), ADV-Multi (proposed).]

[Figure: subjective evaluation results for MSE and ADV-Low at each pooling width.]

1. SYNOPSIS

In Text-To-Speech (TTS) synthesis w/ Short-Term Fourier Transform (STFT) spectra:

(1) Proposes a Generative Adversarial Network (GAN) [1]-based training algorithm using low-/multi-resolution spectra.
