saito18icassp vc poster


[References]
• [1] Hsu et al., Proc. APSIPA ASC, 2016.
• [2] Hojo et al., IEICE Trans. on Info. and Syst., 2017.
• [3] Dilokthanakul et al., arXiv:1611.02648, 2016.
• [4] Sun et al., Proc. ICME, 2016.
• [5] Variani et al., Proc. ICASSP, 2014.
• [6] Luong et al., Proc. ICASSP, 2017.

3. PROPOSED VAE-BASED VC

• Estimating unknown speaker representation
  • (1) Speaker code adaptation [6]: using backprop. to adapt speaker codes
  • (2) d-vector averaging: using the d-vec. averaged over the voiced region of the speaker's adaptation data
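The d-vector averaging step can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `average_dvector`, the per-frame voiced flags, and the toy 3-dimensional vectors are assumptions made for the example.

```python
def average_dvector(frame_dvectors, voiced_flags):
    """Average frame-level d-vectors over voiced frames only.

    frame_dvectors: list of equal-length lists (one d-vector per frame).
    voiced_flags:   list of bools, True where the frame is voiced.
    Both inputs are hypothetical stand-ins for the outputs of a speaker
    verification model and an F0 extractor.
    """
    if len(frame_dvectors) != len(voiced_flags):
        raise ValueError("need exactly one voiced flag per frame")
    voiced = [v for v, flag in zip(frame_dvectors, voiced_flags) if flag]
    if not voiced:
        raise ValueError("no voiced frames in the adaptation data")
    dim = len(voiced[0])
    return [sum(v[d] for v in voiced) / len(voiced) for d in range(dim)]


frames = [[1.0, 0.0, 2.0], [3.0, 4.0, 0.0], [9.0, 9.0, 9.0]]
flags = [True, True, False]  # the third frame is unvoiced and is ignored
speaker_vec = average_dvector(frames, flags)
print(speaker_vec)
```

The result is a single time-invariant vector per speaker, so it can be computed from a small amount of adaptation data without retraining the VC model.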

2. CONVENTIONAL VAE-BASED VC [1]

• Training VAEs for VC
• Conversion using trained VAEs
• Assumption: the latent variable z is independent of the speaker code s
  ⇒ z is expected to represent phonetic contents.
• Problems
  • (1) Converted speech quality is degraded due to vanishing phonetic properties.
  • (2) It supports only one-to-one VC.

4. EXPERIMENTAL EVALUATION

• Experimental conditions
• Objective evaluation (Mel-Cepstral Distortion)
• Subjective evaluation (MOS/DMOS tests)
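For readers unfamiliar with the objective measure, the sketch below computes Mel-Cepstral Distortion (MCD) in dB. The poster does not state the exact definition used, so this assumes the widely used form MCD = (10 / ln 10) * sqrt(2 * sum_d (c_d - c'_d)^2), averaged over time-aligned frames, with the 0th (energy) coefficient excluded. The function name and toy inputs are hypothetical.

```python
import math


def mel_cepstral_distortion(target, converted):
    """Mean MCD in dB over time-aligned frames.

    target, converted: lists of frames, each a list of mel-cepstral
    coefficients. Index 0 (the energy term) is skipped, a common
    convention that the poster does not confirm.
    """
    if len(target) != len(converted) or not target:
        raise ValueError("need the same non-zero number of frames")
    scale = 10.0 / math.log(10.0)
    total = 0.0
    for t_frame, c_frame in zip(target, converted):
        squared = sum((a - b) ** 2 for a, b in zip(t_frame[1:], c_frame[1:]))
        total += scale * math.sqrt(2.0 * squared)
    return total / len(target)


# One frame, one coefficient differing by 1.0 (energy term ignored).
mcd = mel_cepstral_distortion([[5.0, 1.0, 0.0]], [[0.0, 0.0, 0.0]])
print(round(mcd, 2))
```

Lower values mean the converted mel-cepstra are closer to the target speaker's.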

NON-PARALLEL VOICE CONVERSION USING VARIATIONAL AUTOENCODERS CONDITIONED BY PHONETIC POSTERIORGRAMS AND D-VECTORS

Yuki Saito, Yusuke Ijima, Kyosuke Nishida, and Shinnosuke Takamichi

The University of Tokyo, Japan

NTT Corporation, Japan

©Yuki Saito, 18/04/2018

ICASSP 2018 SP-P6.1



• VC using VAEs w/ PPGs and d-vectors
  • (1) Phonetic contents are given as PPGs [4].
    ⇒ Phonetic contents can be restored!
  • (2) Discrete speaker codes are replaced with continuous d-vectors [5].
    ⇒ Any speakers’ characteristics can be converted into any others!
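One way to picture the conditioning is as frame-wise concatenation of the decoder inputs, sketched below. The poster does not spell out how the conditioning is wired, so concatenation is an assumption, as are the function name `build_decoder_inputs` and the dummy values. The 56-dim. PPGs and 16-dim. d-vectors follow the experimental conditions; treating the remaining 64 dimensions as the latent size is also an assumption.

```python
def build_decoder_inputs(latents, ppgs, dvector):
    """Concatenate per-frame latents and PPGs with one utterance-level d-vector.

    latents: list of frames (time variant), hypothetical encoder samples.
    ppgs:    list of frames (time variant), phonetic posteriorgrams.
    dvector: single list (time invariant), repeated for every frame.
    """
    if len(latents) != len(ppgs):
        raise ValueError("latents and PPGs must have the same number of frames")
    return [list(z) + list(p) + list(dvector) for z, p in zip(latents, ppgs)]


num_frames = 3
latents = [[0.0] * 64 for _ in range(num_frames)]        # assumed latent size
ppgs = [[1.0 / 56] * 56 for _ in range(num_frames)]      # dummy uniform PPGs
target_dvector = [0.5] * 16                              # dummy target speaker

# Conversion keeps the content inputs and swaps in the target speaker's d-vector.
decoder_inputs = build_decoder_inputs(latents, ppgs, target_dvector)
print(len(decoder_inputs), len(decoder_inputs[0]))
```

Because the speaker input is a continuous vector rather than a one-hot code, the same decoder can in principle be driven by a d-vector from a speaker unseen in training.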

• Speech corpora:
  (a) For recognition/verification models: 260 speakers (130 male and 130 female speakers)
  (b) For VC models: divided parallel 425 utterances (2 male and 1 female speakers)
• Speech params. (including Δ & ΔΔ): 40-dim. mel-cepstral coefficients, log F0, 10-dim. bap
• DNN architecture: Feed-Forward (see our paper)
• Dim. of PPGs / d-vec. / latent variables: 56 (time variant) / 16 (time invariant) / 64
• VC settings:
  One-to-one: VC models are trained with corpora (b).
  Many-to-many: VC models are trained with corpora (a).

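\[
\mathcal{L}(\theta, \phi; \mathbf{x}, \mathbf{s}) = -D_{\mathrm{KL}}\left( q_\phi(\mathbf{z} \mid \mathbf{x}) \,\|\, \mathcal{N}(\mathbf{z}; \mathbf{0}, \mathbf{I}) \right) + \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})}\left[ \log p_\theta(\mathbf{x} \mid \mathbf{z}, \mathbf{s}) \right]
\]

First term: regularization term. Second term: reconstruction error.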
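The two terms of the VAE objective (a regularization term and a reconstruction error) can be evaluated numerically as sketched below. This is an illustration under stated assumptions, not the authors' training code: it assumes a diagonal-Gaussian encoder, a standard normal prior, and a unit-variance Gaussian decoder, and all function names are hypothetical. The networks and the sampling step are omitted.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, log_var))


def gaussian_log_likelihood(x, x_hat):
    """log N(x; x_hat, I): unit-variance Gaussian decoder (an assumption)."""
    squared_error = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return -0.5 * squared_error - 0.5 * len(x) * math.log(2.0 * math.pi)


def vae_objective(x, x_hat, mu, log_var):
    """Single-sample estimate of the objective (to be maximized).

    x_hat is the decoder output for one latent sample drawn from the
    encoder distribution given x and the speaker code.
    """
    return -kl_to_standard_normal(mu, log_var) + gaussian_log_likelihood(x, x_hat)


x = [0.2, -0.1]
value = vae_objective(x, x_hat=[0.2, -0.1], mu=[0.0, 0.0], log_var=[0.0, 0.0])
print(value)
```

With a perfect reconstruction and an encoder output equal to the prior, only the Gaussian normalization constant remains, which makes the example easy to check by hand.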

[Figure: conventional VAE-based VC. Input speech params. → Encoder → latent factors → Decoder, conditioned on speaker codes s [2] → generated speech params. The same encoder and decoder are used for training and for conversion.]

[Figure: MCD [dB] (lower is better) for one-to-one and many-to-many VC, panels (a) M2M and (b) M2F. Horizontal axes: number of training data and number of adaptation data (5, 10, 25, 50, 100, 200). *Reference: DNN trained w/ parallel data.]

[Figure: subjective evaluation for M2M and M2F. (a) Naturalness: MOS scores (1.0 to 5.0). (b) Speaker similarity: DMOS scores (1.0 to 5.0). Systems shown: Reference, Conventional, Proposed (spkr. codes), and Proposed (d-vec.), in one-to-one and many-to-many settings. Annotated "Improved!" in both panels.]

1. SYNOPSIS

• Our approaches for non-parallel Voice Conversion (VC) using Variational AutoEncoders (VAEs):
  • (1) Introduce Phonetic PosteriorGrams (PPGs) for dealing with speech quality degradation.
  • (2) Extend conventional one-to-one VC to many-to-many VC (any speakers to any other speakers).

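[Figure: proposed VAE-based VC. PPGs [4] (illustrated with phoneme labels a, i, u, p) from a pre-trained speech recognition model act as latent factors of phonetic contents. The d-vec. s [5] from a pre-trained speaker verification model, computed from adaptation data, acts as latent factors of speakers.]

⇒ Even unknown speakers can be involved in the VC process!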

(M2M: Male-to-Male, M2F: Male-to-Female) Legend: Reference*, Conventional, Proposed (spkr. codes), Proposed (d-vec.)