
[References]

• [1] Hsu et al., Proc. APSIPA ASC, 2016.
• [2] Hojo et al., IEICE Trans. on Info. and Syst., 2017.
• [3] Dilokthanakul et al., arXiv:1611.02648, 2016.
• [4] Sun et al., Proc. ICME, 2016.
• [5] Variani et al., Proc. ICASSP, 2014.
• [6] Luong et al., Proc. ICASSP, 2017.

3. PROPOSED VAE-BASED VC

Estimating unknown speaker representation

(1) Speaker code adaptation [6]: using backprop. to adapt speaker codes.

(2) d-vector averaging: using the averaged d-vec. in the voiced region of the speaker's adaptation data.
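The d-vector averaging step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `average_dvector`, the per-frame voiced/unvoiced flags, and the toy values are all assumptions made for the example.

```python
from typing import List, Sequence


def average_dvector(
    frame_dvectors: Sequence[Sequence[float]],
    voiced_flags: Sequence[bool],
) -> List[float]:
    """Average frame-level d-vectors over voiced frames only.

    frame_dvectors: one d-vector per frame of the adaptation data.
    voiced_flags:   True where the frame is voiced (hypothetical input,
                    e.g. derived from an F0 extractor).
    """
    if len(frame_dvectors) != len(voiced_flags):
        raise ValueError("need exactly one voiced flag per frame")
    voiced = [d for d, v in zip(frame_dvectors, voiced_flags) if v]
    if not voiced:
        raise ValueError("no voiced frames in the adaptation data")
    dim = len(voiced[0])
    # Component-wise mean over the voiced frames.
    return [sum(d[k] for d in voiced) / len(voiced) for k in range(dim)]


# Toy example: the third (unvoiced) frame is ignored.
frames = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
flags = [True, True, False]
print(average_dvector(frames, flags))  # [2.0, 3.0]
```

The result is a single time-invariant vector per speaker, which is what allows a speaker unseen during training to be represented from adaptation data alone.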

2. CONVENTIONAL VAE-BASED VC [1]

Training VAEs for VC

Conversion using trained VAEs

Assumption: the latent factors are independent of the speaker codes and are expected to represent phonetic contents.

Problems

(1) Converted speech quality is degraded due to vanishing phonetic properties.

(2) It supports only one-to-one VC.

4. EXPERIMENTAL EVALUATION

Experimental conditions

Objective evaluation (Mel-Cepstral Distortion)
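The poster does not spell out the distortion formula. A minimal sketch, assuming the commonly used definition MCD = (10 / ln 10) * sqrt(2 * sum_d (c_d - c'_d)^2), averaged over time-aligned frames with the 0th (energy) coefficient already excluded. The function name and inputs are hypothetical.

```python
import math
from typing import Sequence


def mel_cepstral_distortion(
    ref: Sequence[Sequence[float]],
    conv: Sequence[Sequence[float]],
) -> float:
    """Mean mel-cepstral distortion in dB over time-aligned frames.

    ref / conv: mel-cepstral coefficients per frame, same shape, assumed
    to be already aligned and to exclude the 0th coefficient.
    """
    if len(ref) != len(conv) or not ref:
        raise ValueError("sequences must be non-empty and equally long")
    const = 10.0 / math.log(10.0) * math.sqrt(2.0)
    total = 0.0
    for r, c in zip(ref, conv):
        # Euclidean distance between the two frames, scaled to dB.
        total += const * math.sqrt(sum((a - b) ** 2 for a, b in zip(r, c)))
    return total / len(ref)


# Identical sequences give zero distortion; lower values are better.
print(mel_cepstral_distortion([[0.1, 0.2]], [[0.1, 0.2]]))  # 0.0
```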

Subjective evaluation (MOS/DMOS tests)

NON-PARALLEL VOICE CONVERSION USING VARIATIONAL AUTOENCODERS CONDITIONED BY PHONETIC POSTERIORGRAMS AND D-VECTORS

Yuki Saito, Yusuke Ijima, Kyosuke Nishida, and Shinnosuke Takamichi

The University of Tokyo, Japan

NTT Corporation, Japan

©Yuki Saito, 18/04/2018

ICASSP 2018 SP-P6.1

VC using VAEs w/ PPGs and d-vectors

(1) Phonetic contents are given as PPGs [4].

Phonetic contents can be restored!

(2) Discrete speaker codes are replaced with continuous d-vectors [5].

⇒ Any speakers' characteristics can be converted into any others!
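One way to picture this conditioning is sketched below, assuming the decoder simply receives the per-frame latent factors and PPGs concatenated with a single utterance-level d-vector. The concatenation scheme, the function name `build_decoder_inputs`, and the dimensions (64, 56, 16) are illustrative assumptions, not details confirmed by the poster.

```python
from typing import List, Sequence


def build_decoder_inputs(
    latents: Sequence[Sequence[float]],
    ppgs: Sequence[Sequence[float]],
    dvector: Sequence[float],
) -> List[List[float]]:
    """Concatenate frame-wise latents and PPGs with one time-invariant d-vector."""
    if len(latents) != len(ppgs):
        raise ValueError("latents and PPGs must have the same number of frames")
    # The same d-vector is appended to every frame.
    return [list(z) + list(p) + list(dvector) for z, p in zip(latents, ppgs)]


# Toy utterance of 3 frames (all-zero placeholders instead of real features).
num_frames = 3
latents = [[0.0] * 64 for _ in range(num_frames)]
ppgs = [[0.0] * 56 for _ in range(num_frames)]
source_dvec = [0.1] * 16
target_dvec = [0.9] * 16

# Conversion amounts to decoding with the target speaker's d-vector
# while keeping the phonetic information (PPGs) of the source utterance.
reconstruction_inputs = build_decoder_inputs(latents, ppgs, source_dvec)
conversion_inputs = build_decoder_inputs(latents, ppgs, target_dvec)
print(len(conversion_inputs), len(conversion_inputs[0]))  # 3 136
```

Because the speaker representation is a continuous vector rather than a one-hot code, any speaker with a d-vector can be used as the target without retraining.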

Speech corpora:
  (a) For recognition/verification models: 260 speakers (130 male and 130 female speakers)
  (b) For VC models: divided parallel 425 utterances (2 male and 1 female speakers)

Speech params. (including Δ & ΔΔ): 40-dim. mel-cepstral coefficients, log F0, 10-dim. bap

DNN architecture: Feed-Forward (see our paper)

Dim. of PPGs / d-vec. / latent factors: 56 (time variant) / 16 (time invariant) / 64

VC settings:
  One-to-one: VC models are trained with corpora (b).
  Many-to-many: VC models are trained with corpora (a).

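$$
\mathcal{L}(\theta, \phi; x, s) = -D_{\mathrm{KL}}\left( q_\phi(z \mid x) \,\|\, \mathcal{N}(z; 0, I) \right) + \mathbb{E}_{q_\phi(z \mid x)}\left[ \log p_\theta(x \mid z, s) \right]
$$

Here x denotes the speech params., z the latent factors, and s the speaker codes. The first term is the regularization term and the second is the reconstruction error.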
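A VAE objective of this kind, a KL regularization term plus an expected reconstruction log-likelihood, can be evaluated numerically as sketched below. The sketch assumes a diagonal-Gaussian encoder, a unit-variance Gaussian decoder, and a single-sample estimate of the expectation; these choices and all function names are illustrative assumptions rather than details taken from the poster.

```python
import math
from typing import Sequence


def kl_to_standard_normal(mu: Sequence[float], log_var: Sequence[float]) -> float:
    """Closed-form D_KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))


def gaussian_log_likelihood(x: Sequence[float], x_hat: Sequence[float]) -> float:
    """log N(x; x_hat, I), i.e. a decoder with unit variance (assumption)."""
    sq_err = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return -0.5 * sq_err - 0.5 * len(x) * math.log(2.0 * math.pi)


def vae_objective(
    x: Sequence[float],
    x_hat: Sequence[float],
    mu: Sequence[float],
    log_var: Sequence[float],
) -> float:
    """Single-sample estimate of the objective (to be maximized).

    x_hat is the decoder output for one latent sample drawn from the
    encoder distribution, conditioned on the speaker representation.
    """
    return -kl_to_standard_normal(mu, log_var) + gaussian_log_likelihood(x, x_hat)


# An encoder output equal to the prior has zero KL regularization cost.
print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))  # 0.0
```

Maximizing this quantity trades off keeping the latent factors close to the prior against reconstructing the input speech params. accurately.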

[Figure: training and conversion with the conventional VAE. The encoder maps input speech params. to latent factors; the decoder generates speech params. from the latent factors and the speaker codes s [2].]

[Figure: MCD [dB] (axis range 6.0 to 9.0; lower is better) for one-to-one and many-to-many VC, panels (a) M2M and (b) M2F. Horizontal axes: number of training data and number of adaptation data (5, 10, 25, 50, 100, 200).]

*DNN trained w/ parallel data

[Figure: (a) Naturalness, MOS scores, and (b) Speaker similarity, DMOS scores (scale 1.0 to 5.0), for M2M and M2F conversion in the one-to-one and many-to-many settings. Systems: Reference, Conventional, Proposed (spkr. codes), and Proposed (d-vec.). Annotation on the plots: Improved!]

1. SYNOPSIS

Our approaches for non-parallel Voice Conversion (VC) using Variational AutoEncoders (VAEs):

(1) Introduce Phonetic PosteriorGrams (PPGs) for dealing with speech quality degradation.

(2) Extend conventional one-to-one VC to many-to-many VC (any speakers to any other speakers).

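[Figure: proposed VAE-based VC. PPGs p [4], obtained from pre-trained speech recognition (phoneme classes such as a, i, u), serve as latent factors of phonetic contents. The d-vec. [5], obtained from pre-trained speaker verification, serves as latent factors of speakers and can be estimated from adaptation data.]

Even an unknown speaker can be involved in the VC process!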

(M2M: Male-to-Male, M2F: Male-to-Female) Legend: Reference*, Conventional, Proposed (spkr. codes), Proposed (d-vec.)

[Figure: encoder and decoder of the proposed VAE; the encoder is additionally conditioned on the PPGs p.]
