Proposed method - Studies on Spectral Modification in Voice Transformation

4.3.1 Overview of our proposed method

Spectral dynamics

One issue in spectral modification is that unexpected modifications can introduce discontinuities in the spectral parameters. An efficient way to address these discontinuities is to control the spectral dynamics. Spectral dynamics is widely applied in many areas of speech signal processing, such as speech recognition [124], speaker verification [14, 21, 139], speech coding [100], text-to-speech [22, 62, 161], and voice conversion [34, 146, 147].

From an engineering viewpoint, speech is a signal that changes continuously over time.

Spectral dynamics, which refers to the temporal characteristics of spectral information, is a very important feature of speech. It provides most of the information about the phonetic properties of speech sounds (e.g., formant transitions). These temporal correlations can be captured to some extent by augmenting the original set of acoustic features (static features) with dynamic features.

The dynamic features are often referred to as time derivatives or deltas. The simplest way to calculate spectral dynamics is computing the difference between the feature values of two consecutive frames.

$$ \Delta y_t = y_{t+1} - y_t \tag{4.1} $$

where $y_t$ is the spectral feature of frame $t$, such as MFCC, LSF, or LPC. We can also calculate the spectral dynamics over several frames as follows.

$$ \Delta y_t = y_{t+U} - y_{t-U} \tag{4.2} $$

where U typically takes a value of 1 or 2 (look forward and backward one or two frames).
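As a minimal sketch of Eqs. (4.1) and (4.2), the two difference-based deltas can be written in plain Python. The function names and the edge handling (clamping the frame index at the segment boundaries) are assumptions made for illustration; the text does not specify how boundary frames are treated.

```python
def delta_forward(frames):
    """Eq. (4.1): difference between consecutive frames.

    frames is a list of feature vectors (lists of floats).
    Returns len(frames) - 1 delta vectors, since the last frame has no successor.
    """
    return [[b - a for a, b in zip(cur, nxt)]
            for cur, nxt in zip(frames, frames[1:])]


def delta_symmetric(frames, U=1):
    """Eq. (4.2): y[t+U] - y[t-U] for every frame t.

    Assumption: indices outside the segment are clamped to the first/last frame.
    """
    T = len(frames)
    out = []
    for t in range(T):
        fwd = frames[min(t + U, T - 1)]
        bwd = frames[max(t - U, 0)]
        out.append([f - b for f, b in zip(fwd, bwd)])
    return out


# Toy two-dimensional feature sequence (illustrative values only).
frames = [[1.0, 10.0], [2.0, 12.0], [4.0, 11.0], [7.0, 9.0]]
print(delta_forward(frames))
print(delta_symmetric(frames, U=1))
```

For interior frames the symmetric form spans 2U frames, so its values are not normalized per frame; the regression form in Eq. (4.3) includes such a normalization.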

Although time difference features have been used successfully in many systems, they are sensitive to random fluctuations in the original static features, and therefore tend to be “noisy”. In [124], a more robust measure of local change is obtained by applying linear regression over a sequence of frames as follows.

$$ \Delta y_t = \frac{\sum_{i=1}^{D} i \, (y_{t+i} - y_{t-i})}{2 \sum_{i=1}^{D} i^2} \tag{4.3} $$

The delta features described in Eqs. (4.1), (4.2), and (4.3) are the first-order time derivatives. We can in turn calculate the second-order time derivatives $\Delta\Delta y_t$ (referred to as delta-deltas) from the first-order time derivatives.
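The regression-based delta of Eq. (4.3) and the delta-delta derived from it can be sketched as follows. This is an illustrative implementation, not the one used in [124]: D is taken to be the number of frames on each side of the window, boundary frames are handled by index clamping, and the delta-delta reuses the same window; all three choices are assumptions.

```python
def delta_regression(frames, D=2):
    """Eq. (4.3): linear-regression delta over a window of +/- D frames.

    Assumption: indices outside the segment are clamped to the first/last frame.
    """
    T = len(frames)
    denom = 2.0 * sum(i * i for i in range(1, D + 1))
    out = []
    for t in range(T):
        acc = [0.0] * len(frames[t])
        for i in range(1, D + 1):
            fwd = frames[min(t + i, T - 1)]
            bwd = frames[max(t - i, 0)]
            for p in range(len(acc)):
                acc[p] += i * (fwd[p] - bwd[p])
        out.append([a / denom for a in acc])
    return out


def delta_delta(frames, D=2):
    """Second-order dynamics: apply the first-order operator to the deltas."""
    return delta_regression(delta_regression(frames, D), D)


# A linear ramp y_t = 2t: away from the edges the delta equals the slope
# and the delta-delta vanishes.
ramp = [[2.0 * t] for t in range(7)]
print(delta_regression(ramp)[3])
print(delta_delta(ramp)[3])
```

On a linear ramp the regression delta recovers the slope per frame for interior frames, which is the sense in which Eq. (4.3) approximates a time derivative.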

TD as a framework for modeling spectral dynamics

The modeling of spectral dynamics has attracted scientists and engineers in the area of speech signal processing. There are two compelling reasons for carrying out dynamic speech modeling. First, mathematical modeling of speech dynamics provides an effective tool for the scientific study of the speech chain. Second, human language technology, especially automatic recognition of natural-style human speech, is expected to benefit from comprehensive computational modeling of speech dynamics.

In the literature, the hidden Markov model (HMM) is well known as a typical model of spectral dynamics. An HMM can be used to represent a given speech segment in a stochastic manner. However, an HMM requires a large database to model the spectral dynamics. This requirement is not suitable for our application, since we have only limited data.

In 1983, Atal proposed a method based on the temporal decomposition of speech into a sequence of overlapping event functions and corresponding event targets [5]. Temporal decomposition (TD) can be seen as a model of speech spectral evolution in which a sequence of spectral parameters is described as a linear combination of a limited set of vectors (event targets). The event functions between two event targets can be seen as an interpolation of these targets, and thus model the transitions between successive sounds. In a TD process, we estimate the event targets and event functions from the speech segment itself, so no training data are needed. Therefore, in this dissertation, we employ TD [5, 113] to control the spectral dynamics. In the following sections, we describe our proposed method based on TD to reduce the mismatch at concatenation points.
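The linear-combination view of TD described above can be illustrated with a short sketch: each frame is reconstructed as the sum of the event targets weighted by the event functions at that frame. The triangular event functions used here are a hypothetical stand-in chosen only to make the example runnable; in an actual TD analysis the event functions are estimated from the speech segment.

```python
def td_synthesize(targets, event_functions):
    """Reconstruct frames as y[n][p] = sum_k event_functions[k][n] * targets[k][p].

    targets: K event targets, each a vector of length P.
    event_functions: K event functions, each sampled at N frames.
    """
    K, N, P = len(targets), len(event_functions[0]), len(targets[0])
    return [[sum(event_functions[k][n] * targets[k][p] for k in range(K))
             for p in range(P)]
            for n in range(N)]


def triangular_events(centers, n_frames):
    """Hypothetical event functions: linear ramps between event locations.

    At most two adjacent functions overlap, and they sum to one at every frame.
    """
    K = len(centers)
    phi = [[0.0] * n_frames for _ in range(K)]
    for n in range(n_frames):
        if n <= centers[0]:
            phi[0][n] = 1.0
        elif n >= centers[-1]:
            phi[-1][n] = 1.0
        else:
            for k in range(K - 1):
                left, right = centers[k], centers[k + 1]
                if left <= n < right:
                    w = (n - left) / (right - left)
                    phi[k][n] = 1.0 - w
                    phi[k + 1][n] = w
                    break
    return phi


# Three toy event targets (P = 2) located at frames 0, 4, and 8.
targets = [[0.0, 1.0], [2.0, 3.0], [4.0, 1.0]]
phi = triangular_events([0, 4, 8], n_frames=9)
frames_hat = td_synthesize(targets, phi)
print(frames_hat[2], frames_hat[4], frames_hat[6])
```

At an event location the reconstructed frame equals the event target, and between two locations it is an interpolation of the neighbouring targets, which is the property the proposed method exploits.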

4.3.2 Proposed method

Since controlling spectral dynamics can improve the quality of concatenated speech, we propose a new method for concatenative speech synthesis based on temporal decomposition [5, 113]. Our algorithm is described below and shown in Figure 4.1.

First, STRAIGHT [65] decomposes each speech segment into spectral envelopes, F0 (fundamental frequency) information, and aperiodicity indices. The spectral envelopes are further analyzed into LSF parameters, and MRTD is then employed to decompose the LSF parameters of each speech segment into event targets and event functions. The same event functions estimated for the LSF parameters are used to decompose the fundamental frequency and gain, yielding fundamental frequency targets and gain targets.
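One plausible way to obtain F0 or gain targets from event functions that are already fixed is an ordinary least-squares fit, sketched below. The text does not state which estimator is used, so the least-squares criterion, the function names, and the toy values are all assumptions; unvoiced frames, where F0 is undefined, are not handled.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def estimate_targets(track, event_functions):
    """Least-squares targets a[k] minimizing
    sum_n (track[n] - sum_k event_functions[k][n] * a[k])**2.
    """
    K, N = len(event_functions), len(track)
    gram = [[sum(event_functions[i][n] * event_functions[j][n] for n in range(N))
             for j in range(K)] for i in range(K)]
    rhs = [sum(event_functions[i][n] * track[n] for n in range(N))
           for i in range(K)]
    return solve_linear(gram, rhs)


# Event functions assumed to come from the LSF analysis (toy values).
phi = [[1.0, 0.5, 0.0, 0.0, 0.0],
       [0.0, 0.5, 1.0, 0.5, 0.0],
       [0.0, 0.0, 0.0, 0.5, 1.0]]
f0_track = [120.0, 135.0, 150.0, 140.0, 130.0]  # Hz, toy contour
f0_targets = estimate_targets(f0_track, phi)
print(f0_targets)
```

Because the event functions are shared, the F0 and gain targets are aligned in time with the LSF event targets, so all three can be modified consistently at a concatenation point.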

In the ideal case, the last target of the first speech segment and the first target of the second speech segment are identical. In concatenative speech synthesis, however, these two event targets often differ, so we need to modify them to smooth the transition between the two speech segments. Since each event target is a valid LSF vector, the modification must also yield valid LSFs. In our algorithm, the modified event target is calculated by the following equation.

$$ \mathrm{LSF}_i^{\mathrm{modified}} = \beta \, \mathrm{LSF}_i^{\mathrm{last\,ET}} + (1 - \beta) \, \mathrm{LSF}_i^{\mathrm{first\,ET}} \tag{4.4} $$

Figure 4.1: Diagram of our proposed method. Speech units 1 and 2 pass through STRAIGHT analysis (spectral information, F0, AP), LSF analysis (LSF and gain parameters), and TD analysis (event functions; targets of event, F0, and gain). After modification of the targets, TD synthesis, LSF synthesis, and STRAIGHT synthesis produce the speech signal.

where $i = 1, \ldots, P$ and $P$ is the order of the LSF. $\mathrm{LSF}^{\mathrm{last\,ET}}$ and $\mathrm{LSF}^{\mathrm{first\,ET}}$ are the LSF parameters of the last event target of the first speech segment and the first event target of the second speech segment, respectively. $\beta$ is the weight factor and satisfies $0 \le \beta \le 1$. We can adjust the value of $\beta$ to control how strongly each side of the concatenation is modified, in accordance with its importance.

After combining the last event target of the first speech segment with the first event target of the second speech segment, we also modify the fundamental frequency targets and gain targets, so that all of the most important parameters are smoothed at the concatenation point. The modified event targets, fundamental frequency targets, and gain targets are then re-synthesized by TD synthesis into modified LSFs, modified fundamental frequency information, and modified gain information, respectively. In the next step, the modified LSF parameters and modified gain information are converted into spectral envelopes by LSF synthesis. Finally, STRAIGHT synthesis is employed to output the synthesized speech. Note that when we modify these targets, the spectral and source information of the frames adjacent to the concatenation point is also modified, and the smoothness is ensured by the shape of the event functions.
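The target modification of Eq. (4.4) can be sketched as below, together with a check that the result is still a valid LSF vector. The validity test assumes LSFs expressed in radians, strictly increasing within (0, π); the function names and numeric values are illustrative and not taken from the experiments. Because Eq. (4.4) is a convex combination, the gap between adjacent coefficients of the result is a weighted average of two positive gaps, so the ordering of the two input vectors is preserved.

```python
import math


def blend_targets(last_et, first_et, beta):
    """Eq. (4.4): beta * last_et + (1 - beta) * first_et, element-wise."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must satisfy 0 <= beta <= 1")
    return [beta * a + (1.0 - beta) * b for a, b in zip(last_et, first_et)]


def is_valid_lsf(lsf, upper=math.pi):
    """Assumed validity test: strictly increasing values inside (0, upper)."""
    inside = all(0.0 < x < upper for x in lsf)
    ordered = all(a < b for a, b in zip(lsf, lsf[1:]))
    return inside and ordered


# Toy event targets on either side of a concatenation point (P = 4).
last_et = [0.30, 0.80, 1.50, 2.40]   # last event target of the first segment
first_et = [0.40, 0.90, 1.70, 2.60]  # first event target of the second segment

modified = blend_targets(last_et, first_et, beta=0.5)
print(modified, is_valid_lsf(modified))
```

The same blend can be applied to the scalar F0 and gain targets at the concatenation point; with beta = 1 the target of the first segment is kept unchanged, and with beta = 0 the target of the second segment is kept.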

Figure 4.2: Results of subjective tests of concatenative speech synthesis.
