4.3.1 Overview of our proposed method
Spectral dynamics
One issue in spectral modification is that unexpected modifications can introduce discontinuities in the spectral parameters. An effective way to avoid these discontinuities is to control the spectral dynamics. Spectral dynamics is widely applied in speech signal processing, for example in speech recognition [124], speaker verification [14, 21, 139], speech coding [100], text-to-speech [22, 62, 161], and voice conversion [34, 146, 147].
From an engineering viewpoint, speech is a signal that changes continuously over time.
Spectral dynamics, which refers to the temporal characteristics of spectral information, is a very important feature of speech. It provides most of the information about the phonetic properties of speech sounds (e.g., formant transitions). These temporal correlations can be captured to some extent by augmenting the original set of acoustic features (static features) with dynamic features.
The dynamic features are often referred to as time derivatives or deltas. The simplest way to calculate spectral dynamics is to compute the difference between the feature values of two consecutive frames:
$$\Delta y_t = y_{t+1} - y_t \tag{4.1}$$
where $y_t$ is the spectral feature vector of frame $t$, such as MFCC, LSF, or LPC coefficients. We can also calculate the spectral dynamics over several frames as follows.
$$\Delta y_t = y_{t+U} - y_{t-U} \tag{4.2}$$
where U typically takes a value of 1 or 2 (look forward and backward one or two frames).
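As an illustration of Eqs. (4.1) and (4.2), the following minimal Python sketch computes difference-based deltas for a sequence of feature vectors. The function names and the edge handling (clamping frame indices at the sequence boundaries) are assumptions made for this sketch; the text does not specify how boundary frames are treated.

```python
def forward_deltas(frames):
    """Eq. (4.1): delta[t] = y[t+1] - y[t].

    frames: list of feature vectors (lists of floats), one per frame.
    Assumption: the last frame reuses itself as its successor,
    so its delta is zero.
    """
    T = len(frames)
    deltas = []
    for t in range(T):
        nxt = frames[min(t + 1, T - 1)]
        deltas.append([a - b for a, b in zip(nxt, frames[t])])
    return deltas


def difference_deltas(frames, U=1):
    """Eq. (4.2): delta[t] = y[t+U] - y[t-U].

    Assumption: indices outside the sequence are clamped to the
    first or last frame.
    """
    T = len(frames)
    deltas = []
    for t in range(T):
        ahead = frames[min(t + U, T - 1)]
        behind = frames[max(t - U, 0)]
        deltas.append([a - b for a, b in zip(ahead, behind)])
    return deltas
```

Each frame is treated as a vector, so the same code applies whether the static features are MFCCs, LSFs, or LPC coefficients.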
Although time difference features have been used successfully in many systems, they are sensitive to random fluctuations in the original static features, and therefore tend to be “noisy”. In [124], a more robust measure of local change is obtained by applying linear regression over a sequence of frames as follows.
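$$\Delta y_t = \frac{\sum_{i=1}^{D} i\,(y_{t+i} - y_{t-i})}{2 \sum_{i=1}^{D} i^{2}} \tag{4.3}$$
where $D$ is the number of frames on each side of frame $t$ that are included in the regression.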
The delta features described in Eqs. (4.1), (4.2), and (4.3) are the first-order time derivatives. We can in turn calculate the second-order time derivatives $\Delta\Delta y_t$ (referred to as delta-deltas) from the first-order time derivatives.
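The regression-based delta of Eq. (4.3) and the delta-deltas derived from it can be sketched in Python as below. This is an illustrative sketch rather than the implementation used in [124]: the function names are hypothetical, and clamping indices at the sequence boundaries is an assumption, since the text does not state how edge frames are handled.

```python
def regression_deltas(frames, D=2):
    """Eq. (4.3): linear-regression delta over D frames on each side.

    frames: list of feature vectors (lists of floats), one per frame.
    Assumption: indices outside the sequence are clamped to the
    first or last frame.
    """
    T = len(frames)
    denom = 2.0 * sum(i * i for i in range(1, D + 1))
    deltas = []
    for t in range(T):
        acc = [0.0] * len(frames[t])
        for i in range(1, D + 1):
            ahead = frames[min(t + i, T - 1)]
            behind = frames[max(t - i, 0)]
            for k in range(len(acc)):
                acc[k] += i * (ahead[k] - behind[k])
        deltas.append([v / denom for v in acc])
    return deltas


def delta_deltas(frames, D=2):
    """Second-order derivatives: apply the delta operator to the deltas."""
    return regression_deltas(regression_deltas(frames, D), D)
```

For a feature that increases by one unit per frame, the interior deltas equal 1 and the interior delta-deltas equal 0, which is a quick sanity check on the normalization term in the denominator.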
TD as a framework for modeling spectral dynamics
The modeling of spectral dynamics has attracted scientists and engineers in the area of speech signal processing. There are two compelling reasons for carrying out dynamic speech modeling. First, mathematical modeling of speech dynamics provides an effective tool for the scientific study of the speech chain. Second, the advancement of human language technology, especially automatic recognition of natural-style human speech, is also expected to benefit from comprehensive computational modeling of speech dynamics.
In the literature, the hidden Markov model (HMM) is well known as a typical model of spectral dynamics. An HMM can be used to represent a given speech segment in a stochastic manner. However, an HMM requires a large database to model the spectral dynamics. This requirement makes it unsuitable for this application, since we have only limited data.
In 1983, Atal proposed a method based on the temporal decomposition of speech into a sequence of overlapping event functions and corresponding event targets [5]. Temporal decomposition (TD) can be seen as a model of speech spectral evolution in which a sequence of spectral parameters is described as a linear combination of a limited set of vectors (event targets). The event function between two event targets can be seen as an interpolation between these targets, and is a way to model transitions between successive sounds. In a TD process, we estimate the event targets and event functions from the speech segment itself, so no training data is needed. Therefore, in this dissertation, we employ TD [5, 113] to control the spectral dynamics. In the following sections, we describe our proposed method based on TD to reduce the mismatch at concatenation points.
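The linear-combination view of TD can be illustrated with a small Python sketch that rebuilds a parameter trajectory from event targets and event functions. The sketch covers only the synthesis direction; estimating targets and event functions from speech is the task of the TD algorithms in [5, 113] and is not shown. The helper linear_event_functions is a hypothetical stand-in that uses straight-line ramps between event locations, whereas real event functions are estimated from the data.

```python
def linear_event_functions(locations, num_frames):
    """Hypothetical event functions: linear ramps between event locations.

    locations: increasing frame indices, one per event target.
    Returns K lists of length num_frames. At every frame at most two
    adjacent event functions are nonzero and they sum to one.
    """
    K = len(locations)
    phi = [[0.0] * num_frames for _ in range(K)]
    for n in range(num_frames):
        if n <= locations[0]:
            phi[0][n] = 1.0
        elif n >= locations[-1]:
            phi[-1][n] = 1.0
        else:
            for k in range(K - 1):
                left, right = locations[k], locations[k + 1]
                if left <= n < right:
                    w = (n - left) / (right - left)
                    phi[k][n] = 1.0 - w
                    phi[k + 1][n] = w
                    break
    return phi


def td_reconstruct(targets, event_functions):
    """Rebuild frames as y[n] = sum_k event_functions[k][n] * targets[k]."""
    K = len(targets)
    P = len(targets[0])
    num_frames = len(event_functions[0])
    return [
        [sum(event_functions[k][n] * targets[k][p] for k in range(K))
         for p in range(P)]
        for n in range(num_frames)
    ]
```

With two targets located at frames 0 and 4, the reconstructed frames move from the first target to the second, which is the interpolation behaviour described above.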
4.3.2 Proposed method
Since controlling spectral dynamics can improve the quality of concatenated speech, we propose a new method for concatenative speech synthesis based on temporal decomposition [5, 113]. Our algorithm is described below and is shown in Figure 4.1.
First, STRAIGHT [65] decomposes each speech segment into spectral envelopes, F0 (fundamental frequency) information, and aperiodic indices. Since the spectral envelopes can be further analyzed into LSF parameters, MRTD is employed in the next step to decompose the LSF parameters of each speech segment into event targets and event functions. The same event functions evaluated for the LSF parameters are then used to decompose the fundamental frequency and gain, yielding fundamental frequency targets and gain targets.
In the ideal case, the last target of the first speech segment and the first target of the second speech segment are identical. However, in concatenative speech synthesis, these two event targets often differ. We need to modify them to smooth the transition between the two speech segments. Since each event target is a valid LSF vector, the modification should yield targets that are also valid LSFs. In our algorithm, the modified event target is calculated by the following equation.
$$\mathrm{LSF}_i^{\mathrm{modified}} = \beta\,\mathrm{LSF}_i^{\mathrm{last\ ET}} + (1-\beta)\,\mathrm{LSF}_i^{\mathrm{first\ ET}} \tag{4.4}$$
[Figure 4.1 block labels. Analysis of speech units 1 and 2: STRAIGHT analysis (spectral information, F0, AP), LSF analysis (LSF & gain parameters), TD analysis (targets of event, F0, gain; event functions). Modification of targets (modified targets of event, F0, gain). Synthesis: TD synthesis (LSF & gain parameters, F0), LSF synthesis (spectral information), STRAIGHT synthesis (speech signal).]
Figure 4.1: Diagram of our proposed method.
where $i = 1, \ldots, P$, and $P$ is the LSF order. $\mathrm{LSF}^{\mathrm{last\ ET}}$ and $\mathrm{LSF}^{\mathrm{first\ ET}}$ are the LSF parameters of the last event target of the first speech segment and the first event target of the second speech segment, respectively. $\beta$ is the weight factor, which satisfies $0 \le \beta \le 1$. We can adjust the value of $\beta$ to control how strongly each side of the concatenation is modified, according to its importance. After combining the last event target of the first speech segment and the first event target of the second speech segment, we also modify the fundamental frequency targets and gain targets, so that all of the most important parameters are smoothed at the concatenation point.
The modified event targets, modified fundamental frequency targets, and modified gain targets are then re-synthesized by TD synthesis as modified LSFs, modified fundamental frequency information, and modified gain information, respectively. In the next step, the modified LSF parameters and modified gain information are converted to spectral envelopes by LSF synthesis. Finally, STRAIGHT synthesis is employed to output the synthesized speech. Note that when we modify these targets, the spectral and source information of the frames adjacent to the concatenation point is also modified, and the smoothness is ensured by the shape of the event functions.
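The target modification of Eq. (4.4) can be sketched in Python as follows. The function names are hypothetical, and the validity check assumes LSFs expressed in radians, strictly increasing within (0, π), which is the usual ordering condition for LSFs rather than something stated in the text. Under that assumption, a weighted average of two ordered LSF vectors with 0 ≤ β ≤ 1 is itself ordered, which is consistent with the requirement that the modified targets be valid LSFs.

```python
import math


def blend_targets(last_et, first_et, beta):
    """Eq. (4.4): beta * last_et + (1 - beta) * first_et, element-wise.

    last_et: last event target of the first speech segment.
    first_et: first event target of the second speech segment.
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must satisfy 0 <= beta <= 1")
    if len(last_et) != len(first_et):
        raise ValueError("targets must have the same LSF order")
    return [beta * a + (1.0 - beta) * b for a, b in zip(last_et, first_et)]


def is_ordered_lsf(lsf):
    """Assumed validity check: 0 < lsf[0] < ... < lsf[-1] < pi (radians)."""
    bounded = 0.0 < lsf[0] and lsf[-1] < math.pi
    increasing = all(x < y for x, y in zip(lsf, lsf[1:]))
    return bounded and increasing
```

Setting β = 1 keeps the last target of the first segment unchanged, β = 0 keeps the first target of the second segment, and intermediate values move the shared target between the two.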
Figure 4.2: Results of subjective tests of concatenative speech synthesis.