Further Research Directions

speech, called “Limited Error Based Event Localizing Temporal Decomposition” (LEBEL-TD). In previous work, TD analysis was usually performed on speech segments of about 200-300 ms or more, making it impractical for real-time applications. In the present work, event localization is based on a limited error criterion and a local optimization strategy, which results in an average algorithmic delay of 75 ms. Simulation results showed that an average log spectral distortion of about 1.5 dB is achievable at an event rate of 20 events/sec. Also, LEBEL-TD uses neither the computationally costly singular value decomposition routine nor the event refinement process, which significantly reduces the computational cost of TD. Furthermore, a method of variable-rate speech coding at average rates around 1.8 kbps, based on STRAIGHT and using LEBEL-TD, was also presented in the chapter. Subjective test results indicated that the speech quality obtained with the proposed coding method is comparable to that of the 4.8 kbps FS-1016 CELP coder. The differences between LEBEL-TD and conventional TD, as well as the conventional interpolation method, were also pointed out.
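As a reminder of what the distortion figure measures, the following is a minimal sketch of one common way to compute log spectral distortion: the root-mean-square difference, in dB, between an original and a reconstructed power spectrum, averaged over frames. The function names and the uniform frequency grid are assumptions made for illustration; the exact definition used in the experiments may differ.

```python
import math


def log_spectral_distortion_db(ref_power, test_power):
    """RMS difference, in dB, between two power spectra on the same frequency grid."""
    if len(ref_power) != len(test_power):
        raise ValueError("spectra must have the same number of bins")
    squared_diffs = [
        (10.0 * math.log10(r) - 10.0 * math.log10(t)) ** 2
        for r, t in zip(ref_power, test_power)
    ]
    return math.sqrt(sum(squared_diffs) / len(squared_diffs))


def average_distortion_db(ref_frames, test_frames):
    """Mean of the per-frame distortions over a whole utterance."""
    per_frame = [
        log_spectral_distortion_db(r, t) for r, t in zip(ref_frames, test_frames)
    ]
    return sum(per_frame) / len(per_frame)


# Toy spectra (hypothetical values): the second frame is off by a factor of 10
# in power in every bin, i.e. by 10 dB.
reference = [[1.0, 10.0, 100.0], [1.0, 10.0, 100.0]]
decoded = [[1.0, 10.0, 100.0], [10.0, 100.0, 1000.0]]
same_frame_sd = log_spectral_distortion_db(reference[0], decoded[0])
shifted_frame_sd = log_spectral_distortion_db(reference[1], decoded[1])
utterance_sd = average_distortion_db(reference, decoded)
```
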

In summary, we have proposed two efficient algorithms for TD of LSF parameters, MRTD and LEBEL-TD, and investigated their applications in speech coding and speaker recognition. Beyond these applications, their geometric interpretation as effective breakpoint analysis procedures provides a means of speech segmentation and speech recognition.

The fact that the event targets extracted by MRTD can convey speaker identity, together with the localized nature of each LSF, provides the motivation to investigate the application of MRTD in voice conversion. More interestingly, using MRTD we can control spectral envelopes, durations, and fundamental frequencies independently and flexibly, which suggests potential applications in emotional speech, song synthesis, text-to-speech synthesis, etc. To prepare for future research towards this end, we have developed a voice transformation system based on the modification of formants in the LSF domain.

Additionally, during this thesis investigation, we have also studied the S²BEL-TD algorithm and proposed improvements to this method. In Appendix A, a mathematical proof of the convergence property of the iterative refinement procedure in S²BEL-TD is presented. Also, some modifications to the original S²BEL-TD method that help accelerate the convergence of the iterative refinement process are introduced. Experimental results showed that the modified S²BEL-TD gave slightly better performance in terms of log spectral distortion than the original one while requiring fewer iterations.

2. Further work on speaker recognition: The event targets extracted from LSF parameters using the MRTD technique were found to be effective when applied in VQ-based speaker identification systems as a set of feature vectors. However, the identification performance was evaluated on clean speech only. The use of event targets as a feature set in speaker verification and other speaker identification systems should also be investigated. Additionally, further experiments should be made in more demanding environments, such as noisy speech, speech at different speaking rates, speech with emotion, and cross-language evaluation.

3. Development of a voice conversion system: A complete voice conversion system using TD of LSF parameters should be improved and further developed. The event targets were found to convey information about the identity of a speaker. It is of interest to investigate whether the exchange of event and excitation targets in some way would lead to a method of voice conversion.

4. Speech segmentation: The two proposed TD methods, MRTD and LEBEL-TD, should be shown experimentally to give better performance in terms of segmental relevance. This is expected because both methods use a new determination of event functions that was claimed to describe the temporal structure of speech more effectively when TD is interpreted as a breakpoint analysis procedure. This claim, however, should be verified by experiments.

5. Speech recognition: There have been quite a few studies focusing on the application of temporal decomposition in speech recognition. Bimbot et al. [17] reported the results of a preliminary recognition experiment on a small corpus of continuously spelled French surnames. In the training phase, event targets are automatically extracted and manually labelled. In the recognition phase, a lattice of the three best candidate phonemes is obtained and searched, taking into account the lexical constraints of the French alphabet. They claimed a recognition score of 70% at the letter level. Although this identification score may not seem very high, it should be noted that, unlike many other recognition approaches, the number of words to be recognized is not restricted. Niranjan and Fallside [106] suggested connecting temporal decomposition with hidden Markov modeling (HMM). Kim [73] built a simple word recognizer that uses coded speech as input. It is of interest to investigate whether MRTD and LEBEL-TD can be effectively applied in speech recognition.

6. Incorporation of TD and ICA: The formulation of temporal decomposition is very similar to that of independent component analysis (ICA). It would be interesting to investigate whether a combination of these two techniques would lead to better results.

7. Others: Temporal decomposition of LSF parameters has many advantages that could make it useful in voice transformation, speaker individuality, emotional speech, song synthesis, text-to-speech synthesis, etc. It would be interesting to investigate the usefulness of MRTD and LEBEL-TD in these applications.

Appendix A

Convergence Property of the Iterative Refinement Procedure in the S²BEL-TD Method

The original method of temporal decomposition (TD) proposed by Atal [8] is known to have two major drawbacks: high computational cost, and high sensitivity of the number and locations of events to the analysis parameters. Spectral Stability Based Event Localizing Temporal Decomposition (S²BEL-TD) [103] has been proposed to overcome these drawbacks of Atal’s method. To this end, S²BEL-TD implements TD in a mathematically simpler way, i.e., by avoiding the singular value decomposition (SVD) routine, while adopting a maximum spectral stability criterion to determine the number and locations of the events, which avoids the need for redundant evaluation of event functions.

As already described in Section 3.3, S²BEL-TD determines the event targets, a_k, and event functions, φ_k(n), once the spectral parameters, y(n), of a speech segment are given. This method is based on the assumption that each acoustic event in speech gives rise to a spectrally stable point in its neighborhood. Therefore, the locations of the spectrally stable points and the corresponding spectral parameter vectors can be used as good approximations to the event locations and event targets, respectively. Given these locations, the subsequent computation of refined event targets and event functions is much less demanding than in the traditional TD method. This also makes the number and locations of the events less dependent on the analysis parameters. Following the first approximation of event targets, an iterative refinement procedure is required to shape the event functions and to refine the event targets. This procedure aims at improving the reconstruction accuracy of the TD results.
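The role of the two quantities can be illustrated with a small sketch. It assumes the usual TD synthesis model, in which each frame of spectral parameters is approximated by a sum of event targets weighted by their event functions, ŷ(n) = Σ_k a_k φ_k(n). The function name `reconstruct` and the toy values are hypothetical; real event targets would be LSF vectors taken at the spectrally stable points.

```python
def reconstruct(targets, event_functions):
    """Approximate each frame as sum_k a_k * phi_k(n).

    targets:         K event targets, each a list of P spectral parameters
    event_functions: K event functions, each a list of N per-frame weights
    returns:         N frames, each a list of P reconstructed parameters
    """
    n_frames = len(event_functions[0])
    dim = len(targets[0])
    return [
        [
            sum(a[p] * phi[n] for a, phi in zip(targets, event_functions))
            for p in range(dim)
        ]
        for n in range(n_frames)
    ]


# Two hypothetical 2-dimensional targets and a linear cross-fade between them.
toy_targets = [[1.0, 0.0], [0.0, 1.0]]
toy_functions = [[1.0, 0.5, 0.0], [0.0, 0.5, 1.0]]
frames = reconstruct(toy_targets, toy_functions)
```

The first and last frames coincide with the two targets, and the middle frame is their equal mixture, which is the sense in which the targets act as breakpoints of the parameter trajectory.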

In S²BEL-TD, however, the convergence property of the iterative refinement procedure has not been mathematically established. This appendix proposes a new criterion for the termination of iterations as well as a mathematical proof for the convergence of iterations in that procedure. Also, some modifications are made to the original S²BEL-TD method to improve the robustness of its iterative refinement procedure in this respect. Experimental results confirm that the S²BEL-TD method can work well with these modifications.

A.1 Iterative Refinement Procedure in S²BEL-TD

As mentioned earlier, 4 to 5 iterations are generally required for the iterative refinement procedure described in Section 3.3.3 to shape the event functions. However, this figure is merely empirical, and nothing guarantees that the procedure can be terminated at that point. In other words, we cannot ensure the convergence of the iterative refinement procedure adopted in S²BEL-TD. In this section, we propose the following modifications to the original S²BEL-TD method. Firstly, the order of the refinement process is changed: the refinement of event targets is carried out before that of event functions. Secondly, the minor lobes of those event functions that are used for the recalculation of event targets are truncated before use in order to accelerate the convergence. Lastly, a new termination criterion is established. With these modifications, a mathematical proof of the convergence property of the iterative refinement procedure adopted in S²BEL-TD has been obtained. It is shown that the performance of the modified method is comparable to that of the original one while requiring fewer iterations.
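The general shape of such a procedure can be sketched as follows. This is not the S²BEL-TD refinement itself: it is a plain, unconstrained alternating least-squares loop, used here only to illustrate the "targets first, then event functions" ordering and an error-based stopping rule. The relative-decrease tolerance `tol`, the rectangular initial event functions, and all helper names are assumptions, and minor-lobe truncation is not modeled. Because each half-step is an exact least-squares solve, the squared reconstruction error cannot increase from one iteration to the next and is bounded below by zero, which is the kind of property a convergence argument can rest on.

```python
import math


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def solve(m, rhs):
    """Solve m @ x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(m)
    aug = [list(m[i]) + list(rhs[i]) for i in range(n)]
    for c in range(n):
        pivot_row = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[pivot_row] = aug[pivot_row], aug[c]
        pivot = aug[c][c]
        aug[c] = [v / pivot for v in aug[c]]
        for r in range(n):
            if r != c:
                factor = aug[r][c]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]


def squared_error(y, a, phi):
    approx = matmul(a, phi)
    return sum((v - w) ** 2 for ry, ra in zip(y, approx) for v, w in zip(ry, ra))


def refine(y, phi, tol=1e-6, max_iter=50):
    """y: P x N parameters, phi: K x N initial event functions (hypothetical)."""
    errors = []
    for _ in range(max_iter):
        # Targets first: least-squares A for fixed phi, (phi phi^T) A^T = phi y^T.
        a = transpose(solve(matmul(phi, transpose(phi)), matmul(phi, transpose(y))))
        # Then event functions for fixed A: (A^T A) phi = A^T y.
        at = transpose(a)
        phi = solve(matmul(at, a), matmul(at, y))
        errors.append(squared_error(y, a, phi))
        # Assumed stopping rule: relative error decrease below tol.
        if len(errors) > 1 and errors[-2] - errors[-1] <= tol * errors[-2]:
            break
    return a, phi, errors


# Toy data: two events cross-fading over 8 frames, plus a small perturbation.
true_targets = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0]]  # P = 3 rows, K = 2 columns
true_phi = [[1 - n / 7 for n in range(8)], [n / 7 for n in range(8)]]
y = [
    [v + 0.05 * math.sin(3 * p + 2 * n) for n, v in enumerate(row)]
    for p, row in enumerate(matmul(true_targets, true_phi))
]
phi0 = [[1.0] * 4 + [0.0] * 4, [0.0] * 4 + [1.0] * 4]
a, phi, errors = refine(y, phi0)
```

The list `errors` records the squared reconstruction error after each full iteration, so the number of iterations actually needed can be read from its length.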

Source document: JAIST Repository: A Study on Efficient Algorithms for Temporal Decomposition of Speech (pages 133-136)
