speech, called “Limited Error Based Event Localizing Temporal Decomposition” (LEBEL-TD). In previous work with TD, the analysis was usually performed on speech segments of about 200-300 ms or more, making it impractical for real-time applications. In the present work, event localization is determined by a limited error criterion and a local optimization strategy, which results in an average algorithmic delay of 75 ms. Simulation results showed that an average log spectral distortion of about 1.5 dB is achievable at an event rate of 20 events/sec. Also, LEBEL-TD uses neither the computationally costly singular value decomposition routine nor the event refinement process, thus significantly reducing the computational cost of TD. Furthermore, a method of variable-rate speech coding at average rates around 1.8 kbps based on STRAIGHT using LEBEL-TD was also presented in the chapter. Subjective test results indicated that the speech quality obtained from the proposed speech coding method is comparable to that of the 4.8 kbps FS-1016 CELP coder. Also, the differences between LEBEL-TD and the conventional TD as well as the conventional interpolation method were pointed out.
In summary, we have proposed two efficient algorithms for TD of LSF parameters, MRTD and LEBEL-TD, and investigated their applications in speech coding and speaker recognition. Beyond that, their geometric interpretation as effective breakpoint analysis procedures provides a means of speech segmentation and speech recognition.
The fact that the event targets extracted by MRTD can convey speaker identity and the localized nature of each LSF provide the necessary motivation to investigate the application of MRTD in voice conversion. More interestingly, using MRTD we can control spectral envelopes, durations, and fundamental frequencies independently and flexibly, which suggests its potential applications in emotional speech, song synthesis, text-to-speech synthesis, etc. To prepare for future research towards this end, we have developed a voice transformation system based on the modification of formants in the LSF domain.
Additionally, during this thesis investigation, we have also studied the S2BEL-TD algorithm and proposed improvements to this method. In Appendix A, a mathematical proof of the convergence property of the iterative refinement procedure in S2BEL-TD is presented. Also, some modifications to the original S2BEL-TD method which help accelerate the convergence of the iterative refinement process are introduced. Experimental results showed that the modified S2BEL-TD gave a slightly better performance in terms of log spectral distortion than the original one while requiring fewer iterations.
2. Further work on speaker recognition: The event targets extracted from LSF parameters using the MRTD technique were found to be effective when applied in VQ-based speaker identification systems as a set of feature vectors. However, the identification performance was evaluated on clean speech only. The use of event targets as a feature set in speaker verification and other speaker identification systems should also be investigated. Additionally, further experiments should be made in more demanding environments, such as noisy speech, speech at different speaking rates, speech with emotion, and cross-language evaluation.
3. Development of a voice conversion system: A complete voice conversion system using TD of LSF parameters should be improved and further developed. The event targets were found to convey information about the identity of a speaker. It is of interest to investigate whether the exchange of event and excitation targets in some way would lead to a method of voice conversion.
4. Speech segmentation: The two proposed TD methods, MRTD and LEBEL-TD, should be shown experimentally to give better performance in terms of segmental relevance. This is expected because both methods make use of a new determination of event functions that was claimed to describe the temporal structure of speech more effectively in the interpretation of TD as a breakpoint analysis procedure. This claim, however, should be verified by experiments.
5. Speech recognition: There have been quite a few studies focusing on the application of temporal decomposition in speech recognition. Bimbot et al. [17] reported the results of a preliminary recognition experiment on a small corpus of continuously spelled French surnames. In the training phase, event targets are automatically extracted and manually labelled. In the recognition phase, a lattice of the three best candidate phonemes is obtained and searched through, taking into account the lexical constraints of the French alphabet. They claimed a recognition score of 70% on the letter level. Although this identification score may not seem very high, it should be noted that, unlike many other recognition approaches, the number of words to be recognized is not restricted. Niranjan and Fallside [106] suggested connecting temporal decomposition with Hidden Markov Modeling (HMM). Kim [73] built a simple word recognition system utilizing the coded speech input. It is of interest to investigate whether MRTD and LEBEL-TD can be effectively applied in speech recognition.
6. Incorporation of TD and ICA: The formulation of temporal decomposition is very similar to that of Independent Component Analysis (ICA). It would certainly be interesting to investigate whether a combination of these two techniques would lead to better results.
7. Others: Temporal decomposition of LSF parameters has many properties that make it useful in voice transformation, speaker individuality, emotional speech, song synthesis, text-to-speech synthesis, etc. It would be interesting to investigate the usefulness of MRTD and LEBEL-TD in these applications.
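The similarity between TD and ICA noted in item 6 can be made concrete by writing both models in matrix form. The following is only a notational sketch: the symbols a_k, φ_k(n) and y(n) are those used for event targets, event functions and spectral parameters in this thesis, while the matrix names Y, A, Φ, X, S and the dimensions P (parameter order), K (number of events or sources) and N (number of frames) are assumptions introduced here for illustration.

```latex
\begin{align}
  % TD: every frame is a weighted sum of K event targets
  \mathbf{y}(n) &\approx \sum_{k=1}^{K} \mathbf{a}_k \, \phi_k(n),
      \qquad 1 \le n \le N
  &&\Longleftrightarrow&
  \mathbf{Y}_{P \times N} &\approx \mathbf{A}_{P \times K} \, \boldsymbol{\Phi}_{K \times N} \\
  % Linear, noise-free ICA: observations are a mixture of K sources
  \mathbf{x}(n) &= \sum_{k=1}^{K} \mathbf{a}_k \, s_k(n)
  &&\Longleftrightarrow&
  \mathbf{X}_{P \times N} &= \mathbf{A}_{P \times K} \, \mathbf{S}_{K \times N}
\end{align}
```

Both models factor a data matrix into a product of two unknown matrices. They differ in the constraint used to make the factorization meaningful: TD requires each event function φ_k(n) to be compact in time, whereas ICA requires the rows of S to be statistically independent.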
Appendix A
Convergence Property of the Iterative Refinement Procedure in the S2BEL-TD Method
The original method of temporal decomposition (TD) proposed by Atal [8] is known to have two major drawbacks: high computational cost, and high parameter sensitivity of the number and locations of events. Spectral Stability Based Event Localizing Temporal Decomposition (S2BEL-TD) [103] has been proposed to overcome these drawbacks of Atal’s method. To this end, S2BEL-TD implements TD in a mathematically simpler way, i.e., by avoiding the singular value decomposition (SVD) routine, while adopting a maximum spectral stability criterion to determine the number and locations of the events, which avoids the need for redundant evaluation of event functions.
As already described in Section 3.3, S2BEL-TD determines the event targets, a_k, and event functions, φ_k(n), once the spectral parameters, y(n), of a speech segment are given. This method is based on the assumption that each acoustic event that exists in speech gives rise to a spectrally stable point in its neighborhood. Therefore, the locations of the spectrally stable points and the corresponding spectral parameter vectors can be used as a good approximation to the event locations and event targets, respectively. Given these locations, the subsequent computation of refined event targets and event functions is much less demanding than in the traditional TD method. Also, this makes the number and locations of the events more parameter independent. Following the first approximation of event targets, an iterative refinement procedure is required to shape up the event functions and to refine the event targets. This procedure aims at improving the reconstruction accuracy of TD results.
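The idea of taking spectrally stable points as initial event locations and targets can be illustrated with a minimal sketch. The stability measure below (squared distance between the frames on either side of frame n) is an assumption chosen for illustration and is not necessarily the criterion used in [103]; the function name `stable_points` and the parameter `half_width` are likewise hypothetical.

```python
import numpy as np

def stable_points(y, half_width=1):
    """Pick frames where the spectral parameters change least.

    y: (N, P) array, one row of spectral parameters per frame.
    Returns (locations, targets): frame indices of local minima of the
    transition measure, and the parameter vectors at those frames.
    """
    y = np.asarray(y, dtype=float)
    n_frames = y.shape[0]
    # Assumed transition measure: small values mean a spectrally stable frame.
    d = np.full(n_frames, np.inf)
    for n in range(half_width, n_frames - half_width):
        d[n] = np.sum((y[n + half_width] - y[n - half_width]) ** 2)
    # A frame is taken as stable if the measure has a local minimum there.
    locs = [n for n in range(1, n_frames - 1)
            if np.isfinite(d[n]) and d[n] <= d[n - 1] and d[n] < d[n + 1]]
    locs = np.asarray(locs, dtype=int)
    return locs, y[locs]

if __name__ == "__main__":
    # Synthetic one-dimensional trajectory that is momentarily flat
    # at its peak and at its trough.
    frames = np.arange(41)
    trajectory = np.sin(2 * np.pi * frames / 40)[:, None]
    locations, targets = stable_points(trajectory)
    print(locations, targets.ravel())
```

The vectors returned as `targets` play the role of the first approximation of the event targets; they would then be handed to the refinement procedure.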
In S2BEL-TD, however, the convergence property of the iterative refinement procedure has not been mathematically established. This appendix proposes a new criterion for the termination of iterations as well as a mathematical proof for the convergence of iterations in that procedure. Also, some modifications are made to the original S2BEL-TD method to improve the robustness of its iterative refinement procedure in this respect. Experimental results confirm that the S2BEL-TD method can work well with these modifications.
A.1 Iterative Refinement Procedure in S2BEL-TD
As mentioned earlier, 4 to 5 iterations are generally required for the iterative refinement procedure described in Section 3.3.3 to shape up the event functions. However, this figure is merely empirical, and there is no guarantee that the procedure terminates. In other words, we cannot ensure the convergence of the iterative refinement procedure adopted in S2BEL-TD. In this section, we propose the following modifications to the original S2BEL-TD method. Firstly, the order of the refinement steps is changed, i.e., the refinement of event targets is carried out before that of event functions. Secondly, the minor lobes of those event functions which are considered for the recalculation of event targets are truncated before use in order to accelerate the convergence. Lastly, a new termination criterion is established. With these modifications, a mathematical proof of the convergence property of the iterative refinement procedure adopted in S2BEL-TD has been obtained. It is shown that the performance of the modified method is comparable to that of the original one while requiring fewer iterations.
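The modified ordering and the idea of an explicit termination criterion can be sketched as follows. This is only an illustration under stated assumptions, not the procedure of this appendix: both refinement steps are replaced by unconstrained least-squares solves, the minor-lobe truncation is omitted, and the stopping rule (relative decrease of the squared reconstruction error below `tol`) is one plausible choice. The names `refine`, `tol` and `max_iter` are hypothetical.

```python
import numpy as np

def refine(Y, Phi, tol=1e-6, max_iter=50):
    """Alternating least-squares refinement of Y ~= A @ Phi.

    Y:   (P, N) spectral parameters, one column per frame.
    Phi: (K, N) initial event functions, one row per event.
    Returns refined targets A (P, K), event functions Phi (K, N),
    and the squared reconstruction error after each iteration.
    """
    Y = np.asarray(Y, dtype=float)
    Phi = np.asarray(Phi, dtype=float)
    errors = []
    for _ in range(max_iter):
        # Step 1: refine the event targets with the event functions fixed.
        A = np.linalg.lstsq(Phi.T, Y.T, rcond=None)[0].T
        # Step 2: refine the event functions with the targets fixed.
        Phi = np.linalg.lstsq(A, Y, rcond=None)[0]
        errors.append(float(np.sum((Y - A @ Phi) ** 2)))
        # Assumed termination criterion: stop once the error no longer
        # decreases by more than a fraction tol of its previous value.
        if len(errors) > 1 and errors[-2] - errors[-1] <= tol * errors[-2]:
            break
    return A, Phi, errors

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    P, K, N = 10, 4, 60
    Y = rng.normal(size=(P, N))
    # Triangular initial event functions centred at evenly spaced frames.
    centres = np.linspace(5, N - 6, K)
    Phi0 = np.maximum(0.0, 1.0 - np.abs(np.arange(N) - centres[:, None]) / 15.0)
    A, Phi, errors = refine(Y, Phi0)
    print(len(errors), errors[0], errors[-1])
```

Because each half-step solves its least-squares problem exactly while the other factor is held fixed, the squared error cannot increase from one iteration to the next and is bounded below by zero. This is one standard way to argue that an alternating scheme of this kind settles; it is not assumed to be the argument used in this appendix.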