Strategy for extracting the phoneme boundaries from all candidates

6.1.1 An outline for the strategy

The discussion in Section 3.3 suggests a strategy: treat the labelling files obtained by other automatic segmentation methods as references, in the same way as the manual labelling file used in the experiments presented in Chapter 3. No manual labelling files are available as references when the proposed method is applied to real speech. However, continuous speech in English and in Japanese can be segmented and labelled by HTK and Julius, respectively, both of which are based on HMMs, and HMM-based software also exists for other languages. Thus, the labelling files obtained by HMM-based automatic segmentation methods can be used as references. The two approaches have complementary strengths. HMM-based segmentation and labelling determines the number of boundaries, but the boundaries are not accurate enough compared to hand-made boundaries. The spectrum target prediction model, in contrast, always yields candidates close to the hand-made boundaries, but there have been no suitable rules for choosing these accurate boundaries directly from a number of candidates. Thus, adjusting the HMM-based boundaries to the accurate boundaries obtained by the spectrum target prediction model is a possible strategy for extracting the phoneme boundaries from all candidates.

In addition, as discussed in Section 5.3 on the advantage of the spectrum target prediction model, there are always peaks near the manual boundaries, as shown in Fig. 5.2. Thus, if the boundaries in human perception can be estimated from the existing labelling files produced by the HMM-based method, it should be possible to obtain boundaries that are more precise with respect to the hand-made ones.

6.1.2 Implementation of the strategy

To implement this strategy, the labelling files obtained by the HMM-based method are needed. Fig. 6.1 gives an example of such a labelling file, which is a revised output of time alignment by HTK. Each row in this file contains three columns of data. The first number is the start time of a phoneme and the second number is its end time. For the proposed method, both times are given in milliseconds. The third column is the phoneme.

In the original labelling files produced by HTK, the unit of time is 100 nanoseconds, so each time value should be divided by 10000 to convert it to milliseconds. Besides, each row contains an additional field, which is the same as the third column in the sample labelling file. It is used to identify the word sequence^[17] and is not used in this research.
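The unit conversion described above can be sketched as follows. This is a minimal illustration, assuming whitespace-separated rows of the form start, end, phoneme, with any trailing fields ignored; the function name `htk_line_to_ms` and the sample row are hypothetical and not taken from the original work.

```python
HTK_UNITS_PER_MS = 10000  # 1 ms = 10,000 units of 100 ns


def htk_line_to_ms(line):
    """Convert one HTK alignment row to (start_ms, end_ms, phoneme).

    Assumes the first two fields are integer times in 100 ns units and the
    third field is the phoneme; further fields are ignored.
    """
    fields = line.split()
    start_ms = int(fields[0]) / HTK_UNITS_PER_MS
    end_ms = int(fields[1]) / HTK_UNITS_PER_MS
    return start_ms, end_ms, fields[2]


# Hypothetical row with an extra trailing field that is not used here.
print(htk_line_to_ms("0 1200000 sil sil"))  # (0.0, 120.0, 'sil')
```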

Figure 6.1: A sample labelling file for HMM-based segmentation.

To estimate the boundaries in human perception, one more file is needed, as shown in Fig. 6.2. Each row of this file contains five columns of data. The first three numbers are the mean error, the maximum error and the minimum error for a boundary, respectively. The remaining two columns record the phoneme before the boundary and the phoneme after the boundary. This file can be obtained statistically by calculating the errors between all automatic segmentation boundaries produced by the HMM-based method and the corresponding manual boundaries. The revised manual labelling file from the TIMIT database that corresponds to the HMM-based labelling file in Fig. 6.1 is shown in Fig. 6.3. As in the labelling file in Fig. 6.1, the first number in each row is the start time of a phoneme, the second number is its end time and the third column is the phoneme, in the same order as in the HMM-based labelling file.

As with the original labelling files produced by HTK, the times in the original manual labelling files of the TIMIT database need conversion: the times in the phonetic transcriptions are recorded as sampling points, which should be divided by 16 to convert them to milliseconds.
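A comparable sketch for the manual labelling files is given below. It assumes rows of the form start sample, end sample, phoneme, and 16 samples per millisecond (a 16 kHz sampling rate, which matches the division by 16 stated above); the function name and the sample row are hypothetical.

```python
SAMPLES_PER_MS = 16  # 16 kHz sampling rate means 16 samples per millisecond


def timit_line_to_ms(line):
    """Convert one phonetic transcription row from samples to milliseconds."""
    start, end, phoneme = line.split()
    return int(start) / SAMPLES_PER_MS, int(end) / SAMPLES_PER_MS, phoneme


# Hypothetical row: 3050 samples correspond to 190.625 ms.
print(timit_line_to_ms("0 3050 h#"))  # (0.0, 190.625, 'h#')
```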

Figure 6.2: A list to describe the mean errors for each boundary.

Figure 6.3: A sample labelling file for hand-made segmentation in human perception.

The error list file in Fig. 6.2 is obtained from the HMM-based labelling files and the corresponding manual labelling files in several steps, as shown in Fig. 6.4. First, the errors between all HMM-based boundaries and the corresponding boundaries in human perception are calculated, and the phonemes before and after each boundary are recorded. Then, based on the results of the first step, the count and the total error are accumulated for each type of boundary, that is, for boundaries whose preceding phonemes are the same and whose following phonemes are the same. Finally, the mean error for each type of boundary is calculated, and the maximum and minimum errors are obtained at the same time.
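The three steps can be sketched roughly as below. This is an illustration under stated assumptions, not the original implementation: labels are lists of (start_ms, end_ms, phoneme) tuples in the same phoneme order, a boundary is taken as the end time of each phoneme except the last, and the error is defined as the HMM-based boundary minus the manual boundary. That sign is an assumption chosen to be consistent with the search range given later in this section. The name `build_error_list` and the sample data are hypothetical.

```python
from collections import defaultdict


def build_error_list(hmm_labels, manual_labels):
    """Return {(before, after): (mean_error, max_error, min_error)} in ms."""
    errors = defaultdict(list)
    # Step 1: error of every boundary, keyed by the surrounding phonemes.
    for i in range(len(hmm_labels) - 1):
        before = hmm_labels[i][2]
        after = hmm_labels[i + 1][2]
        # Assumed sign convention: HMM-based boundary minus manual boundary.
        errors[(before, after)].append(hmm_labels[i][1] - manual_labels[i][1])
    # Steps 2 and 3: count and total per boundary type, then mean, max, min.
    return {
        pair: (sum(values) / len(values), max(values), min(values))
        for pair, values in errors.items()
    }


# Hypothetical labels for one utterance (times in milliseconds).
hmm = [(0, 100, "sil"), (100, 180, "a"), (180, 260, "sil"),
       (260, 330, "a"), (330, 400, "t")]
manual = [(0, 90, "sil"), (90, 185, "a"), (185, 240, "sil"),
          (240, 335, "a"), (335, 400, "t")]
for pair, stats in build_error_list(hmm, manual).items():
    print(pair, stats)
```

In practice the errors would be pooled over all utterances of a corpus before the statistics are taken; the sketch handles a single utterance only to stay short.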

Figure 6.4: Flow-process diagram for obtaining the error list file.

Based on the error list file, a block diagram implementing this strategy for each speech signal is shown in Fig. 6.5. The speech signal and the HMM-based labelling file are the inputs, and the labelling file produced by the proposed method is the output. When an HMM-based boundary is provided to the proposed method, the mean error for this boundary is first looked up in the error list file according to the preceding phoneme and the following phoneme, and the absolute value of the mean error is then checked against 20 milliseconds. If it is less than 20 milliseconds (“Yes”), an estimated manual boundary is determined by moving the HMM-based boundary according to the mean error found, and the peak nearest to this estimate is selected as the boundary by the proposed method.

Otherwise, the range [hmm_boundary − max_error, hmm_boundary − min_error] for the estimated manual boundary is used, and the peak with the largest value within this range is selected as the automatic segmentation boundary. Here, hmm_boundary is the HMM-based boundary, and max_error and min_error are the maximum error and the minimum error of this boundary relative to the manual boundary, as looked up in the error list file.
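The decision rule for a single boundary can be sketched as follows. Several details are assumptions made for illustration: peaks from the spectrum target prediction model are represented as (time_ms, value) pairs, the estimated manual boundary is taken as the HMM-based boundary minus the mean error (consistent with the range above), and the HMM-based boundary is returned unchanged when no peak is available, a fallback that the text does not specify. The function name `select_boundary` is hypothetical.

```python
def select_boundary(hmm_boundary, error_stats, peaks, threshold_ms=20.0):
    """Pick the boundary time (ms) for one HMM-based boundary.

    error_stats: (mean_error, max_error, min_error) from the error list.
    peaks: list of (time_ms, value) candidates.
    """
    mean_error, max_error, min_error = error_stats
    if not peaks:
        return hmm_boundary  # assumed fallback, not specified in the text
    if abs(mean_error) < threshold_ms:
        # Shift the HMM-based boundary, then take the nearest peak.
        estimate = hmm_boundary - mean_error
        return min(peaks, key=lambda p: abs(p[0] - estimate))[0]
    # Otherwise search the range and take the peak with the largest value.
    low, high = hmm_boundary - max_error, hmm_boundary - min_error
    in_range = [p for p in peaks if low <= p[0] <= high]
    if not in_range:
        return hmm_boundary  # assumed fallback, not specified in the text
    return max(in_range, key=lambda p: p[1])[0]


# Hypothetical candidate peaks.
peaks = [(90.0, 0.4), (100.0, 0.9), (112.0, 0.7), (130.0, 0.5)]
print(select_boundary(120.0, (10.0, 25.0, -5.0), peaks))  # 112.0
print(select_boundary(120.0, (25.0, 35.0, 15.0), peaks))  # 100.0
```

In the first call the mean error is below the threshold, so the estimate is 110 ms and the nearest peak at 112 ms is chosen. In the second call the range is [85, 105] ms and the larger of the two peaks inside it, at 100 ms, is chosen.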

The mean error is compared with 20 milliseconds because the accuracy of automatic segmentation is generally measured in terms of what percentage of the automatically labelled boundaries are within a given time threshold (tolerance) of the manually labelled boundaries. A tolerance of 20 milliseconds has been most widely used for measuring phone segmentation quality^[22]. If the absolute error between an automatic segmentation boundary and the manual segmentation boundary is less than 20 milliseconds, the automatic segmentation boundary is treated as a suitable boundary.
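The tolerance-based accuracy measure can be illustrated with a short sketch. It assumes that automatic and manual boundaries are already paired one-to-one in the same order and uses a strict "less than" comparison, as in the text; the function name and the sample values are hypothetical.

```python
def boundary_accuracy(auto_boundaries, manual_boundaries, tolerance_ms=20.0):
    """Percentage of automatic boundaries within the tolerance of the manual ones."""
    pairs = list(zip(auto_boundaries, manual_boundaries))
    if not pairs:
        raise ValueError("at least one boundary pair is required")
    hits = sum(1 for auto, manual in pairs if abs(auto - manual) < tolerance_ms)
    return 100.0 * hits / len(pairs)


# Hypothetical boundaries in milliseconds; absolute errors are 5, 25, 5 and 10.
auto = [100.0, 205.0, 330.0, 480.0]
manual = [95.0, 230.0, 335.0, 470.0]
print(boundary_accuracy(auto, manual))  # 75.0
```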
