6 Text-to-Speech System
6.1 Mapping From Alphabet Character to Robot Parameters
Based on the observations of human voice generation introduced in Section 1.2.3, the author constructed a full mapping from all the English alphabet characters to the corresponding vowel and consonant sounds. This step is significant in developing the TTS system for the talking robot. Vowel sounds are generated by static articulation of the vocal apparatus, whereas consonant sounds are vocalized by dynamic motions of the apparatus. The generation of consonant sounds therefore requires dynamic control of the mechanical system. The generation of vowels, liquids, and fricatives was introduced in Section 3. This chapter presents the mechanism of the talking robot for nasal and plosive sounds.
6.1.1 Nasal Sounds of Mechanical System
For the generation of the nasal sounds /n/ and /m/, the sliding valve is opened to lead the air into the nasal cavity, as shown in Figure 6.2. The /n/ consonant is generated by closing the middle position of the vocal tract and then releasing the air to speak a vowel sound. For the /m/ consonant, the outlet part is closed first to stop the air, and then opened to vocalize a vowel. The difference between generating /n/ and /m/ is basically the position at which the vocal tract is narrowed. The sound spectra of nasal sounds are characterized by a first formant around 300 Hz and a sharp decrease in spectral power in the high-frequency range. The mechanical model's spectrum is similar to that of a human and satisfies these characteristics of nasal sounds.
Figure 6.2: Mechanism of nasal sound (sliding valve open)
6.1.2 Plosive Sounds of Mechanical System
In generating the plosive sounds /p/ and /t/, the mechanical system closes the sliding valve so that no air is released into the nasal cavity. By closing one point of the vocal tract, the air provided from the lung is stopped and compressed in the tract, as shown in Figure 6.3. Releasing the air then generates plosive consonant sounds such as /p/ and /t/. For example, the plosive sound /pa/ is vocalized by combining the dynamic motions for the plosive sound /p/ and the vowel sound /a/. The sound wave of a plosive sound has an impulse at the beginning, followed by the stable waveform of the vowel sound.
Figure 6.3: Mechanism of plosive sound (sliding valve closed)
6.1.3 Robot Parameters
The robot vector X is defined as follows:
\[
X = (\underbrace{x_1, x_2, \ldots, x_8}_{\text{vocal tract parameters}},\; x_{tg},\; x_{no},\; x_{un},\; x_{vl},\; x_{pt},\; x_t) \tag{6.1}
\]
One robot vector has 14 parameters as shown in equation 6.1.
The first eight parameters, x1 to x8, represent the vocal tract shape. Each of these parameters ranges from -1500 to 1500, where -1500 represents the closed state and 1500 the widest opening of the cross-sectional area of the vocal tract.
Parameter xtg represents the tongue position, and it has a value of up or down indicating the position of the tongue.
Parameter xno represents nasal cavity chamber control which has a value of open or closed.
Parameter xun is for unvoiced sound pitch control; its value is either yes or no indicating the air pathway to the nasal cavity chamber is either opened or closed.
Parameter xvl controls the sound volume, that is, the amount of airflow to the vocal cords. It takes one of five levels: low, mid-low, mid, mid-high, or high.
Parameter xpt, which is manipulated by the vocal cords motor value as introduced in section 3.2.1, controls the fundamental frequency of the sound sources.
Parameter xt indicates the duration of vocalization for the talking robot. This value varies between vowels and consonants. For simplicity, the author set the duration to 500 milliseconds for vowels and 50 milliseconds for consonants.
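The 14 parameters described above can be collected in one small record. The following is a minimal sketch only, not the author's implementation: the class name `RobotVector`, its field names, and the use of booleans for the two-valued parameters are assumptions made for illustration. The tract values for the vowel /a/ are taken from Table 6.2, and the volume and pitch are the default intonation values from Table 6.1.

```python
from dataclasses import dataclass
from typing import Tuple

# Range of the vocal tract parameters stated in the text:
# -1500 = closed, 1500 = widest open.
TRACT_MIN, TRACT_MAX = -1500, 1500
VOLUME_LEVELS = ("low", "mid-low", "mid", "mid-high", "high")


@dataclass(frozen=True)
class RobotVector:
    """Hypothetical container for the 14 parameters of Equation 6.1."""
    tract: Tuple[int, ...]  # x1..x8, vocal tract shape
    tongue_up: bool         # x_tg, tongue position (up or down)
    nose_open: bool         # x_no, nasal cavity chamber (open or closed)
    unvoiced: bool          # x_un, unvoiced sound flag (yes or no)
    volume: str             # x_vl, one of VOLUME_LEVELS
    pitch: int              # x_pt, vocal cords motor angle
    duration_ms: int        # x_t, duration of vocalization

    def __post_init__(self):
        # Reject vectors that fall outside the ranges given in the text.
        if len(self.tract) != 8:
            raise ValueError("expected 8 vocal tract values")
        for value in self.tract:
            if not TRACT_MIN <= value <= TRACT_MAX:
                raise ValueError(f"tract value {value} out of range")
        if self.volume not in VOLUME_LEVELS:
            raise ValueError(f"unknown volume level {self.volume!r}")
        if self.duration_ms <= 0:
            raise ValueError("duration must be positive")

    def as_tuple(self):
        """Flatten to the 14-element ordering of Equation 6.1."""
        return (*self.tract, self.tongue_up, self.nose_open,
                self.unvoiced, self.volume, self.pitch, self.duration_ms)


# Vowel /a/ with the default intonation (mid volume, pitch 0).
vowel_a = RobotVector(
    tract=(366, -366, 334, -301, 73, 513, -871, 480),
    tongue_up=False, nose_open=False, unvoiced=False,
    volume="mid", pitch=0, duration_ms=500,
)
print(len(vowel_a.as_tuple()))  # 14
```

Validating the ranges at construction time is a design choice of this sketch; it keeps an out-of-range motor value from reaching the hardware layer, which is not shown here.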
The values of xvl and xpt are determined by the intonation level that the user enters via the GUI. The relationship between sound volume and fundamental frequency was presented in Figure 3.20 in section 3.21. For this TTS system, the author divides the intonation input into 9 levels. The default intonation is the middle tone, level 5; if the user does not specify an intonation after entering the text, the intonation is kept at level 5.
The details of the nine intonation levels are shown in Table 6.1. The default intonation (level 5) is highlighted in gray. The robot parameters corresponding to the alphabetic characters are shown in Table 6.2.
Table 6.1: Intonation effect parameters
Intonation level   Sound volume (xvl)   Pitch (xpt, vocal cords motor angle)
1                  Low                  -100
2                  Mid-low              -80
3                  Mid-low              -50
4                  Mid                  -30
5 (default)        Mid                  0
6                  Mid-high             30
7                  Mid-high             50
8                  High                 80
9                  High                 100
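The intonation lookup of Table 6.1 can be sketched as a small dictionary with a default level. The names `INTONATION`, `DEFAULT_LEVEL`, and `intonation_params` are hypothetical and chosen only for this illustration; the values are copied from Table 6.1, and how the GUI passes the level to the system is not specified in the text.

```python
from typing import Optional, Tuple

DEFAULT_LEVEL = 5  # middle tone, used when the user gives no intonation

# Intonation level -> (sound volume x_vl, pitch x_pt as motor angle),
# copied from Table 6.1.
INTONATION = {
    1: ("low", -100),
    2: ("mid-low", -80),
    3: ("mid-low", -50),
    4: ("mid", -30),
    5: ("mid", 0),
    6: ("mid-high", 30),
    7: ("mid-high", 50),
    8: ("high", 80),
    9: ("high", 100),
}


def intonation_params(level: Optional[int] = None) -> Tuple[str, int]:
    """Return (volume, pitch) for a level, falling back to the default."""
    if level is None:
        level = DEFAULT_LEVEL
    if level not in INTONATION:
        raise ValueError(f"intonation level must be 1..9, got {level}")
    return INTONATION[level]


print(intonation_params())   # ('mid', 0)
print(intonation_params(8))  # ('high', 80)
```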
Table 6.2: Robot parameters corresponding to alphabetic characters
Char  M1    M2    M3    M4    M5    M6    M7    M8    Tongue  Nose   Unvoiced  Duration (ms)
a     366   -366  334   -301  73    513   -871  480   Down    Close  N         500
i     -741  920   -920  855   -562  171   122   -155  Down    Close  N         500
u     -920  513   -187  220   -611  708   -448  318   Down    Close  N         500
e     -122  334   -562  594   -334  -155  285   -350  Down    Close  N         500
o     -578  155   236   -269  -350  755   -755  366   Down    Close  N         500
b     370   -370  330   -300  70    510   -870  480   Up      Close  N         50
c     -500  630   -300  450   -570  750   -770  920   Up      Close  N         50
d     70    -370  330   -300  70    510   -870  480   Up      Close  N         50
f     -100  600   -700  200   -170  450   -770  920   Down    Close  Y         50
g     -578  155   236   -269  -350  887   -855  366   Down    Close  N         50
h     -370  -370  330   -300  70    510   -870  480   Up      Close  Y         50
j     -500  630   -300  450   -570  750   -770  920   Up      Close  N         50
k     -100  600   -700  200   -170  450   -770  920   Down    Close  N         50
l     0     0     0     0     0     0     0     0     Up      Close  N         50
m     1300  1000  -500  -20   -40   -70   180   -180  Down    Open   N         50
n     -700  540   -750  800   -660  520   -400  180   Up      Open   N         50
p     370   -370  330   -300  70    510   -870  480   Up      Close  N         50
q     -920  513   -187  220   -611  708   -448  318   Down    Close  N         50
r     0     0     0     0     0     0     0     0     Up      Close  N         50
s     -500  630   -300  450   -570  750   -770  920   Up      Close  Y         50
t     370   -370  330   -300  70    510   -870  480   Up      Close  N         50
v     -920  513   -187  220   -611  708   -448  318   Down    Close  Y         50
w     -800  900   -900  950   -900  500   -300  0     Up      Close  N         50
x     0     0     0     0     0     0     0     0     Up      Close  N         50
y     -920  513   -187  220   -611  708   -448  318   Down    Close  Y         50
z     -578  155   236   -269  -350  887   -855  366   Down    Close  Y         50
Stop  0     0     0     0     0     0     0     0     Down    Close  N         500
In Table 6.2, the highlight color of each alphabetic character indicates its sound class: yellow for vowels, no highlight for liquid sounds, blue with bold characters for plosive sounds, green for nasal sounds, and gray for fricative sounds.
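A character-by-character lookup in Table 6.2 could be sketched as follows. This is an illustration under stated assumptions rather than the author's TTS code: the names `TABLE_6_2`, `text_to_sequence`, and `total_duration_ms` are hypothetical, only four rows of the table are copied to keep the example short, and mapping a space to the Stop row is an assumption that the text does not confirm.

```python
from typing import Dict, List, Tuple

# Row layout: (M1..M8, tongue, nose, unvoiced, duration in ms)
Row = Tuple[Tuple[int, ...], str, str, bool, int]

# Subset of Table 6.2; the remaining rows would be added the same way.
TABLE_6_2: Dict[str, Row] = {
    "a": ((366, -366, 334, -301, 73, 513, -871, 480), "down", "close", False, 500),
    "m": ((1300, 1000, -500, -20, -40, -70, 180, -180), "down", "open", False, 50),
    "p": ((370, -370, 330, -300, 70, 510, -870, 480), "up", "close", False, 50),
    "stop": ((0, 0, 0, 0, 0, 0, 0, 0), "down", "close", False, 500),
}


def text_to_sequence(text: str) -> List[Tuple[str, Row]]:
    """Map each character to its table row, in order of appearance."""
    sequence = []
    for ch in text.lower():
        # Assumption: a space is rendered with the Stop row.
        key = "stop" if ch == " " else ch
        if key not in TABLE_6_2:
            raise ValueError(f"no robot parameters for {ch!r}")
        sequence.append((key, TABLE_6_2[key]))
    return sequence


def total_duration_ms(sequence: List[Tuple[str, Row]]) -> int:
    """Sum the per-character durations (last field of each row)."""
    return sum(row[-1] for _, row in sequence)


# /pa/: the plosive /p/ (50 ms) followed by the vowel /a/ (500 ms).
pa = text_to_sequence("pa")
print([key for key, _ in pa])  # ['p', 'a']
print(total_duration_ms(pa))   # 550
```

The sketch only orders the static targets; the dynamic motion between consecutive targets, which the consonants require, is outside its scope.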