Lip Motion Generation from Audio Signals based on Hidden Markov Models

Satoshi Nakamura
Graduate School of Information Science, Nara Institute of Science and Technology
8916-5 Takayama, Ikoma, Nara, 630-0101 Japan
E-mail: [email protected]

Speech recognition and computer lipreading have been developed as means of computer input. It is equally important to provide a natural and friendly interface on the output side. Recently, there has been increasing interest in using both the auditory and visual modalities in speech processing. In research on human perception in particular, the effect of integrating the auditory and visual modalities has been investigated through tests of audio, visual, and audio-visual speech intelligibility. Speech intelligibility in noisy environments is known to improve when visual information about the face is added to acoustic signals degraded by noise [1]. Furthermore, it has been verified that the labial part makes the largest contribution to the intelligibility score [2]. The results of these speech perception studies suggest that lip movement synthesis can play a significant role in human-machine communication.

This paper investigates synthesis methods for realizing human-like lip and face movements by mapping from speech. Speech carries richer information than text for synthesizing realistic lip movements, such as pitch frequency and phoneme duration. Mapping algorithms from speech to lip movements have been reported based on vector quantization (VQ) [3], artificial neural networks [4, 5], and Gaussian mixtures [8]. These methods map speech parameters to lip parameters frame by frame or frames by frames. They have two problems: 1) the mapping is fundamentally a complicated many-to-many mapping, and 2) extensive training is required to take context information into account. The required audio-visual database grows in proportion to the number of preceding or succeeding frames considered.
On the other hand, there is another approach that uses speech recognition techniques, such as phonetic segmentation [9] and HMMs [6, 7, 10, 11]. These methods convert speech into lip parameters based on units such as a phonetic segment, a word, a phoneme, or an acoustic event. The HMM-based method has the advantage that explicit phonetic information is available for modeling coarticulation effects caused by the surrounding phoneme contexts. We have shown that a mapping method based on the Viterbi decoding algorithm using audio HMMs (the MAP-V method) is more efficient than the VQ method [11].

However, the MAP-V method converts audio parameters to visual parameters through a single deterministic HMM state sequence. This deterministic process has a substantial problem: an incorrect HMM state sequence yields incorrect visual parameters. For example, if a bilabial consonant is decoded as a category with a different place of articulation, the synthesized lip movement appears incompatible with the speech to the audience. To solve this problem, we extend the MAP-V method to a non-deterministic process. This paper presents a new method that estimates visual parameters by applying the Expectation-Maximization algorithm (MAP-EM). The MAP-EM method repeatedly estimates visual parameters while maximizing the likelihood of the audio-visual joint probability of audio-visual HMMs [12]. The re-estimation can be regarded as the auto-association of a complete pattern from an incomplete pattern for time series.

In the experiments, the MAP-EM method is compared to the MAP-V method. The first method is our baseline method, MAP-V [11], which is composed of two processes: a decoding process, which converts input speech to the most likely HMM state sequence by the Viterbi algorithm, and a look-up table process, which converts each HMM state to the corresponding visual parameters. In the decoding process, the likelihood of the input speech given the k-th audio HMM, $M_k^A$, is defined as (1).

$$P(O^A \mid M_k^A) \approx \max_Q \Big\{ \pi_{q_1}(M_k^A) \times \prod_{t=1}^{T} a_{q_{t-1} q_t}(M_k^A)\, b_{q_t}(o^A(t) \mid M_k^A) \Big\} \qquad (1)$$

where $Q = q_1 \ldots q_T$ denotes an audio HMM state sequence, $O^A = o^A(1) \ldots o^A(T)$ is the sequence of input audio parameters, $\pi_j(M_k^A)$ is the initial state probability, $a_{ij}(M_k^A)$ is the transition probability from state $i$ to state $j$ of $M_k^A$, and $b_j(o^A(t) \mid M_k^A)$ is the output probability at state $j$. In formula (1), the optimal state sequence for the observation is obtained by Viterbi alignment. In the training step, the correspondence between each audio HMM state and the visual parameters is calculated along this alignment and stored in the look-up table. The visual parameters for each audio HMM state are obtained by averaging all visual parameters assigned to that state.

The quality of the visual parameters produced by the MAP-V method depends on the accuracy of the Viterbi alignment. Incorrect HMM states in the Viterbi alignment may produce wrong lip images.

The proposed MAP-EM method does not depend on the deterministic Viterbi alignment. It re-estimates visual parameters for the given audio parameters by the EM algorithm using audio-visual HMMs. Although the visual parameters do not exist initially, the required visual parameters are synthesized iteratively from initial parameters by a re-estimation procedure that maximizes the likelihood of the audio-visual joint probability of the audio-visual HMMs:

$$\hat{o}^V(t) = \arg\max_{o^V(t)} P(O^{A,V} \mid o^A(t), M_k^{AV}) \qquad (2)$$

where $\hat{o}^V(t)$ denotes the estimated visual parameters. The likelihood of the proposed method is derived by considering all HMM states at a time. To treat all states of all HMMs evenly, the likelihood of the audio-visual joint probability is defined as follows:

$$\sum_{Q\,(\text{all } k)} P(M_k^{AV})\, P(O^{A,V} \mid Q, M_k^{AV}) = \sum_{Q\,(\text{all } k)} P(M_k^{AV})\, \pi_{q_1}(M_k^{AV}) \times \prod_{t=1}^{T} a_{q_{t-1} q_t}(M_k^{AV})\, b_{q_t}(o^{A,V}(t) \mid M_k^{AV}) \qquad (3)$$

where $P(M_k^{AV})$ is the probability of the k-th HMM, and $\pi_j(M_k^{AV})$, $a_{ij}(M_k^{AV})$, and $b_j(o^{A,V}(t) \mid M_k^{AV})$ are the joint initial state probability, the joint transition probability, and the joint output probability of audio and visual parameters, respectively. The summation over $Q\,(\text{all } k)$ considers all models $M_k^{AV}$ at a time.

We carried out a comparative evaluation of the MAP-V and MAP-EM methods for lip movement synthesis using the Euclidean distance. The results showed that the MAP-EM method reduces the distance by about 37% compared to the MAP-V method at incorrectly decoded bilabial consonants. On the other hand, in frames correctly decoded by the MAP-V method, the MAP-EM method slightly blurred the correct articulation produced by the MAP-V method. We are extending the MAP-EM method to a full covariance matrix implementation in the audio-visual joint probability HMMs. The full covariances of audio and visual parameters should give more natural synthetic visual parameters for input speech.
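As an illustration of the MAP-V baseline described earlier (Viterbi decoding followed by a look-up table of per-state visual averages), the following minimal sketch may help. It is not the author's implementation: it assumes a single audio HMM with scalar Gaussian outputs, and the function names (`viterbi`, `build_lookup`, `map_v`) and all toy parameter values are hypothetical.

```python
import math

def safe_log(p):
    """Natural log that maps zero probability to -inf."""
    return math.log(p) if p > 0 else float("-inf")

def log_gauss(x, mean, var):
    """Log density of a scalar Gaussian output probability b_j(o)."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def viterbi(audio, pi, a, means, variances):
    """Return the most likely state sequence q_1..q_T for scalar audio frames."""
    n = len(pi)
    delta = [safe_log(pi[j]) + log_gauss(audio[0], means[j], variances[j])
             for j in range(n)]
    backptrs = []
    for x in audio[1:]:
        prev = delta
        # Best predecessor state i for each current state j.
        ptr = [max(range(n), key=lambda i: prev[i] + safe_log(a[i][j]))
               for j in range(n)]
        delta = [prev[ptr[j]] + safe_log(a[ptr[j]][j])
                 + log_gauss(x, means[j], variances[j]) for j in range(n)]
        backptrs.append(ptr)
    state = max(range(n), key=lambda j: delta[j])
    path = [state]
    for ptr in reversed(backptrs):  # trace the best path backwards
        state = ptr[state]
        path.append(state)
    return path[::-1]

def build_lookup(train_audio, train_visual, hmm):
    """Training step: average the visual parameters aligned to each state."""
    n = len(hmm[0])
    sums, counts = [0.0] * n, [0] * n
    for audio, visual in zip(train_audio, train_visual):
        for state, v in zip(viterbi(audio, *hmm), visual):
            sums[state] += v
            counts[state] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]

def map_v(audio, hmm, table):
    """Synthesis step: decode the audio, then look up one visual value per state."""
    return [table[state] for state in viterbi(audio, *hmm)]

# Hypothetical toy model: two left-to-right states, one scalar audio feature.
hmm = ([1.0, 0.0], [[0.5, 0.5], [0.0, 1.0]], [0.0, 4.0], [1.0, 1.0])
train_audio = [[0.1, -0.2, 3.9, 4.2]]
train_visual = [[0.0, 0.2, 1.0, 0.8]]  # e.g. lip opening per frame
table = build_lookup(train_audio, train_visual, hmm)
lips = map_v([0.3, 3.5, 4.1], hmm, table)
print(lips)
```

Because every frame is assigned to exactly one state, a single decoding error replaces the visual output of that frame entirely, which is the weakness attributed to the MAP-V method above.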
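The MAP-EM re-estimation loop can be sketched in a similarly hedged way. The sketch below assumes a single audio-visual HMM (so the sum over models reduces to one term) whose joint output probability factorizes into a scalar audio Gaussian and a scalar visual Gaussian, i.e. a diagonal covariance. Under that assumption, maximizing the EM auxiliary function with respect to the visual parameter of each frame gives a precision-weighted average of the state visual means, weighted by the state posteriors. This update rule, the function names (`posteriors`, `map_em`), and the toy values are assumptions for illustration and are not taken from the paper.

```python
import math

def gauss(x, mean, var):
    """Density of a scalar Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def posteriors(audio, visual, pi, a, a_mean, a_var, v_mean, v_var):
    """State posteriors gamma[t][j] from a scaled forward-backward pass
    over the joint audio-visual output probabilities."""
    n, T = len(pi), len(audio)
    b = [[gauss(audio[t], a_mean[j], a_var[j]) * gauss(visual[t], v_mean[j], v_var[j])
          for j in range(n)] for t in range(T)]
    alpha, scale = [], []
    for t in range(T):
        if t == 0:
            row = [pi[j] * b[0][j] for j in range(n)]
        else:
            row = [sum(alpha[-1][i] * a[i][j] for i in range(n)) * b[t][j]
                   for j in range(n)]
        c = sum(row)
        alpha.append([r / c for r in row])  # normalize to avoid underflow
        scale.append(c)
    beta = [[1.0] * n for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(a[i][j] * b[t + 1][j] * beta[t + 1][j] for j in range(n))
                   / scale[t + 1] for i in range(n)]
    gamma = []
    for t in range(T):
        row = [alpha[t][j] * beta[t][j] for j in range(n)]
        z = sum(row)
        gamma.append([r / z for r in row])
    return gamma

def map_em(audio, init_visual, hmm, n_iter=10):
    """Iteratively re-estimate the visual parameters for fixed audio."""
    v_mean, v_var = hmm[4], hmm[5]
    visual = list(init_visual)
    for _ in range(n_iter):
        gamma = posteriors(audio, visual, *hmm)  # E-step: soft state occupancy
        # M-step on the visual parameters: precision-weighted mean over all states.
        visual = [sum(g * m / s for g, m, s in zip(row, v_mean, v_var))
                  / sum(g / s for g, s in zip(row, v_var)) for row in gamma]
    return visual

# Hypothetical toy audio-visual HMM: (pi, a, audio means, audio variances,
# visual means, visual variances) for two left-to-right states.
av_hmm = ([1.0, 0.0], [[0.5, 0.5], [0.0, 1.0]],
          [0.0, 4.0], [1.0, 1.0], [0.1, 0.9], [0.05, 0.05])
estimate = map_em([0.3, 3.5, 4.1], [0.5, 0.5, 0.5], av_hmm)
print(estimate)
```

Here the initial visual parameters are a flat guess; one option, assumed rather than confirmed by the text, would be to start from the output of a MAP-V pass. Unlike the look-up table, each frame is a blend over all states, which softens the effect of a wrong state but can also blur frames that a hard decoding would have handled correctly.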

References

[1] Sumby, W. and Pollack, I.: Visual Contribution to Speech Intelligibility in Noise, Journal of the Acoustical Society of America, Vol. 26, pp. 212-215 (1954).
[2] Summerfield, Q.: Use of Visual Information for Phonetic Perception, Phonetica, Vol. 36, pp. 314-331 (1979).
[3] Morishima, S., Aizawa, K. and Harashima, H.: An Intelligent Facial Image Coding Driven by Speech and Phoneme, ICASSP 89, pp. 1795-1798 (1989).
[4] Morishima, S. and Harashima, H.: A Media Conversion from Speech to Facial Image for Intelligent Man-Machine Interface, IEEE Journal on Sel. Areas in Communications, Vol. 9, No. 4, pp. 594-600 (1991).
[5] Lavagetto, F.: Converting Speech into Lip Movements: A Multimedia Telephone for Hard of Hearing People, IEEE Trans. on Rehabilitation Engineering, Vol. 3, No. 1, pp. 90-102 (1995).
[6] Simons, A. and Cox, S.: Generation of Mouthshape for a Synthetic Talking Head, Proc. of the Institute of Acoustics, Vol. 12, No. 10 (1990).
[7] Chou, W. and Chen, H.: Speech Recognition for Image Animation and Coding, ICASSP 95, pp. 2253-2256 (1995).
[8] Rao, R.R. and Chen, T.: Cross-Modal Prediction in Audio-Visual Communication, Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, Vol. 4, pp. 2056-2059 (1996).
[9] Goldenthal, W., Waters, K., Van Thong, J.M. and Glickman, O.: Driving Synthetic Mouth Gestures: Phonetic Recognition for FaceMe!, Eurospeech'97 Proceedings, Vol. 4, pp. 1995-1998 (1997).
[10] Chen, T. and Rao, R.: Audio-Visual Interaction in Multimedia Communication, ICASSP 97, pp. 179-182 (1997).
[11] Yamamoto, E., Nakamura, S. and Shikano, K.: Speech-to-Lip Movement Synthesis by HMM, ESCA Workshop of Audio V…
[12] Nakamura, S., Yamamoto, E. and Shikano, K.: Speech-to-Lip Movement Synthesis Maximizing Audio-Visual Joint Probability Based on EM Algorithm, IEEE Multimedia Signal Processing Workshop, MMSP98 (1998).
