Lip Motion Generation from Audio Signals based on Hidden Markov Models
algorithm and a look-up table process which converts an HMM state to the corresponding visual parameters. In the decoding process, the likelihood of the input speech under the k-th audio HMM, $M_k^A$, is defined as (1).

$$P(O^A \mid M_k^A) \simeq \max_{Q} \Big\{ \pi_{q_1}(M_k^A) \times \prod_{t=1}^{T} a_{q_{t-1} q_t}(M_k^A)\, b_{q_t}(o^A(t) \mid M_k^A) \Big\}, \qquad (1)$$

where $Q = q_1 \ldots q_T$ denotes an audio HMM state sequence, $O^A = o^A(1) \ldots o^A(T)$ is a sequence of input audio parameters, $\pi_j(M_k^A)$ is an initial state probability, $a_{ij}(M_k^A)$ is a transition probability from state $i$ to state $j$ of $M_k^A$, and $b_j(o^A(t) \mid M_k^A)$ is an output probability at state $j$. In formula (1), the optimal state sequence for the observation is obtained by a Viterbi alignment. Along the alignment, the correspondence between each audio HMM state and the visual parameters is then calculated and stored in the look-up table in the training step. The visual parameters for each audio HMM state are obtained by averaging all visual parameters assigned to that state.

The quality of the visual parameters produced by the MAP-V method depends on the accuracy of the Viterbi alignment. Incorrect HMM states in the Viterbi alignment may produce wrong lip images.

The proposed MAP-EM method does not depend on the deterministic Viterbi alignment. It re-estimates the visual parameters for the given audio parameters by the EM algorithm using audio-visual HMMs. Although the visual parameters do not exist initially, the required visual parameters are synthesized iteratively from initial parameters by a re-estimation procedure that maximizes the likelihood of the audio-visual joint probability of the audio-visual HMMs:

$$\hat{o}^V(t) = \arg\max_{o^V(t)} P(O^{A,V} \mid o^A(t), M^{AV}), \qquad (2)$$

where $\hat{o}^V(t)$ denotes the estimated visual parameters. The likelihood of the proposed method is derived by considering all HMM states at a time. To treat all states of all HMMs evenly, the likelihood of the audio-visual joint probability is defined as follows.

$$\sum_{Q(\text{all } k)} P(M_k^{AV})\, P(O^{A,V} \mid Q, M_k^{AV}) = \sum_{Q(\text{all } k)} P(M_k^{AV})\, \pi_{q_1}(M_k^{AV}) \times \prod_{t=1}^{T} a_{q_{t-1} q_t}(M_k^{AV})\, b_{q_t}(o^{A,V}(t) \mid M_k^{AV}), \qquad (3)$$

where $P(M_k^{AV})$ is the probability of the k-th HMM, and $\pi_j(M_k^{AV})$, $a_{ij}(M_k^{AV})$ and $b_j(o^{A,V}(t) \mid M_k^{AV})$ are the joint initial state probability, the joint transition probability and the joint output probability of audio and visual parameters, respectively. The summation over $Q(\text{all } k)$ considers all models $M_k^{AV}$ at a time.

We carried out a comparative evaluation of the MAP-V and MAP-EM methods for lip movement synthesis using the Euclidean distance. The results showed that the MAP-EM method attains about a 37% distance reduction compared to the MAP-V method for incorrectly decoded bilabial consonants. On the other hand, in the frames correctly decoded by the MAP-V method, the MAP-EM method slightly blurred the correct articulation produced by the MAP-V method.

We are extending the MAP-EM method to a full covariance matrix implementation in the audio-visual joint probability HMMs. The full covariances of audio and visual parameters will give more natural synthetic visual parameters for an input speech.
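The MAP-V procedure described above (Viterbi alignment followed by a per-state look-up table of averaged visual parameters) can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a single audio HMM with scalar Gaussian output densities and scalar visual parameters, and the names `viterbi`, `build_lookup`, `synthesize` and `MODEL`, as well as all numeric values, are hypothetical and chosen only for illustration.

```python
import math
from collections import defaultdict

def safe_log(p):
    """Log that maps a zero probability to -inf instead of raising."""
    return math.log(p) if p > 0.0 else float("-inf")

def log_gauss(x, mean, var):
    """Log density of a 1-D Gaussian, standing in for the output probability b_j."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def viterbi(obs, pi, a, means, variances):
    """Most likely state sequence for a 1-D audio observation sequence."""
    n = len(pi)
    delta = [safe_log(pi[j]) + log_gauss(obs[0], means[j], variances[j])
             for j in range(n)]
    back = []  # back[t-1][j] = best predecessor of state j at time t
    for x in obs[1:]:
        prev, ptr, delta = delta, [], []
        for j in range(n):
            best = max(range(n), key=lambda i: prev[i] + safe_log(a[i][j]))
            ptr.append(best)
            delta.append(prev[best] + safe_log(a[best][j])
                         + log_gauss(x, means[j], variances[j]))
        back.append(ptr)
    state = max(range(n), key=lambda j: delta[j])
    path = [state]
    for ptr in reversed(back):  # trace the best path backwards
        state = ptr[state]
        path.append(state)
    return path[::-1]

def build_lookup(audio_seqs, visual_seqs, model):
    """Training step: average the visual parameters aligned to each state."""
    acc = defaultdict(lambda: [0.0, 0])  # state -> [sum, count]
    for audio, visual in zip(audio_seqs, visual_seqs):
        for state, v in zip(viterbi(audio, *model), visual):
            acc[state][0] += v
            acc[state][1] += 1
    return {s: total / count for s, (total, count) in acc.items()}

def synthesize(audio, table, model):
    """Decoding step: Viterbi-align the audio, then look up visual parameters."""
    return [table[s] for s in viterbi(audio, *model)]

# Hypothetical two-state model: (pi, a, means, variances).
MODEL = ([0.5, 0.5], [[0.8, 0.2], [0.2, 0.8]], [0.0, 5.0], [1.0, 1.0])

table = build_lookup([[0.1, -0.2, 5.1, 4.9]], [[1.0, 3.0, 10.0, 12.0]], MODEL)
print(synthesize([0.3, 4.7], table, MODEL))  # -> [2.0, 11.0]
```

Because the output is read directly from the decoded state sequence, any misaligned frame returns the average for the wrong state, which is the weakness of MAP-V that the text attributes to Viterbi alignment errors.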