Speech-To-Face Movement Synthesis Based on HMMs
[...] parameterized at 14 three-dimensional positions (half of the face) at orofacial locations. The positions are recorded at 100 Hz, simultaneously with the speech, which is recorded at 16 kHz. Since head motion strongly influences the positions of the face surface points, the absolute head position must be removed. Therefore, the head motion is also measured using 4 markers attached to the head (for marker positions, see Fig. 3b). These 4 reference markers make it possible to calculate the facial marker positions relative to the origin of the coordinate axes.

Figure 2: Face model. a. Hardware texturing model. b. Wire frame model.

Figure 3: a. OPTOTRAK. b. Marker positions for recording head and facial motion.

3.3. Analysis

Face position data are recorded and reduced by applying PCA [6]. The recorded face position data of each frame are expressed as a column vector $f$ containing $N \times 3$ entries, where $N$ is the number of markers on the face and the $x$, $y$ and $z$ values give the position of each 3-D node. $F$ is a matrix consisting of the $K$ frame column vectors $f$:

$$F = [f_1, f_2, \ldots, f_K]. \tag{1}$$

The "silent face" $\mu_f$ is defined as a face with closed mouth. It is subtracted from each column of $F$, generating the matrix $\Delta F$ of facial parameters relative to the silent face:

$$\Delta F = [\Delta f_1, \Delta f_2, \ldots, \Delta f_K] = F - \mu_f. \tag{2}$$

The principal components of $F$ can be calculated by applying singular value decomposition to the covariance matrix

$$C_f = \Delta F \, \Delta F^t, \tag{3}$$

yielding

$$C_f = U S U^t.$$

$U$ is a unitary matrix whose columns contain the eigenvectors of $C_f$. $S$ is a diagonal matrix whose diagonal entries are the respective eigenvalues.

$$U = [u_1, u_2, \ldots, u_K]. \tag{4}$$

3.4. Facial Parameter

Each eigenvalue in $S$ is the variance accounted for by the respective eigenvector, and each eigenvector in $U$ is the basis vector associated with that eigenvalue. $U$ contains the principal components, which can express any facial deformation $\Delta f_t$ by

$$\Delta f_t = U \alpha_t, \tag{5}$$

where $\alpha_t$ is the vector of principal component coefficients calculated by

$$\alpha_t = U^t \Delta f_t, \tag{6}$$

where $U$ corresponds to the facial deformations from the silent face and $\alpha_t$ corresponds to the weight of each facial deformation from the silent face. Therefore, each deformation $\Delta f_t$ varies between $\alpha_{t\max}$ and $\alpha_{t\min}$. Since each $\Delta f_t$ has two directions (plus and minus), two deformation faces are needed to express the face corresponding to $f_t$. Finally, the first $L$ column vectors of the principal components, those with the highest eigenvalues, are used as facial parameters.

3.5. Face Synthesis

A face must be deformed from the measured control data points. We apply cluster deformation, which assigns one control point to each cluster of the NURBS surface. The deformation weights are determined according to the distance from the control point. Once cluster deformation has been applied to the silent face, deformation faces can be generated by a weighted linear combination of the PCA faces, which are composed of $\alpha_{t\max}$ and $\alpha_{t\min}$. Fig. 4 shows the PCA faces generated from a synchronous speech and facial image database composed of 216 phonetically balanced Japanese words.

3.6. HMM-based Method for Facial Parameter Generation

The HMM-based method used in this paper maps input audio parameters to facial parameters through audio HMM states determined by the Viterbi alignment, as in speech recognition. The Viterbi alignment assigns each input frame to the optimal audio HMM state by maximizing the likelihood of the input audio parameters. The HMM-Viterbi method is composed of two processes: a decoding process that converts the input audio parameters to the most likely audio HMM state sequence by the Viterbi alignment, and a table look-up process that converts the HMM states to the corresponding facial parameters.
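The decoding process can be illustrated with a minimal log-domain Viterbi sketch. It assumes that the per-frame log-likelihood of every HMM state has already been computed from the audio parameters (the Gaussian mixture evaluation is omitted), and the names `viterbi_align`, `log_init`, `log_trans` and `log_obs` are hypothetical, not taken from the paper.

```python
import numpy as np

def viterbi_align(log_init, log_trans, log_obs):
    """Return the most likely HMM state index for every input frame.

    log_init:  (S,)   log initial-state probabilities
    log_trans: (S, S) log transition probabilities (row = from, column = to)
    log_obs:   (T, S) log-likelihood of each frame under each state
    """
    T, S = log_obs.shape
    score = np.empty((T, S))            # best log-score of a path ending in each state
    back = np.zeros((T, S), dtype=int)  # best predecessor state for each (frame, state)
    score[0] = log_init + log_obs[0]
    for t in range(1, T):
        # cand[i, j]: score of being in state i at t-1 and moving to state j at t
        cand = score[t - 1][:, None] + log_trans
        back[t] = np.argmax(cand, axis=0)
        score[t] = np.max(cand, axis=0) + log_obs[t]
    # Trace the best path backwards from the best final state.
    path = np.empty(T, dtype=int)
    path[-1] = np.argmax(score[-1])
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path

# Toy example with two states and four frames (all numbers are made up).
log_init = np.log(np.array([0.9, 0.1]))
log_trans = np.log(np.array([[0.7, 0.3], [0.1, 0.9]]))
log_obs = np.log(np.array([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]]))
print(viterbi_align(log_init, log_trans, log_obs).tolist())
```

For forced alignment against a known transcription, the same routine could be used with a transition matrix restricted to the state sequence of the transcribed phonemes; that restriction is an assumption of this sketch rather than a detail given in the text.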
Figure 4: PCA faces generated from 216 phonetically balanced Japanese words.

Fig. 5 shows the procedure for constructing the look-up table, that is, for training the facial parameters of each audio HMM state. All training data can be analyzed into synchronous audio-visual parameter sequences. Since the audio parameter sequences can be converted to HMM state sequences, the synchronous facial parameter sequences can be segmented per HMM state. The facial parameters assigned to the same HMM state are averaged to generate a representative facial parameter for that state. The training and synthesis algorithms are as follows.

Figure 5: Viterbi alignment (assigns an HMM state to each frame deterministically).

[Training]
1. Train the acoustic phoneme HMMs using the training audio parameters.
2. Align the audio parameter sequences of the training data into HMM state sequences by forced Viterbi alignment using the given transcriptions.
3. Take the average of the synchronous facial parameters over all frames associated with the same HMM state to generate the look-up table.

[Synthesis]
1. Align an input audio parameter sequence into an HMM state sequence using the Viterbi alignment.
2. Retrieve the output facial parameters associated with each HMM state using the look-up table of facial parameters per HMM state.

3.7. HMM-Viterbi Method with Context Dependency

Notable errors are found at /h/, /a/ and silence at word beginnings (Fig. 6), because the face configurations of those phonemes depend on the preceding and succeeding phonemes. To deal with this problem, [4] proposed an HMM method that considers the succeeding phoneme, HMM-SV. The HMM-SV method continues to use context-independent models but synthesizes facial parameters with context dependency. The method generates context-dependent facial parameters by looking ahead to the succeeding phonemes in an HMM state sequence.
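The table look-up training and synthesis steps, together with the look-ahead to the succeeding phoneme, can be sketched as follows. This is a minimal illustration under stated assumptions: frames are already aligned to HMM states, each frame carries a phoneme label, and the helper names (`train_lookup_table`, `synthesize`, `succeeding_context_keys`) and the `viseme_of` mapping are hypothetical. Adjacent frames with the same phoneme label are treated as one phoneme segment.

```python
from collections import defaultdict
import numpy as np

def train_lookup_table(keys, facial_params):
    """Average the facial parameter vectors of all frames that share a key."""
    sums = {}
    counts = defaultdict(int)
    for key, vec in zip(keys, facial_params):
        vec = np.asarray(vec, dtype=float)
        sums[key] = sums[key] + vec if key in sums else vec.copy()
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}

def synthesize(keys, table):
    """Retrieve one facial parameter vector per frame from the table."""
    return np.stack([table[key] for key in keys])

def succeeding_context_keys(states, phonemes, viseme_of):
    """Pair each frame's state with the viseme class of the next phoneme.

    Frames of the last phoneme have no successor and get None.
    """
    next_class = None
    reversed_keys = []
    # Walk backwards so the successor of each phoneme segment is known.
    for t in range(len(states) - 1, -1, -1):
        if t + 1 < len(states) and phonemes[t + 1] != phonemes[t]:
            next_class = viseme_of[phonemes[t + 1]]
        reversed_keys.append((states[t], next_class))
    return reversed_keys[::-1]

# Context-independent table: the key is the HMM state alone.
states = [0, 0, 1, 1]
params = [[1.0], [3.0], [10.0], [14.0]]   # made-up one-dimensional facial parameters
table = train_lookup_table(states, params)
print(synthesize([0, 1, 1], table))

# Context-dependent keys: (state, viseme class of the succeeding phoneme).
print(succeeding_context_keys([0, 1, 5, 6], ["a", "a", "m", "m"], {"a": 0, "m": 1}))
```

Using the plain state index as the key corresponds to the context-independent table, while the tuple keys give a table with succeeding-phoneme context. A key that never occurred in training raises `KeyError` in this sketch; how unseen contexts are handled is not specified in the text.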
Instead of phoneme contexts, the method uses clustered viseme contexts to reduce computation. The algorithm for training facial parameters in the HMM-SV method is basically similar to that of the HMM-Viterbi method. The difference between the training algorithms is the way the facial parameters are averaged for each HMM state. The HMM-Viterbi method averages the facial parameters within the same HMM state. The HMM-SV method averages the facial parameters by HMM state and by the viseme class of the succeeding phoneme.

In this paper, a new method that considers both preceding and succeeding phonemes, HMM-PSV, is proposed. The visemes taken into account in this paper are classified into 3 classes by merging the facial parameters associated with the first and end states of the 43 phoneme HMMs. Clustering is conducted using the PCA1 facial parameter. The training algorithm of the HMM-PSV method is derived similarly, averaging the facial parameters by HMM state and by the viseme classes of the succeeding and preceding phonemes.

4. FACIAL MOVEMENT SYNTHESIS EXPERIMENTS

Speech and facial image data are recorded synchronously at 100 Hz. A facial image is parameterized at 14 positions in three dimensions on half of the face, including five positions around the outer lip contour. The facial parameters are represented by the principal components. The first five eigenvectors account for more than 99% of the variance observed in the data. These parameters are represented by the weights for PCA1+, PCA1-, ..., PCA5+, PCA5-. Speech signals are parameterized into 12th-order mel-cepstral coefficients, their delta coefficients and delta log power. 16-component Gaussian mixture HMMs are trained for 41 phonemes, the pre-word pause and the post-word pause (43 HMMs in total). The synchronous audio and facial database is composed of 216 phonetically balanced Japanese words for training and another 100 words for testing.
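The analysis of Section 3.3 and the choice of the number of retained eigenvectors can be sketched in a few lines of NumPy. This is an illustration under assumptions, not the authors' implementation: the function name `pca_facial_parameters`, the `var_threshold` argument and the random test data are hypothetical, and the matrix is built from deviations from the silent face exactly as in Eqs. (2) and (3), without further normalization.

```python
import numpy as np

def pca_facial_parameters(F, silent_face, var_threshold=0.99):
    """PCA of marker data relative to the silent face.

    F:            (3N, K) marker coordinates, one column per frame
    silent_face:  (3N,)   marker coordinates of the closed-mouth face
    Returns the first L eigenvectors, the coefficients alpha (L, K),
    and the cumulative variance ratio of all eigenvalues.
    """
    dF = F - silent_face[:, None]       # Eq. (2): deformation from the silent face
    C = dF @ dF.T                       # Eq. (3): covariance-type matrix
    U, s, _ = np.linalg.svd(C)          # C = U S U^t, eigenvalues in descending order
    ratio = np.cumsum(s) / np.sum(s)    # variance explained by the first i+1 vectors
    L = int(np.searchsorted(ratio, var_threshold)) + 1
    U_L = U[:, :L]                      # first L principal components
    alpha = U_L.T @ dF                  # Eq. (6): coefficients per frame
    return U_L, alpha, ratio

# Made-up data: 14 markers x 3 coordinates, 200 frames driven by 3 latent factors.
rng = np.random.default_rng(0)
silent = rng.normal(size=42)
F = silent[:, None] + rng.normal(size=(42, 3)) @ rng.normal(size=(3, 200))
U_L, alpha, ratio = pca_facial_parameters(F, silent)
reconstructed = silent[:, None] + U_L @ alpha   # Eq. (5) added back to the silent face
print(U_L.shape, alpha.shape)
```

With real marker data, `ratio[4]` would be the quantity to compare against the 99% figure quoted above for the first five eigenvectors.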
5. EXPERIMENT RESULTS

The synthesized facial parameters are evaluated by the Euclidean square error distance E. The experiment shows that the average error distance of the proposed HMM-PSV method is 2.01 [mm] per marker. Fig. 7 shows the actual difference of the facial parameter PCA1 for a training data item /to m e zu/, which is incorrectly decoded as /silB/-/sp/-/[illegible]/-/r/-/i/-/[illegible]/-/h/-/a/-/e/-/z/-/u/-/u:/-/[illegible]/-/silE/. Fig. 8 shows the synthesized facial images. The result shows that HMM-PSV generates facial parameters accurately in the regions where strong coarticulation influence exists.

Figure 6: Facial parameter PCA1 synthesized by HMM-V method (synthesized PCA1 and real PCA1 plotted against frames).

Figure 7: Facial parameter PCA1 synthesized by HMM-PSV method (synthesized PCA1 and real PCA1 plotted against frames).

Figure 8: Facial images synthesized from HMM-PSV.

6. CONCLUSION

We have described an HMM-based facial image synthesis driven by input speech signals. We adopted PCA and morphing, which can represent any facial image as a deformation of 3D NURBS curve surfaces. The face surface is then clustered and controlled by the points that are used by the optical sensor in the measurement. The mapping from the input speech signals is achieved by concatenating the image parameters associated with the HMM states assigned to the speech signals by Viterbi alignment. The generated facial images are precisely synchronized to the input speech signals.
We are now trying to evaluate not only the error distance of the control points but also that of the synthesized facial image.

7. REFERENCES

[1] Morishima, S. and Harashima, H., "A Media Conversion from Speech to Facial Image for Intelligent Man-Machine Interface", IEEE Journal on Selected Areas in Communications, Vol. 9, No. 4, pp. 594-600, 1991.
[2] Lavagetto, F., "Converting Speech into Lip Movements: A Multimedia Telephone for Hard of Hearing People", IEEE Trans. on Rehabilitation Engineering, Vol. 3, No. 1, pp. 90-102, 1995.
[3] Chou, W. and Chen, H., "Speech Recognition for Image Animation and Coding", ICASSP'95, pp. 2253-2256, 1995.
[4] Yamamoto, E., Nakamura, S. and Shikano, K., "Speech-to-Lip Movement Synthesis by HMM", AVSP'97, pp. 137-140, 1997.
[5] Yamamoto, E., Nakamura, S. and Shikano, K., "Speech-to-Lip Movement Synthesis based on EM Algorithm using Audio Visual HMMs", International Conference on Spoken Language Processing '98, Vol. 4, pp. 1275-1278, 1998.
[6] Kuratate, T., Yehia, H. and Vatikiotis-Bateson, E., "Kinematics Based Synthesis of Realistic Talking Faces", AVSP'98, pp. 185-190, 1998.
[7] Morishima, S., "Real-Time [...] by Voice and Its Application to Communication and Entertainment", AVSP'98, pp. 195-199, 1998.