Speech-To-Face Movement Synthesis Based on HMMs


Kiyotsugu Kakihara, Satoshi Nakamura, Kiyohiro Shikano
Graduate School of Information Science, Nara Institute of Science & Technology
8916-5 Takayama, Ikoma, Nara 630-0101, JAPAN
email: {kiyotu-k,nakamura,shikano}@is.aist-nara.ac.jp

ABSTRACT

This paper describes a talking face generation system with natural and communicative reality. If face movements are synthesized well enough for natural communication, human-machine communication will benefit greatly. We have already proposed a speech-driven HMM-based lip movement synthesis, and have also shown that the quality is drastically improved by considering succeeding visemes. This paper describes the extension of our system to full face movement generation, and proposes a method that considers both preceding and succeeding viseme contexts. The experiments show that the proposed method generates natural and accurate talking faces from audio speech inputs.

1. INTRODUCTION

This paper investigates a synthesis method for realizing human-like face movements by mapping acoustic speech signals to a visual parameter sequence. Face movement synthesis from speech signals requires precise synchronization between the input acoustic speech signals and the synthesized facial images. A talking face synchronized with audio speech signals is an indispensable technique for human-like visualized computer agents in interactive communication.

Mapping algorithms from audio speech signals to face movement sequences have been reported based on Vector Quantization (VQ) [1], Artificial Neural Networks [2], and Hidden Markov Models [3, 4, 5]. We consider that the HMM-based method has the advantage that explicit phonetic information is available to handle co-articulation effects caused by the surrounding phoneme context. Other related works have been reported in [6, 7] from acoustic, articulatory, and physiological points of view. Kuratate et al. [6] reported a facial image synthesis system based on range finder data. Morishima [7] reported emotional facial image synthesis based on FACS (Facial Action Coding System).

However, those methods lack cosmetic details (e.g., teeth, eyes, hair), natural movements, and precise synchronization with the input audio signals. The facial images become more precise as the number of 3-D mesh points increases. However, increasing the number of 3-D mesh points makes it difficult to estimate the mapping from audio speech signals to facial images. Thus this paper presents a new method which reduces the control parameters of the 3-D face by applying PCA (Principal Component Analysis) to the face surface position data.

2. SYSTEM OVERVIEW

A block diagram of the facial image synthesis from the input speech signals is illustrated in Fig. 1. This system synthesizes facial images which synchronize with the input audio speech signals, and outputs the synchronous set of image and audio signals. The module (1) is a facial image and audio signal synchronous database. The module (2) parameterizes input audio speech signals and the module (3) maps input audio speech signals to facial parameters based on the audio HMMs. The module (5) outputs 3-D facial images from the facial parameters.

Figure 1: Overview of the system.

3. FACE GENERATION FROM SPEECH

3.1. 3-D Face Model

The 3-D face is modeled by deforming a sphere NURBS curve primitive with 6400 points. One of the poles is located on the throat, and the other pole is located on the body. This model also has teeth, a palate and a tongue (Fig. 2a, b).

Figure 2: Face model. a. Hardware texturing model. b. Wire frame model.

3.2. Data Collection

Speech and image data are synchronously recorded by the OPTOTRAK (Fig. 3a). A facial image is parameterized to 14 three-dimensional positions (half of the face) on the orofacial locations. The position measures are recorded at 100 Hz along with simultaneous recording of the speech at 16 kHz. Since the influence of head motion on the face surface points is considerably large, the absolute head position should be removed. Therefore, the head motion is also measured using 4 markers attached to the head (for marker positions, see Fig. 3b). These 4 standard marker positions make it possible to calculate the relative positions from the origin of the coordinate axis.

Figure 3: a. OPTOTRAK. b. Marker position for recording head and facial motion.

3.3. Analysis

Face position data are recorded and reduced by applying PCA [6]. The recorded face position data of each frame is expressed as a column vector $f$ containing $N \times 3$ entries, where $N$ represents the number of markers on the face and the $x$, $y$ and $z$ values represent the position of each 3-D node. $F$ is a matrix consisting of $K$ frame column vectors $f$:

$$F = [f_1, f_2, \ldots, f_K]. \tag{1}$$

The "silent face" $\mu_f$ is defined as a face with closed mouth. It is subtracted from each column of $F$, generating the matrix $\Delta F$ of facial parameters from the silent face:

$$\Delta F = [\Delta f_1, \Delta f_2, \ldots, \Delta f_K] = F - \mu_f. \tag{2}$$

The principal components of $F$ can be calculated by applying singular value decomposition to the covariance matrix $C_f = \Delta F \, \Delta F^t$, yielding

$$C_f = U S U^t. \tag{3}$$

$U$ is a unitary matrix whose columns contain the eigenvectors of $C_f$. $S$ is a diagonal matrix whose diagonal entries are the respective eigenvalues.

$$U = [u_1, u_2, \ldots, u_K]. \tag{4}$$

3.4. Facial Parameter

Each eigenvalue in $S$ denotes the variance accounted for by the respective eigenvector, and each eigenvector in $U$ is the basis vector belonging to the respective eigenvalue. $U$ holds the principal components, which can express any facial deformation $\Delta f_t$ by

$$\Delta f_t = U \alpha_t, \tag{5}$$

where $\alpha_t$ is the vector of principal component coefficients calculated by

$$\alpha_t = U^t \Delta f_t, \tag{6}$$

where $U$ corresponds to the facial deformations from the silent face and $\alpha_t$ corresponds to the weights for each facial deformation from the silent face. Therefore, each deformation $\Delta f_t$ varies between $\alpha_{t\max}$ and $\alpha_{t\min}$. Since each $\Delta f_t$ has two directions (plus, minus), two deformation faces are needed to express the face corresponding to $f_t$. Finally, the first $L$ column vectors of the principal components with the higher eigenvalues are used as facial parameters.

3.5. Face Synthesis

It is necessary to deform a face from the measured control data points. We apply cluster deformation, which assigns one point to a cluster of the NURBS surface. The deformation weights are determined according to the distance from the control point. Once cluster deformation is applied to the silent face, deformation faces can be generated by weighted linear combination of the PCA faces, which are composed of $\alpha_{t\max}$ and $\alpha_{t\min}$. Fig. 4 shows the PCA faces generated from the speech and facial image synchronous database composed of 216 phonetically balanced Japanese words.

3.6. HMM-based Method for Facial Parameter Generation

The HMM-based method used in this paper is based on mapping input audio parameters to facial parameters through audio HMM states determined by the Viterbi alignment, as in speech recognition. The Viterbi alignment assigns an input frame to the optimal audio HMM state by maximizing the likelihood of the input audio parameters. The HMM-Viterbi method is composed of two processes: a decoding process that converts input audio parameters to the most likely audio HMM state sequence by the Viterbi alignment, and a table look-up process that converts the HMM states to the corresponding facial parameters.
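The decoding process described above can be illustrated with a minimal Viterbi sketch. It is not the system's implementation: it assumes that the per-frame log-likelihood of each state has already been computed (in the paper this would come from the Gaussian mixture HMMs), and the names `viterbi_align`, `log_emit`, `log_trans` and `log_init` are hypothetical.

```python
import math

NEG_INF = float("-inf")

def viterbi_align(log_emit, log_trans, log_init):
    """Return the most likely state index for every frame.

    log_emit[t][s]  : log-likelihood of frame t under state s (assumed given)
    log_trans[r][s] : log-probability of moving from state r to state s
    log_init[s]     : log-probability of starting in state s
    """
    n_states = len(log_init)
    score = [log_init[s] + log_emit[0][s] for s in range(n_states)]
    backptr = []
    for t in range(1, len(log_emit)):
        new_score, ptr = [], []
        for s in range(n_states):
            # Best predecessor of state s at frame t.
            best_r = max(range(n_states), key=lambda r: score[r] + log_trans[r][s])
            ptr.append(best_r)
            new_score.append(score[best_r] + log_trans[best_r][s] + log_emit[t][s])
        score = new_score
        backptr.append(ptr)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(backptr):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path

# Toy two-state left-to-right model: state 1 cannot return to state 0.
log_init = [0.0, NEG_INF]
log_trans = [[math.log(0.5), math.log(0.5)], [NEG_INF, 0.0]]
hi, lo = math.log(0.9), math.log(0.1)
log_emit = [[hi, lo], [hi, lo], [lo, hi], [lo, hi]]
path = viterbi_align(log_emit, log_trans, log_init)
print(path)
```

Each entry of `path` is the state assigned to one input frame, which is the deterministic frame-to-state assignment that the table look-up process then consumes.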

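The parameterization of Sections 3.3 and 3.4 can likewise be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code: it extracts only the first principal component, it replaces the singular value decomposition of Eq. (3) with power iteration to stay within the standard library, and all function names are hypothetical. It also assumes the deformation data are not all zero.

```python
import math

def subtract_silent_face(frames, silent):
    """Eq. (2): deformation of every frame from the silent face."""
    return [[x - m for x, m in zip(frame, silent)] for frame in frames]

def covariance(deltas):
    """C_f = dF dF^t, where the frames are the columns of dF."""
    dim = len(deltas[0])
    return [[sum(d[i] * d[j] for d in deltas) for j in range(dim)]
            for i in range(dim)]

def first_component(cov, iterations=200):
    """Dominant eigenvector of cov by power iteration (stand-in for the SVD)."""
    vec = [1.0] * len(cov)
    for _ in range(iterations):
        nxt = [sum(c * v for c, v in zip(row, vec)) for row in cov]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    return vec

def coefficient(u, delta):
    """Eq. (6) for one component: alpha = u^t delta."""
    return sum(a * b for a, b in zip(u, delta))

def reconstruct(u, alpha):
    """Eq. (5) for one component: delta = u * alpha."""
    return [alpha * x for x in u]

# Toy data: a 2-D "face" that only moves along the direction (0.6, 0.8).
silent = [1.0, 2.0]
frames = [[1.0 + 0.6 * a, 2.0 + 0.8 * a] for a in (-1.0, 0.5, 2.0)]
deltas = subtract_silent_face(frames, silent)
u1 = first_component(covariance(deltas))
alphas = [coefficient(u1, d) for d in deltas]
print(u1, alphas)
```

In this toy case the recovered direction is the one the data were generated along, and the coefficients are the displacements along it; the largest and smallest of them play the role of the maximum and minimum weights that define the two deformation faces of a component.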
Figure 4: PCA faces generated from 216 phonetically balanced Japanese words.

Fig. 5 shows the procedure to construct a look-up table, that is, to train facial parameters for each audio HMM state. All training data can be analyzed into synchronous audio-visual parameter sequences. As the audio parameter sequences can be converted to HMM state sequences, the synchronous facial parameter sequences can be segmented per HMM state. The facial parameters assigned to the same HMM state are averaged to generate a representative facial parameter of the HMM state. The training and synthesis algorithms are as follows:

[Training]
1. Train the acoustic phoneme HMMs using the training audio parameters.
2. Align the audio parameter sequences of the training data into HMM state sequences by forced Viterbi alignment using the given transcriptions.
3. Take an average of the synchronous facial parameters over all frames associated with the same HMM state to generate a look-up table.

[Synthesis]
1. Align an input audio parameter sequence into an HMM state sequence using the Viterbi alignment.
2. Retrieve the output facial parameters associated with each HMM state using the look-up table of facial parameters per HMM state.

Figure 5: Viterbi alignment (assigns an HMM state to each frame deterministically).

3.7. HMM-Viterbi Method with Context Dependency

Notable errors are found at /h/, /a/ and silence at word beginning (Fig. 6), because the face configurations of those phonemes depend on the preceding and succeeding phonemes. To deal with this problem, the paper [4] proposed an HMM method that considers succeeding phonemes, HMM-SV. The HMM-SV method continues to use context-independent models but synthesizes facial parameters with context dependency. The method generates context-dependent facial parameters by looking ahead to the succeeding phonemes in an HMM state sequence. Instead of phoneme contexts, the method uses clustered viseme contexts to reduce computation.

The algorithm to train facial parameters in the HMM-SV method is basically similar to that of the HMM-Viterbi method. The difference between the training algorithms is the way the facial parameters are averaged for each HMM state. The HMM-Viterbi method takes an average of the facial parameters in the same HMM state. The HMM-SV method obtains the average of the facial parameters by the HMM state and the viseme class of the succeeding phoneme.

In this paper, a new method considering preceding and succeeding phonemes, HMM-PSV, is proposed. The visemes taken into account in this paper are those classified into 3 classes by merging the facial parameters associated with the first and end states of 43 phoneme HMMs. Clustering is conducted using the PCA1 facial parameter. The training algorithm of the HMM-PSV method is derived similarly, by taking the average of the facial parameters by the HMM state and the viseme classes of the succeeding and preceding phonemes.

4. FACIAL MOVEMENT SYNTHESIS EXPERIMENTS

Speech and facial image data are synchronously recorded at 100 Hz. A facial image is parameterized at 14 positions in three-dimensional axes on half of the face, including five positions around the outer lip contour. The facial parameters are represented by the principal components. The first five eigenvectors account for more than 99% of the variance observed in the data. These parameters are represented by the weights to PCA1+, PCA1-, ..., PCA5+, PCA5-. Speech signals are parameterized into 12th-order mel-cepstral coefficients, their delta coefficients and delta log power. 16-component mixture Gaussian HMMs are trained for 41 phonemes, pre-word pause and post-word pause. The audio and facial synchronous database is composed of 216 phonetically balanced Japanese words for training and another 100 words for testing.
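The look-up table training and synthesis of Section 3.6, and the context-dependent averaging of Section 3.7, can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the Viterbi alignment is already available as a list of `(phoneme, state index)` pairs, one per frame, and the function names, the `boundary` symbol, and the toy viseme map are all hypothetical. Consecutive frames with the same phoneme label are treated as one phoneme segment.

```python
from collections import defaultdict
from itertools import groupby

def hmm_v_keys(frames):
    """HMM-Viterbi key: the HMM state alone, here (phoneme, state index)."""
    return list(frames)

def hmm_psv_keys(frames, viseme_of, boundary="sil"):
    """HMM-PSV key: the state plus the viseme classes of the preceding
    and succeeding phonemes."""
    # Collapse the frame-level alignment into phoneme segments.
    segments = [(ph, len(list(run))) for ph, run in groupby(frames, key=lambda f: f[0])]
    # Pad with a boundary symbol so every segment has both neighbours.
    padded = [boundary] + [ph for ph, _ in segments] + [boundary]
    keys, pos = [], 0
    for i, (_, length) in enumerate(segments):
        prev_class = viseme_of[padded[i]]
        next_class = viseme_of[padded[i + 2]]
        for frame in frames[pos:pos + length]:
            keys.append((prev_class, frame, next_class))
        pos += length
    return keys

def train_table(keys, faces):
    """Average the facial parameter vectors of all frames sharing a key."""
    sums, counts = {}, defaultdict(int)
    for key, face in zip(keys, faces):
        if key not in sums:
            sums[key] = [0.0] * len(face)
        sums[key] = [a + b for a, b in zip(sums[key], face)]
        counts[key] += 1
    return {k: [v / counts[k] for v in vec] for k, vec in sums.items()}

def synthesize(keys, table):
    """Table look-up: one facial parameter vector per input frame."""
    return [table[k] for k in keys]

# Toy data: one PCA weight per frame and a made-up 3-class viseme map.
viseme_of = {"sil": 0, "a": 1, "m": 2}
frames = [("sil", 0), ("m", 0), ("m", 1), ("a", 0), ("a", 0), ("sil", 0)]
faces = [[0.0], [0.25], [0.5], [1.0], [0.5], [0.25]]

v_table = train_table(hmm_v_keys(frames), faces)
psv_keys = hmm_psv_keys(frames, viseme_of)
psv_table = train_table(psv_keys, faces)
print(v_table[("sil", 0)], psv_table[(0, ("sil", 0), 2)], psv_table[(1, ("sil", 0), 0)])
```

With the state-only key, the word-initial and word-final silence frames are averaged into a single entry, whereas the context-dependent key keeps them apart. Dropping the preceding class from the key gives the HMM-SV variant. A key never seen in training raises `KeyError` in this sketch; how the original system handles unseen contexts is not stated in the text.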

5. EXPERIMENT RESULTS

The synthesized facial parameters are evaluated by the Euclidean square error distance E. The experiment shows that the average error distance of the proposed HMM-PSV is 2.01 [mm] per marker. Fig. 7 shows the actual difference of the facial parameter PCA1 for a training data /to m e zu/ which is incorrectly decoded as /silB/-/sp/-/?/-/r/-/i/-/?/-/h/-/a/-/e/-/z/-/u/-/u:/-/?/-/silE/. Fig. 8 shows the synthesized facial images. The result shows that the HMM-PSV achieves accurate generation of facial parameters in the region where a strong coarticulation influence exists.

Figure 6: Facial parameter PCA1 synthesized by HMM-V method. (Curves: synthesized PCA1, real PCA1.)

Figure 7: Facial parameter PCA1 synthesized by HMM-PSV method. (Curves: synthesized PCA1, real PCA1.)

Figure 8: Facial images synthesized from HMM-PSV.

6. CONCLUSION

We have described an HMM-based facial image synthesis driven by input speech signals. We adopted PCA and morphing, which can represent any facial image as a deformation of 3-D NURBS curve surfaces. The face surface is then clustered and controlled by the points that are used by the optical sensor in the measurement. The mapping from the input speech signals is achieved by concatenating the image parameters associated with the HMM states that the Viterbi alignment assigns to the speech signals. The generated facial images are precisely synchronized to the input speech signals. We are now trying to evaluate not only the error distance of the control points but also that of the synthesized facial image.

7. REFERENCES

[1] Morishima, S. and Harashima, H., "A Media Conversion from Speech to Facial Image for Intelligent Man-Machine Interface", IEEE Journal on Selected Areas in Communications, Vol. 9, No. 4, pp. 594-600, 1991.
[2] Lavagetto, F., "Converting Speech into Lip Movements: A Multimedia Telephone for Hard of Hearing People", IEEE Trans. on Rehabilitation Engineering, Vol. 3, No. 1, pp. 90-102, 1995.
[3] Chou, W. and Chen, H., "Speech Recognition for Image Animation and Coding", ICASSP'95, pp. 2253-2256, 1995.
[4] Yamamoto, E., Nakamura, S. and Shikano, K., "Speech-to-Lip Movement Synthesis by HMM", AVSP'97, pp. 137-140, 1997.
[5] Yamamoto, E., Nakamura, S. and Shikano, K., "Speech-to-Lip Movement Synthesis based on EM Algorithm using Audio Visual HMMs", International Conference on Spoken Language Processing '98, Vol. 4, pp. 1275-1278, 1998.
[6] Kuratate, T., Hani, Y. and Eric V., "Kinematics Based Synthesis of Realistic Talking Faces", AVSP'98, pp. 185-190, 1998.
[7] Morishima, S., "Real-Time Talking Head Driven by Voice and Its Application to Communication and Entertainment", AVSP'98, pp. 195-199, 1998.
