Method - Tohoku University Institutional Repository TOUR

I use $u \in \{1, \dots, U\}$ to index users, $i_A \in \{1, \dots, I_A\}$ to index items belonging to domain A, and $i_B \in \{1, \dots, I_B\}$ to index items belonging to domain B. In this work, I consider learning from implicit feedback. The user-by-item interaction matrix is the click¹ matrix $X \in \mathbb{N}^{U \times I}$. The lower case $x_u = [x_{u1}, x_{u2}, \dots, x_{uI}]^T \in \mathbb{N}^{I}$ is a bag-of-words vector that holds the number of clicks on each item by user $u$. With two domains, I have the matrix $X_A \in \mathbb{N}^{U \times I_A}$ with $x_A = [x_{A1}, x_{A2}, \dots, x_{AI_A}]^T \in \mathbb{N}^{I_A}$ for domain A, and $X_B \in \mathbb{N}^{U \times I_B}$ with $x_B = [x_{B1}, x_{B2}, \dots, x_{BI_B}]^T \in \mathbb{N}^{I_B}$ for domain B. For simplicity, I binarize the click matrix, meaning that $x_{ui} = 1$ if user $u$ has clicked on item $i$ and $x_{ui} = 0$ otherwise. Also, the 0 entries can be regarded as missing values in $X$ and can be generated through my framework. It is straightforward to extend the approach to general count data.
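The binarization step can be illustrated with a minimal sketch. It assumes NumPy and a small made-up count matrix (`X_counts` is a hypothetical name and the values are not taken from any dataset used in this work).

```python
import numpy as np

# Hypothetical toy click-count matrix: 3 users (rows) x 4 items (columns).
X_counts = np.array([
    [2, 0, 1, 0],
    [0, 0, 0, 5],
    [1, 3, 0, 0],
])

# Binarize: x_ui = 1 if user u clicked item i at least once, else 0.
X = (X_counts > 0).astype(np.int64)

# x_u is the row of X that belongs to user u (here u = 0).
x_u = X[0]
```

Keeping `X_counts` instead of `X` would correspond to the general count-data setting mentioned above.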

¹ I use the verb "click" for concreteness. In fact, this can be any type of interaction, such as "watch," "view," or "rating."

5.3. Method

5.3.1 Framework

My framework, as presented in Figure 5.1, is based on the variational autoencoder (VAE) [24] and the generative adversarial network (GAN) [14]. In my model, the VAEs are mainly responsible for extracting latent features of the input, whereas the GANs specifically classify a user's real interaction vector against a generated vector, which supports the VAE networks. The GANs are used only in the training phase.

D2D-TM comprises six subnetworks: two domain click vector encoders $E_A$ and $E_B$, two domain click vector generators $G_A$ and $G_B$, and two domain adversarial discriminators $D_A$ and $D_B$. I maintain the framework structure explained in an earlier study [32]. In addition, I share the weights of the last few layers of $E_A$ and $E_B$, so that my model not only extracts the different characteristics of the two domains in the first layers but also learns their similarities. In parallel, I share the weights of the first few layers of $G_A$ and $G_B$ so that my model can generate both similar and divergent features. In Figure 5.1, shared layers are denoted as S, whereas distinct layers are denoted as D.

In training, the D layers of each encoder first extract high-level representations from the user interaction vectors of domains A and B. These features then pass through the weight-shared S layers, under the assumption that a user behaves somewhat consistently across domains.

Furthermore, I obtain the latent vectors $z_A$ and $z_B$, which are used not only for reconstructing the interaction vectors but also for generating interactions for the opposite domain.

To generate interactions for domain B, the latent vector $z_A$ of domain A is first decoded by the S layers and then passed through the D layers of $G_B$. Finally, the GAN discriminator is used to detect which vectors were generated from the other domain.
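The layout of shared (S) and distinct (D) layers described above can be sketched as follows. This is only an illustration under stated assumptions: NumPy, one D layer and one S layer per network, random toy weights, and made-up sizes (`I_A`, `I_B`, `H`, `K`). The actual depth and layer widths of D2D-TM are not specified here. For determinism, the sketch decodes the mean instead of a sampled latent code.

```python
import numpy as np

rng = np.random.default_rng(0)

I_A, I_B, H, K = 6, 5, 8, 3  # hypothetical item counts, hidden size, latent size

def dense(n_in, n_out):
    """Random weight matrix for one fully connected layer (toy initialization)."""
    return rng.normal(scale=0.1, size=(n_in, n_out))

# Distinct (D) encoder layers, one per domain.
W_enc_A, W_enc_B = dense(I_A, H), dense(I_B, H)
# Shared (S) encoder layers: the same arrays are used by E_A and E_B.
W_mu, W_logvar = dense(H, K), dense(H, K)
# Shared (S) first generator layer, followed by distinct (D) output layers.
W_gen_S = dense(K, H)
W_gen_A, W_gen_B = dense(H, I_A), dense(H, I_B)

def encode(x, W_D):
    h = np.tanh(x @ W_D)            # domain-specific D layer
    return h @ W_mu, h @ W_logvar   # shared S layers -> mean and log-variance

def generate(z, W_D):
    h = np.tanh(z @ W_gen_S)        # shared S layer
    return h @ W_D                  # domain-specific D layer -> item logits

x_A = np.array([1.0, 0.0, 1.0, 0.0, 0.0, 1.0])
mu_A, logvar_A = encode(x_A, W_enc_A)
logits_AA = generate(mu_A, W_gen_A)  # reconstruction in domain A
logits_AB = generate(mu_A, W_gen_B)  # translation to domain B
```

Because `W_mu`, `W_logvar`, and `W_gen_S` are reused by both domains, any update to them affects both VAEs, which is the effect weight sharing is meant to have.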

5.3.2 VAE

A VAE includes two processes: an encoder that maps an input $x$ to a latent representation $z$, and a generator that maps $z$ back to a reconstruction $x_{rec}$: $z \sim q(z|x)$ and $x_{rec} \sim p(x|z)$, where $q(z|x)$ and $p(x|z)$ are two conditional distributions.

In a deep learning network, to make training with back-propagation possible, a reparameterization trick [24] is applied to express the random variable $z$ as a deterministic variable $z = \mu + \sigma \odot e$, where $\mu$ is a mean vector and $\sigma$ is the vector of standard deviations, that is, the square roots of the diagonal components of the covariance matrix. Both $\mu$ and $\sigma$ are outputs of the encoder network with input $x$, denoted by $E(x)$. Also, $\odot$ signifies an element-wise product, and $e$ is generated from a Gaussian distribution $N(0, I)$ with $I$ as the identity matrix. Finally, $x_{rec}$ is the output of the generator network with input $z$: $x_{rec} = G(z)$.
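The reparameterization trick can be written in a few lines. The following is a minimal sketch assuming NumPy; the function name `reparameterize` and the example values are illustrative only.

```python
import numpy as np

def reparameterize(mu, sigma, rng):
    """Draw z = mu + sigma * e (element-wise) with e ~ N(0, I)."""
    e = rng.standard_normal(mu.shape)
    return mu + sigma * e

rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0, 2.0])
sigma = np.array([0.1, 0.2, 0.3])
z = reparameterize(mu, sigma, rng)

# With sigma = 0 the sample collapses to the mean.
z_det = reparameterize(mu, np.zeros_like(mu), rng)
```

All randomness sits in `e`, so `z` is a deterministic, differentiable function of `mu` and `sigma`, which is what allows gradients to flow back into the encoder.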

VAE training aims to minimize a variational upper bound on the negative log-likelihood, which is

$$
L = \mathrm{KL}(q(z|x) \,\|\, p(z)) - \mathbb{E}_{q(z|x)}[\log p(x|z)] = L_{KL} + L_{rec}, \tag{5.1}
$$

with $L_{KL} = \mathrm{KL}(q(z|x) \,\|\, p(z))$ and $L_{rec} = -\mathbb{E}_{q(z|x)}[\log p(x|z)]$, where $\mathrm{KL}$ is the Kullback-Leibler divergence.

Chapter 5. Domain-to-Domain Translation Model [36]

In my model, the encoder-generator pair $\{E_A, G_A\}$ constitutes a VAE for domain A, termed $\mathrm{VAE}_A$. As explained above, the latent code $z_A$, which is generated from $q_A(z_A|x_A)$, is given as $\mu_A + \sigma_A \odot e$, with $\mu_A$ and $\sigma_A$ as the outputs of the encoder network $E_A$. In this case, both $q_A(z_A|x_A)$ and $p(z_A)$ are Gaussian distributions. Therefore,

$$
L_{KL_A} = \mathrm{KL}(q_A(z_A|x_A) \,\|\, p(z_A)) = -\frac{1}{2} \sum_{k=1}^{K} \left(1 + \log(\sigma_{Ak}^2) - \mu_{Ak}^2 - \sigma_{Ak}^2\right),
$$
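The closed-form KL term can be checked numerically. The sketch below assumes NumPy and a standard normal prior $p(z) = N(0, I)$, for which the KL divergence of a diagonal Gaussian is $-\frac{1}{2}\sum_k (1 + \log \sigma_k^2 - \mu_k^2 - \sigma_k^2)$. The function name is hypothetical.

```python
import numpy as np

def kl_to_standard_normal(mu, sigma):
    """KL(N(mu, diag(sigma^2)) || N(0, I)), summed over latent dimensions."""
    return -0.5 * np.sum(1.0 + np.log(sigma**2) - mu**2 - sigma**2)

# The divergence vanishes when the posterior equals the prior.
kl_zero = kl_to_standard_normal(np.zeros(4), np.ones(4))

# Shifting one mean by 1 (unit variance) costs 0.5 nats.
kl_shift = kl_to_standard_normal(np.array([1.0, 0.0]), np.ones(2))
```

In practice, encoders often output $\log \sigma^2$ rather than $\sigma$ for numerical stability; that variant only changes how the same expression is evaluated.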

with $K$ as the dimension of $z$, and where $\sigma_{Ak}$ and $\mu_{Ak}$ respectively represent the elements of the vectors $\sigma_A$ and $\mu_A$. Then, I generate the vector $x_{AA}$ from the conditional distribution $p_{G_A}(x_A|z_A)$, which means that $x_{AA}$ is a reconstruction of the input click vector $x_A$ through the generator network $G_A$ with input $z_A$:

$$
x_{AA} \sim p_{G_A}(x_A|z_A).
$$

Assume that the click vector of user $u$ for domain A is $x_A = [x_{A1}, \dots, x_{AI_A}]^T$ and that the total number of clicks is $N_A$, so that $\sum_{i=1}^{I_A} x_{Ai} = N_A$. Moreover, let $\pi_A = f(G_A(z_A))$, where $f(\cdot)$ is the softmax function, so that $\sum_{i=1}^{I_A} \pi_{Ai} = 1$. The reconstruction vector $x_{AA}$ of this user can then be regarded as a sample from the multinomial distribution $\mathrm{Mult}(N_A, \pi_A)$, that is, $p_{G_A}(x_A|z_A) = \mathrm{Mult}(N_A, \pi_A)$. Therefore, the reconstruction loss for $x_{AA}$ is

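$$
\begin{aligned}
L_{rec_A} &= -\mathbb{E}_{z_A \sim q_A(z_A|x_A)}[\log p_{G_A}(x_A|z_A)] \\
&= -\mathbb{E}_{z_A \sim q_A(z_A|x_A)}\left[\sum_{i=1}^{I_A} x_{Ai} \log \pi_{Ai}\right].
\end{aligned}
$$

Hereinafter, I also use a multinomial distribution for $p_{G_B}$.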
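For a single sampled latent code, the multinomial reconstruction loss reduces to a weighted sum of log-softmax values. The following minimal sketch assumes NumPy and treats the generator output as a vector of logits; the names and toy values are illustrative, and the constant multinomial coefficient (which does not depend on the model) is omitted.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log of the softmax probabilities."""
    shifted = logits - np.max(logits)
    return shifted - np.log(np.sum(np.exp(shifted)))

def multinomial_rec_loss(x, logits):
    """-sum_i x_i * log(pi_i) with pi = softmax(logits)."""
    return -np.sum(x * log_softmax(logits))

x_A = np.array([1.0, 0.0, 1.0, 0.0])  # two clicked items out of four

# Uniform probabilities: each clicked item contributes log(4).
loss_uniform = multinomial_rec_loss(x_A, np.zeros(4))

# Logits that favor the clicked items give a lower loss.
loss_good = multinomial_rec_loss(x_A, np.array([5.0, -5.0, 5.0, -5.0]))
```

The loss rewards probability mass placed on clicked items; unclicked items influence it only through the softmax normalization.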

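The loss of $\mathrm{VAE}_A$ is then

$$
L_{VAE_A} = \lambda_1 L_{KL_A} + \lambda_2 L_{rec_A}. \tag{5.2}
$$

The hyperparameters $\lambda_1$ and $\lambda_2$ control the weights of the KL divergence term and the reconstruction term. The KL divergence term penalizes deviation of the latent code from the prior distribution.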

Similarly, $\{E_B, G_B\}$ constitutes a VAE for domain B. The latent code $z_B$, which is generated from $q_B(z_B|x_B)$, is given as $\mu_B + \sigma_B \odot e$. The reconstructed click vector is $x_{BB} \sim p_{G_B}(x_B|z_B)$. In addition,

$$
\begin{aligned}
L_{VAE_B} &= \lambda_1 L_{KL_B} + \lambda_2 L_{rec_B} \\
&= \lambda_1 \mathrm{KL}(q_B(z_B|x_B) \,\|\, p(z_B)) - \lambda_2 \mathbb{E}_{z_B \sim q_B(z_B|x_B)}[\log p_{G_B}(x_B|z_B)].
\end{aligned} \tag{5.3}
$$

5.3.3 Domain Cycle-Consistency (CC) and Weight-Sharing

I can translate a click vector $x_A$ in domain A to a click vector in domain B by applying $p_{G_B}(x_B|z_A)$; the result is termed $x_{AB}$. Similarly, the click vector $x_{BA}$, translated from domain B to domain A, is generated from $p_{G_A}(x_A|z_B)$.

To ensure that $x_{AB} \approx x_B$ and $x_{BA} \approx x_A$, I first enforce a weight-sharing constraint relating the two VAEs. Specifically, I share the weights of the last few layers of $E_A$ and $E_B$, which are responsible for extracting high-level representations of the input click vectors in the two domains. In parallel, I share the weights of the first few layers of $G_A$ and $G_B$, which are responsible for decoding the high-level representations used to reconstruct the input click vectors. Weight sharing is commonly used in parallel architectures in which two networks are trained simultaneously. In my case, weight sharing not only helps my model converge better but also helps the encoders extract features common to the two domains. Moreover, because neurons corresponding to the same features are triggered in various scenarios, weight sharing can improve the generative ability of my model.

However, weight sharing alone does not guarantee that the two domains are matched. I therefore propose a domain cycle consistency with two cycles to constrain the representations between the two domains. Cycle consistency is a way of using transitivity to supervise CNN training and is applied in many image-to-image translation papers [59, 32]. This loss pushes the encoders and decoders to be consistent with each other. In detail, for domain A, I first constrain $x_{AB}$, which is generated from $x_A$, to be close to $x_B$:

$$
x_{AB} \sim p_{G_B}(x_B|z_A).
$$

Then, I re-map $x_{AB}$ to domain A and compel the result to be close to $x_A$:

$$
x_{ABA} \sim p_{G_A}(x_A|z_{AB}), \quad \text{where } z_{AB} \sim q_B(z_{AB}|x_{AB}).
$$

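Using the same encoder and decoder networks as the VAEs, I apply the VAE loss function to the domain cycle consistency as

$$
\begin{aligned}
L_{CC_A} &= \lambda_3 L_{rec_{AB}} + \lambda_4 L_{KL_{AB}} + \lambda_3 L_{rec_{ABA}} \\
&= -\lambda_3 \mathbb{E}_{z_A \sim q_A(z_A|x_A)}[\log p_{G_B}(x_B|z_A)] + \lambda_4 \mathrm{KL}(q_B(z_{AB}|x_{AB}) \,\|\, p(z_{AB})) \\
&\quad - \lambda_3 \mathbb{E}_{z_{AB} \sim q_B(z_{AB}|x_{AB})}[\log p_{G_A}(x_A|z_{AB})].
\end{aligned} \tag{5.4}
$$

As in the VAE loss, the hyperparameters $\lambda_3$ and $\lambda_4$ control the weights of the reconstruction terms and the KL term.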
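The three terms of the cycle-consistency loss for domain A can be sketched with toy networks. This is only an illustration under several assumptions that are not part of the model description: NumPy, single-layer linear encoders and generators with random weights, a standard normal prior, one Monte Carlo sample per expectation, and the softmax probabilities of $G_B$ used directly as the translated vector $x_{AB}$ that is fed back into $E_B$.

```python
import numpy as np

rng = np.random.default_rng(0)
I_A, I_B, K = 5, 4, 3  # hypothetical item counts and latent size

def log_softmax(v):
    s = v - np.max(v)
    return s - np.log(np.sum(np.exp(s)))

# Toy linear encoders (mean head, log-variance head) and generators per domain.
enc = {d: (rng.normal(size=(n, K)) * 0.1, rng.normal(size=(n, K)) * 0.1)
       for d, n in (("A", I_A), ("B", I_B))}
gen = {d: rng.normal(size=(K, n)) * 0.1 for d, n in (("A", I_A), ("B", I_B))}

def encode(x, d):
    W_mu, W_logvar = enc[d]
    return x @ W_mu, x @ W_logvar

def sample(mu, logvar):
    return mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)

def rec_loss(x, z, d):
    return -np.sum(x * log_softmax(z @ gen[d]))

def kl(mu, logvar):
    return -0.5 * np.sum(1.0 + logvar - mu**2 - np.exp(logvar))

def cc_loss_A(x_A, x_B, lam3=1.0, lam4=1.0):
    z_A = sample(*encode(x_A, "A"))
    pi_AB = np.exp(log_softmax(z_A @ gen["B"]))  # translated vector x_AB (as probabilities)
    mu_AB, logvar_AB = encode(pi_AB, "B")        # re-encode with E_B
    z_AB = sample(mu_AB, logvar_AB)
    return (lam3 * rec_loss(x_B, z_A, "B")       # x_AB should be close to x_B
            + lam4 * kl(mu_AB, logvar_AB)        # KL term on z_AB
            + lam3 * rec_loss(x_A, z_AB, "A"))   # x_ABA should be close to x_A

x_A = np.array([1.0, 0.0, 1.0, 0.0, 1.0])
x_B = np.array([0.0, 1.0, 1.0, 0.0])
loss = cc_loss_A(x_A, x_B)
```

The cycle for domain B follows by swapping the roles of A and B in `cc_loss_A`.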

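As with domain A, for domain B I have

$$
x_{BA} \sim p_{G_A}(x_A|z_B), \qquad x_{BAB} \sim p_{G_B}(x_B|z_{BA}), \quad \text{where } z_{BA} \sim q_A(z_{BA}|x_{BA}).
$$

The cycle-consistency loss of domain B is

$$
\begin{aligned}
L_{CC_B} &= \lambda_3 L_{rec_{BA}} + \lambda_4 L_{KL_{BA}} + \lambda_3 L_{rec_{BAB}} \\
&= -\lambda_3 \mathbb{E}_{z_B \sim q_B(z_B|x_B)}[\log p_{G_A}(x_A|z_B)] + \lambda_4 \mathrm{KL}(q_A(z_{BA}|x_{BA}) \,\|\, p(z_{BA})) \\
&\quad - \lambda_3 \mathbb{E}_{z_{BA} \sim q_A(z_{BA}|x_{BA})}[\log p_{G_B}(x_B|z_{BA})].
\end{aligned} \tag{5.5}
$$

5.3.4 Generative Adversarial Network (GAN)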

A GAN generally includes two processes: a generator and a discriminator. Whereas the discriminator functions to distinguish real data from generated data, the generator is designed to generate fake data that resemble the real data. This competition drives both processes to improve their networks until the counterfeits are indistinguishable from the genuine articles [14]. In my model, the VAE with cycle consistency works as the generator process. I have two GANs: $\mathrm{GAN}_A = \{\mathrm{VAE}_A, D_A\}$ and $\mathrm{GAN}_B = \{\mathrm{VAE}_B, D_B\}$.
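The competition between the two processes can be made concrete with the standard GAN value function of [14], which the discriminator maximizes. The sketch below assumes NumPy and a discriminator that outputs logits passed through a sigmoid; `lam0` is a weight on the adversarial term, and the function name and toy inputs are hypothetical.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def gan_value(d_real_logits, d_fake_logits, lam0=1.0):
    """lam0 * (mean log D(real) + mean log(1 - D(fake)))."""
    d_real = sigmoid(d_real_logits)
    d_fake = sigmoid(d_fake_logits)
    return lam0 * (np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake)))

# A discriminator that cannot tell real from generated outputs 0.5 everywhere.
v_confused = gan_value(np.zeros(8), np.zeros(8))

# A discriminator that separates them well scores closer to 0.
v_sharp = gan_value(np.full(8, 4.0), np.full(8, -4.0))
```

The discriminator is trained to push this value up, while the generator is trained to push the second term down by making generated click vectors look real.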

For domain A, the VAE has three outputs:

$$
x_{AA} \sim p_{G_A}(x_A|z_A), \qquad x_{BA} \sim p_{G_A}(x_A|z_B), \qquad x_{ABA} \sim p_{G_A}(x_A|z_{AB}).
$$

However, I mainly emphasize resampling a click vector from domain B into domain A. The discriminator is used to distinguish the generated click vector $x_{BA}$ from the real click vector $x_A$. Optimizing the GAN for domain A then yields

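$$
L_{GAN_A}(E_B, G_A, D_A) = \lambda_0 \mathbb{E}_{x_A \sim P_A}[\log D_A(x_A)] + \lambda_0 \mathbb{E}_{z_B \sim q_B(z_B|x_B)}[\log(1 - D_A(x_{BA}))]. \tag{5.6}
$$

As with domain A, I try to distinguish the generated click vector $x_{AB}$ from the real vector $x_B$. The GAN loss for domain B is then

$$
L_{GAN_B}(E_A, G_B, D_B) = \lambda_0 \mathbb{E}_{x_B \sim P_B}[\log D_B(x_B)] + \lambda_0 \mathbb{E}_{z_A \sim q_A(z_A|x_A)}[\log(1 - D_B(x_{AB}))]. \tag{5.7}
$$

5.3.5 Learning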

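I solve the learning problems of $\mathrm{VAE}_A$, $\mathrm{VAE}_B$, $\mathrm{CC}_A$, $\mathrm{CC}_B$, $\mathrm{GAN}_A$, and $\mathrm{GAN}_B$ jointly as

$$
\begin{aligned}
\min_{E_A, E_B, G_A, G_B} \; \max_{D_A, D_B} \; \big[ & L_{VAE_A}(E_A, G_A) + L_{GAN_A}(E_B, G_A, D_A) + L_{CC_A}(E_A, G_A, E_B, G_B) \\
& + L_{VAE_B}(E_B, G_B) + L_{GAN_B}(E_A, G_B, D_B) + L_{CC_B}(E_B, G_B, E_A, G_A) \big],
\end{aligned} \tag{5.8}
$$

where $L_{VAE_A}$, $L_{VAE_B}$, $L_{GAN_A}$, and $L_{GAN_B}$ are defined respectively in (5.2), (5.3), (5.6), and (5.7), and $L_{CC_A}$ and $L_{CC_B}$ are defined in (5.4) and (5.5).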

First, I pre-train $\mathrm{VAE}_A$ and $\mathrm{VAE}_B$ separately to extract the representations of the two domains. A GAN works as a competition between the generator and the discriminator: the generator tries to make a generated vector resemble a real vector, while the discriminator attempts to tell them apart. I therefore optimize the generator and the discriminator sequentially. The training process is summarized as follows:

1. Minimize the generator loss

   $$
   L_{gen} = L_{VAE_A} + L_{VAE_B} + L_{CC_A} + L_{CC_B} + \log(1 - D_A(x_{BA})) + \log(1 - D_B(x_{AB})).
   $$

2. Maximize $L_{GAN_A}$ and $L_{GAN_B}$ separately.

3. Repeat Steps 1 and 2 until convergence.
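The control flow of the three steps above can be sketched as a plain alternating loop. This shows only the schedule, not the optimizers: `generator_step` and `discriminator_step` are hypothetical callables that would each perform one update, and the toy stand-ins below merely simulate a decreasing generator loss. The convergence test on the change in $L_{gen}$ is also an assumption, since the stopping rule is not specified here.

```python
def alternate_training(generator_step, discriminator_step, max_iters=1000, tol=1e-6):
    """Alternate Step 1 and Step 2 until L_gen changes by less than tol."""
    history = []
    for _ in range(max_iters):
        l_gen = generator_step()   # Step 1: minimize L_gen w.r.t. E_A, E_B, G_A, G_B
        discriminator_step()       # Step 2: maximize L_GAN_A and L_GAN_B w.r.t. D_A, D_B
        history.append(l_gen)
        if len(history) > 1 and abs(history[-2] - history[-1]) < tol:
            break                  # Step 3: stop at (approximate) convergence
    return history

# Toy stand-ins (hypothetical): the generator loss halves on every call.
state = {"l_gen": 1.0, "d_updates": 0}

def toy_generator_step():
    state["l_gen"] *= 0.5
    return state["l_gen"]

def toy_discriminator_step():
    state["d_updates"] += 1

history = alternate_training(toy_generator_step, toy_discriminator_step)
```

In a real implementation, each step would hold the other player's parameters fixed while taking one or more gradient updates.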

5.3.6 Predict For Cross-Domain

• From domain A to domain B: Here I assume that user $u$ has clicked only some items in domain A and has had no interaction with any item in domain B, so that only a click history vector $x_A$ is available. I want to recommend items in domain B to this user by generating a vector $x_{AB}$ in which a higher probability indicates an item of greater interest to the user. First, the encoder $E_A$ extracts the high-level features of $x_A$ as $z_A \sim q_A(z_A|x_A)$. Then $z_A$ is passed through the domain B weights of the generator: $x_{AB} \sim p_{G_B}(x_B|z_A)$.

• From domain B to domain A: Similarly, with a click history vector $x_B$ of user $u$ in domain B, I predict the click vector $x_{BA}$ in domain A as $x_{BA} \sim p_{G_A}(x_A|z_B)$ with $z_B \sim q_B(z_B|x_B)$.
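Once the generator output for the target domain is available, the recommendation itself is a ranking by predicted probability. The sketch below assumes NumPy; the logits stand in for a hypothetical output $G_B(z_A)$ of one user, and `recommend_top_k` is an illustrative helper rather than part of the model.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))
    return e / e.sum()

def recommend_top_k(pi_AB, k):
    """Indices of the k items in domain B with the highest predicted probability."""
    return np.argsort(-pi_AB, kind="stable")[:k]

# Hypothetical generator output G_B(z_A) for one user over 5 items of domain B.
logits_AB = np.array([0.1, 2.0, -1.0, 1.5, 0.3])
pi_AB = softmax(logits_AB)
top2 = recommend_top_k(pi_AB, k=2)
```

The same ranking applies in the opposite direction by using the probabilities produced by $G_A$ from $z_B$.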

5.4. Experiments
