$\tilde{\theta}_{d3})^{\top}$, where $\tilde{\theta}_{dl}$ is assumed to follow a Dirichlet prior, $\tilde{\theta}_{dl} \sim \mathrm{Dirichlet}(\tilde{\theta}_0)$. Therefore, the number of dimensions of $\theta_d$, that is, the total number of topics, is $K_1 + K_2 + K_3$, where $K_1$, $K_2$, and $K_3$ are the numbers of negative, neutral, and positive topics, respectively. In this study, unlike LDA2vec, the topic assignments are drawn from a topic distribution that takes the polarity of the document into account.
The proposed model uses the word vectors, the topic vectors, and the topic assignments of the words to construct context vectors that represent what each word means in its context. The context vector of word $i$ is the sum of its word vector and the topic vector corresponding to its topic assignment,
$$\vec{c}_i = \vec{w}_i + \vec{t}_k, \quad \text{if } z_i = k.$$
This formulation stems from the idea in natural language processing that the meaning of a word is based on its original meaning but can shift with the context of the document or sentence in which the word is used. The formulation of the context vectors in this study is almost the same as that of Moody's (2016) LDA2vec.
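As a minimal illustration of this construction, the following Python sketch adds the topic vector selected by the assignment $z_i$ to the word vector. The function name `context_vector` and the three-dimensional toy vectors are hypothetical and are not taken from the study.

```python
def context_vector(word_vec, topic_vecs, z_i):
    """Return c_i = w_i + t_k, where k = z_i is the topic assigned to word i."""
    topic_vec = topic_vecs[z_i]
    return [w + t for w, t in zip(word_vec, topic_vec)]


# Toy, made-up embeddings (illustrative values only).
word_vec = [0.5, -1.0, 2.0]
topic_vecs = [
    [1.0, 0.0, 0.0],   # topic 0
    [0.0, 0.5, -0.5],  # topic 1
]

c_i = context_vector(word_vec, topic_vecs, z_i=1)
print(c_i)
```

The same word therefore receives a different context vector whenever its topic assignment changes, which is how the model lets the meaning of a word shift with its context.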
In the text generation part, the proposed model uses the surrounding words to define the probability of generating the focal word. We define $S_i$ as a multiset that contains the surrounding words of word $i$ in the dataset. Let $I_P = \{(i,j) \mid j \in S_i\}$ and $I_N = \{(i,j) \mid j \notin S_i\}$ be the positive and negative multisets, respectively, and let $I_D = I_P \cup I_N$ be the multiset over the total vocabulary. Then, we define $G = \{g_{ij} \mid (i,j) \in I_D\}$, where $g_{ij}$ is a random variable that takes the value $1$ if $(i,j) \in I_P$ and $-1$ if $(i,j) \in I_N$. In the proposed model, this random variable, which indicates whether word $i$ and word $j$ are in the same window, is assumed to follow a binomial logit model in which the probability is the standard sigmoid (or logistic) function of the inner product of the context vector and the surrounding word vector,
$$p(g_{ij} \mid \vec{w}_i, \vec{w}_j, \{\vec{t}_k\}, z_i) = \sigma(g_{ij} \cdot \vec{c}_i^{\top} \vec{w}_j),$$
where $\sigma(x) = 1/(1+\exp(-x))$. Because the explanatory variable of the logit model, $\vec{c}_i$, is itself a latent variable, this formulation can be seen as a factor model in which $\vec{c}_i$ gives the factor scores of word $i$ and $\vec{w}_j$ gives the factor loadings of word $j$. However, this factor model has the constraint that the factor score $\vec{c}_i$ decomposes into the word vector and the corresponding topic vector, $\vec{c}_i = \vec{w}_i + \vec{t}_k$.
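The binomial logit model for $g_{ij}$ can be sketched in a few lines of Python. This is only an illustration under assumed toy values; the helper names `sigmoid` and `cooccurrence_prob` are hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function 1 / (1 + exp(-x))."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def cooccurrence_prob(g_ij, context_vec, word_vec_j):
    """p(g_ij | ...) = sigmoid(g_ij * <c_i, w_j>), with g_ij in {+1, -1}."""
    inner = sum(c * w for c, w in zip(context_vec, word_vec_j))
    return sigmoid(g_ij * inner)


# Toy vectors (illustrative values only).
c_i = [0.5, -0.5, 1.5]
w_j = [1.0, 1.0, 1.0]

p_same_window = cooccurrence_prob(+1, c_i, w_j)
p_other_window = cooccurrence_prob(-1, c_i, w_j)
print(p_same_window, p_other_window)
```

Because $\sigma(-x) = 1 - \sigma(x)$, the two probabilities sum to one, so coding $g_{ij}$ as $\pm 1$ is equivalent to the usual 0/1 coding of a logit model.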
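Therefore, the likelihood of the word embedding part is provided as follows:
$$p(G, \{\vec{w}_i\}, \{\vec{t}_k\}, Z \mid \tilde{\theta}, \pi) = \prod_{i=1}^{V} \left( p(z_i \mid \tilde{\theta}_{d_i}, \pi_{d_i}) \prod_{j=1,\, j \neq i}^{V} \sigma(g_{ij} \cdot \vec{c}_i^{\top} \vec{w}_j) \right), \tag{5.1}$$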
where $V$ is the total number of words in the vocabulary of the corpus. In the proposed model, the embedding vectors are trained to maximize this probability, which depends on the inner product of the context vector and the surrounding word vector, so that words that often appear in the same window in the dataset obtain similar vectors. We can therefore obtain vector representations that take into account the co-occurrence of words within the window, that is, the context, whereas LDA treats the co-occurrence of words uniformly across a document.
However, the computational cost of the likelihood (5.1) is high, because the vocabulary usually contains thousands of words or more and the cost grows with the square of its size. In this study, following the approach proposed by Mikolov et al. (2013), we apply negative sampling to approximate the likelihood (5.1) with the following computable formulation:
$$p(G, \{\vec{w}_i\}, \{\vec{t}_k\}, Z \mid \tilde{\theta}, \pi) \approx \prod_{i=1}^{V} \left( p(z_i \mid \tilde{\theta}_{d_i}, \pi_{d_i}) \prod_{j \in S_i} \sigma(\vec{c}_i^{\top} \vec{w}_j) \times \prod_{n \sim P_n(w)}^{N} \sigma(-\vec{c}_i^{\top} \vec{w}_n) \right), \tag{5.2}$$
where $P_n(w)$ is the noise distribution, which is a free parameter; following the suggestion of Mikolov et al. (2013), we choose the unigram distribution raised to the 3/4 power. The suggested number of negative samples, $N$, is 5 to 20 for a small dataset and 2 to 5 for a large dataset. Because machine learning research typically uses huge training datasets containing millions and sometimes billions of words, our dataset is relatively small. We therefore set the number of negative samples to 15.
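The approximation in (5.2) can be sketched for a single focal word as follows. This is a minimal illustration rather than the implementation used in the study: the function names, the toy counts, and the toy vectors are assumptions, the topic-assignment term $p(z_i \mid \cdot)$ is omitted, and the $N$ negatives are drawn once per focal word, as (5.2) is written.

```python
import math
import random


def noise_distribution(counts, power=0.75):
    """Unigram counts raised to `power`, normalised to sum to one."""
    weights = [c ** power for c in counts]
    total = sum(weights)
    return [w / total for w in weights]


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def neg_sampling_loglik(c_i, word_vecs, window_ids, noise_probs, n_neg, rng):
    """Embedding part of the log of (5.2) for one focal word i."""
    # Positive pairs: words that share a window with the focal word.
    positives = sum(log_sigmoid(dot(c_i, word_vecs[j])) for j in window_ids)
    # Negative pairs: n_neg word indices drawn from the noise distribution.
    # For simplicity the draw may also return a word from the window.
    sampled = rng.choices(range(len(word_vecs)), weights=noise_probs, k=n_neg)
    negatives = sum(log_sigmoid(-dot(c_i, word_vecs[n])) for n in sampled)
    return positives + negatives


# Toy vocabulary of four words (illustrative values only).
counts = [100, 10, 1, 1]
noise_probs = noise_distribution(counts)
word_vecs = [[0.2, 0.1], [-0.3, 0.4], [0.5, -0.2], [0.0, 0.3]]
c_i = [0.4, -0.1]

rng = random.Random(0)
loglik = neg_sampling_loglik(c_i, word_vecs, window_ids=[1, 2],
                             noise_probs=noise_probs, n_neg=15, rng=rng)
print(noise_probs, loglik)
```

Raising the counts to the 3/4 power flattens the unigram distribution, so rare words are drawn as negatives more often than their raw frequency would imply. The cost per focal word is proportional to the window size plus $N$ instead of the vocabulary size $V$.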
5.3.2 Preference Measurement Models Considering Brand Heterogeneity and Consumer Attributes
Another extension of this study to Moody's (2016) LDA2vec model is to combine, in a supervised learning fashion, the word embedding model that considers text topic and sentiment, as explained above, with a preference measurement model that considers brand heterogeneity and consumer attributes. Recall that the topic assignments are assumed to follow the categorical distribution of the topic proportion, $z_i \sim \mathrm{categorical}(\theta_{d_i})$. The topic proportion $\theta_d$ summarizes the product attributes mentioned in review $d$, and in the preference measurement model it works as a set of explanatory variables for the customer satisfaction, that is, the rating score of review $d$.
Let $y_d$ be the satisfaction score of review $d$, which takes values from 1 to $R$ ($R$ depends on the e-commerce platform; for example, $R = 5$ on Amazon). We define two ordered probit models to clarify the structure of the satisfaction scores, according to two conceivable processes by which consumer attributes directly or indirectly affect the satisfaction structure. First, in the direct effect model, the consumer attributes work as explanatory variables for the satisfaction score to capture their direct effects on the satisfaction structure as follows.
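$$\begin{aligned}
y_d &= r \quad \text{if } \tau_{r-1} \le y_d^{*} < \tau_r, \\
y_d^{*} &= \alpha_{b_d} + \sum_{k=1}^{K} \beta_{b_d k}\,\theta_{dk} + \sum_{q=1}^{Q} X_{dq}\,\gamma_q + e_d, \quad e_d \sim N(0, \sigma^2),
\end{aligned} \tag{5.3}$$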
where $K$ is the total number of topics, including negative, neutral, and positive topics, and the thresholds $\{\tau_r\}$ map the latent continuous variable $y_d^{*}$ to the discrete satisfaction score $y_d$. For model identifiability, the two outermost thresholds, $\tau_0$ and $\tau_R$, are set to $-\infty$ and $\infty$, respectively, and the next two thresholds, $\tau_1$ and $\tau_{R-1}$, are set to the points of the normal cumulative distribution that correspond to the empirical ratios of the corresponding rating scores. Also, $b_d$ indicates the brand about which review $d$ was written, and we introduce brand heterogeneity in the brand intercept $\alpha$ and in the coefficients $\beta$ of the topic distribution. $X_d$ is a vector of dummy variables for the categorical attributes of the reviewer of review $d$, such as age and reviewer ranking status, and $\gamma$ is the coefficient vector capturing the direct effects of the consumer attributes on the satisfaction structure.
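To make the ordered probit structure in (5.3) concrete, the following Python sketch computes the latent mean and the implied probability of each rating. All numbers, including the thresholds, are made up for illustration, and the function names `latent_mean` and `rating_probs` are hypothetical.

```python
import math
from statistics import NormalDist


def latent_mean(alpha_b, beta_b, theta_d, x_d, gamma):
    """Mean of y*_d: brand intercept + topic effects + attribute effects."""
    topic_part = sum(b * t for b, t in zip(beta_b, theta_d))
    attribute_part = sum(x * g for x, g in zip(x_d, gamma))
    return alpha_b + topic_part + attribute_part


def rating_probs(mean, thresholds, sigma=1.0):
    """P(y_d = r) for r = 1..R, given thresholds [tau_0, ..., tau_R]."""
    dist = NormalDist(mu=mean, sigma=sigma)
    cdf = []
    for tau in thresholds:
        if tau == -math.inf:
            cdf.append(0.0)
        elif tau == math.inf:
            cdf.append(1.0)
        else:
            cdf.append(dist.cdf(tau))
    # P(tau_{r-1} <= y* < tau_r) is a difference of normal CDF values.
    return [cdf[r] - cdf[r - 1] for r in range(1, len(thresholds))]


# Toy review with K = 3 topics and Q = 2 attribute dummies (made-up values).
theta_d = [0.1, 0.2, 0.7]   # topic proportions of review d
beta_b = [-1.0, 0.0, 1.5]   # brand-specific topic coefficients
x_d = [1, 0]                # attribute dummies of the reviewer
gamma = [0.3, -0.2]         # direct attribute effects

mean = latent_mean(0.5, beta_b, theta_d, x_d, gamma)
thresholds = [-math.inf, -1.0, 0.0, 1.0, 2.0, math.inf]  # R = 5
probs = rating_probs(mean, thresholds)
print(mean, probs)
```

Dropping the `attribute_part` term gives the latent mean of a model without direct attribute effects.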
Next, we construct a model different from the above direct effect model to understand the effects of the consumer attributes on the product attributes that reviewers mention in their reviews; that is, the indirect effect of the consumer attributes on the satisfaction structure through the hierarchical structure of the topic distribution. In the direct effect model above, we assume a Dirichlet prior for the topic distribution of each polarity, $\tilde{\theta}_{dl}$, whereas the indirect effect model places a hierarchical prior structure on the topic distributions as follows.
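$$\begin{aligned}
\tilde{\theta}_{dl} &= \mathrm{softmax}(\bar{\theta}_{dl}), \\
\bar{\theta}_{dl} &= \sum_{q=1}^{Q} X_{dq}\,\gamma_{lq} + \lambda_{dl}, \quad \lambda_{dl} \sim MVN(0,\, 0.1^2 I).
\end{aligned} \tag{5.4}$$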
The indirect effect model therefore does not include the direct effects of the consumer attributes on the satisfaction structure: $y_d^{*} = \alpha_{b_d} + \sum_{k=1}^{K} \beta_{b_d k}\,\theta_{dk} + e_d$.
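The hierarchical prior in (5.4) can be simulated with a short Python sketch: the attribute dummies shift the pre-softmax vector, Gaussian noise with standard deviation 0.1 is added, and the softmax maps the result to a topic distribution. The function names and the coefficient values are hypothetical; `gamma_l[q]` is assumed to hold the $K_l$-dimensional coefficient vector $\gamma_{lq}$.

```python
import math
import random


def softmax(values):
    """Map a real vector to a probability vector (shifted for stability)."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def sample_topic_dist(x_d, gamma_l, rng, noise_sd=0.1):
    """Draw one topic distribution of polarity l for review d, as in (5.4)."""
    n_topics = len(gamma_l[0])
    theta_bar = []
    for k in range(n_topics):
        linear = sum(x_d[q] * gamma_l[q][k] for q in range(len(x_d)))
        theta_bar.append(linear + rng.gauss(0.0, noise_sd))
    return softmax(theta_bar)


# Toy setting: Q = 2 attribute dummies, K_l = 3 topics (made-up values).
x_d = [1, 0]
gamma_l = [
    [0.5, -0.5, 0.0],   # coefficients of attribute q = 1
    [0.2, 0.1, -0.3],   # coefficients of attribute q = 2
]

rng = random.Random(0)
theta_tilde = sample_topic_dist(x_d, gamma_l, rng)
print(theta_tilde)
```

Reviewers with different attribute dummies therefore have different prior expectations about which topics they mention, which is the channel for the indirect effect.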
Therefore, the likelihood of the preference measurement direct effect model for the customer satisfaction is provided as follows:
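$$p(Y, \tau, \alpha, \beta, \gamma, \sigma \mid \tilde{\theta}, \pi, X) = \left( \prod_{d=1}^{D} p(y_d \mid y_d^{*}, \tau)\, p(y_d^{*} \mid \tilde{\theta}_d, \pi_d, \alpha_{b_d}, \beta_{b_d}, \gamma, \sigma, X) \right) \times p(\tau, \alpha, \beta, \gamma, \sigma), \tag{5.5}$$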
and that of the indirect effect model is provided as follows:
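$$p(Y, \tau, \alpha, \beta, \gamma, \sigma \mid \tilde{\theta}, \pi, X) = \left( \prod_{d=1}^{D} p(y_d \mid y_d^{*}, \tau)\, p(y_d^{*} \mid \tilde{\theta}_d, \pi_d, \alpha_{b_d}, \beta_{b_d}, \sigma) \right) \times p(\tau, \alpha, \beta, \gamma, \sigma), \tag{5.6}$$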
where $p(\tau, \alpha, \beta, \gamma, \sigma)$ is the joint prior distribution for the coefficients, which is defined in Appendix A.4. Under the assumption that the likelihood (5.1) and the likelihood (5.5) or (5.6) are conditionally independent given the topic distributions, the full joint likelihood of the proposed model is the product of these equations multiplied by the prior density for the topic distribution. This prior is the Dirichlet prior $\tilde{\theta}_{dl} \mid \tilde{\theta}_0 \sim \mathrm{Dirichlet}(\tilde{\theta}_0)$ in the direct effect model and the multivariate normal prior on the pre-softmax vector, $\bar{\theta}_{dl} \mid X_d, \gamma_l \sim MVN(X_d^{\top}\gamma_l,\, 0.1^2 I)$, in the indirect effect model.
Finally, we briefly introduce the estimation procedure for the proposed models. We take a hybrid approach that combines Markov chain Monte Carlo (MCMC) sampling with gradient-based stochastic optimization using Adam (Kingma and Ba, 2015). At each iteration, the stochastic optimization updates the point estimates of the two sets of embedding vectors; then, given these point estimates, we sample from the posterior distributions using the Metropolis-Hastings algorithm for the topic distributions and Gibbs sampling for the remaining parameters. The estimation procedure and the settings of the analysis in the following empirical study are explained in more detail in Appendix A.4.
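The alternation between an optimization step and a sampling step can be sketched on a toy problem. The following Python code is not the procedure of Appendix A.4: the scalar toy target, the random-walk proposal, the step sizes, and the function names are all assumptions made for illustration, and the Gibbs updates are omitted. Only the Adam update rule itself follows Kingma and Ba (2015).

```python
import math
import random


def adam_step(param, grad, state, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update that moves `param` against `grad` (minimisation)."""
    state["t"] += 1
    state["m"] = b1 * state["m"] + (1 - b1) * grad
    state["v"] = b2 * state["v"] + (1 - b2) * grad * grad
    m_hat = state["m"] / (1 - b1 ** state["t"])  # bias-corrected first moment
    v_hat = state["v"] / (1 - b2 ** state["t"])  # bias-corrected second moment
    return param - lr * m_hat / (math.sqrt(v_hat) + eps)


def mh_step(x, log_density, rng, step=0.5):
    """One random-walk Metropolis-Hastings step for a scalar parameter."""
    proposal = x + rng.gauss(0.0, step)
    log_ratio = log_density(proposal) - log_density(x)
    if rng.random() < math.exp(min(0.0, log_ratio)):
        return proposal
    return x


# Toy joint log density (assumed for illustration only):
#   log p(w, theta) = -0.5 * (w - theta)^2 - 0.5 * (theta - 1)^2
# `w` stands in for an embedding, `theta` for a sampled parameter.
rng = random.Random(0)
w, theta = 0.0, 0.0
state = {"t": 0, "m": 0.0, "v": 0.0}
draws = []

for _ in range(2000):
    # (1) Optimization step: update the point estimate of w given theta.
    grad_neg_loglik = w - theta  # derivative of 0.5 * (w - theta)^2 in w
    w = adam_step(w, grad_neg_loglik, state, lr=0.05)

    # (2) Sampling step: draw theta given the updated point estimate of w.
    def log_density(th, w_now=w):
        return -0.5 * (w_now - th) ** 2 - 0.5 * (th - 1.0) ** 2

    theta = mh_step(theta, log_density, rng)
    draws.append(theta)

print(w, sum(draws) / len(draws))
```

The design point the sketch illustrates is that the embeddings are never sampled: they enter the sampling step only as fixed point estimates that are refreshed at every iteration.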