Model - Tohoku University Institutional Repository TOUR

4.4.1 Partially Labeled and Supervised Topic Model

Below, we define the partially labeled supervised LDA (PLS-LDA) model, which combines the labeled topic and supervised topic models.

First, we show how to determine the labeled topics using labeled words. After removing unnecessary words, we consider the total vocabulary, which includes all words in the customer reviews as text content. To make the topics interpretable, some representative words are selected from the total vocabulary for each topic, as discussed in the previous section; we call these words labeled words. The remaining words, which have no predefined labels, are called nonlabeled words and are assigned to topics according to a specific distribution.

To build the topic-specific vocabulary of topic $k$ from the total vocabulary, we follow the approach used in the existing literature, such as labeled LDA (Ramage et al., 2009) and joint-sentiment LDA (Lin et al., 2012; Tirunillai and Tellis, 2014). We first generate the topic's transformation matrix $\Lambda^{(k)}$ of size $V_k \times V$, conditional on $\lambda_1, \lambda_2, \ldots, \lambda_{V_k}$, where $\lambda_i$, $i = 1, \ldots, V_k$, is the sequence of total-vocabulary indices of the words available to topic $k$ (its labeled words and the nonlabeled words). Here, $V_k$ is the number of labeled words for topic $k$ plus the number of nonlabeled words, and $V$ is the size of the total vocabulary in the dataset. For each row $i \in \{1, 2, \ldots, V_k\}$ and column $j \in \{1, 2, \ldots, V\}$, we set

$$
\Lambda^{(k)}_{ij} =
\begin{cases}
1 & \text{if } \lambda_i = j \\
0 & \text{otherwise}
\end{cases}
\tag{4.1}
$$

In this model, the topic-vocabulary vector $\phi_k$, called the word distribution, is assumed to follow a Dirichlet distribution, $\phi_k \sim \text{Dirichlet}(\beta^{*}_k)$, where $\beta^{*}_k$ is obtained from the $V$-dimensional hyperparameter $\beta$ via the transformation matrix, $\beta^{*}_k = \Lambda^{(k)} \beta$. Hence, $\beta^{*}_k$ has $V_k$ dimensions.
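The construction in Equation (4.1) can be illustrated with a small sketch. The vocabulary size, the word indices, and the value of $\beta$ below are hypothetical, and the code uses 0-based indices where the text uses 1-based ones; it is an illustration under those assumptions rather than the implementation used in this study.

```python
import numpy as np

def transformation_matrix(lam, V):
    """Build the V_k x V selection matrix of Equation (4.1).

    lam: 0-based total-vocabulary indices of the words available to topic k
         (the lambda_i of the text are 1-based).
    """
    lam = np.asarray(lam, dtype=int)
    Lambda = np.zeros((len(lam), V))
    Lambda[np.arange(len(lam)), lam] = 1.0  # Lambda[i, j] = 1 iff lam[i] == j
    return Lambda

# Hypothetical toy setup: V = 6 words in total; topic k may use words 0, 2, 5.
V = 6
beta = np.full(V, 0.5)                  # assumed symmetric hyperparameter
Lambda_k = transformation_matrix([0, 2, 5], V)
beta_star_k = Lambda_k @ beta           # V_k-dimensional prior parameter

rng = np.random.default_rng(0)
phi_k = rng.dirichlet(beta_star_k)      # word distribution over topic k's vocabulary
```

Each row of `Lambda_k` selects exactly one entry of `beta`, so `phi_k` places probability mass only on the words available to topic $k$.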

The remaining LDA part of the model is unrestricted and follows the conventional LDA procedure. That is, the topic assignment for the $n$-th word in review $d$, which consists of $N_d$ words whose order is ignored (the bag-of-words assumption), is assumed to follow a categorical distribution, $z_{dn} \sim \text{categorical}(\theta_d)$, where $\theta_d$ is the topic distribution that represents the topic proportions within the text of review $d$ and follows a Dirichlet prior distribution, $\theta_d \sim \text{Dirichlet}(\alpha)$. We assume that a word assigned to topic $k$ is generated from the corresponding word distribution, $w_{dn} \mid z_{dn} = k \sim \text{categorical}(\phi_k)$.
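As a rough illustration of this generative process for a single review, the following sketch draws $\theta_d$, the topic assignments, and the words in turn. The sizes `K`, `V`, `N_d` and the prior values are hypothetical, and the word distributions here are unrestricted, so the label constraint of Equation (4.1) is not applied.

```python
import numpy as np

rng = np.random.default_rng(1)

K, V, N_d = 3, 8, 20                            # hypothetical sizes
alpha = np.full(K, 0.5)                         # assumed symmetric Dirichlet prior
phi = rng.dirichlet(np.full(V, 0.5), size=K)    # K x V word distributions

theta_d = rng.dirichlet(alpha)                  # theta_d ~ Dirichlet(alpha)
z_d = rng.choice(K, size=N_d, p=theta_d)        # z_dn ~ categorical(theta_d)
# w_dn | z_dn = k ~ categorical(phi_k), drawn one word at a time
w_d = np.array([rng.choice(V, p=phi[k]) for k in z_d])
```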

Next, we develop the response function part of our model. We use two dependent variables: the satisfaction score and the helpfulness count. The satisfaction score reflects the current evaluation by customers who have already purchased the product, based on their past experience, and we assume that the helpfulness count reflects the interest in and expectations of the product among other consumers who might purchase it in the future. In addition, as discussed in the previous section, the labeled topics in customer reviews are used directly as covariates that explain the variation in satisfaction scores and that serve as references for product expectations in the helpfulness model. For example, if one product feature satisfies customer needs particularly well, we expect the topic related to this feature to co-occur with a high satisfaction score in online customer reviews. Conversely, topics related to dissatisfying features should co-occur with low satisfaction scores. Regarding the connection with review helpfulness, a textual review is likely to be regarded as helpful by readers if it contains the topics in which they are interested.

Given the word-topic assignments $z_{dn}$, $N_{dl} = \sum_{n=1}^{N_d} I\{z_{dn} = l\}$ is the number of words in review $d$ assigned to topic $l$, which we use as a covariate after a logarithmic transformation. Note that the number of covariates for these topic assignments equals the number of labeled topics, $L$ (set to five in this study). However, the total number of topics can differ from the number of labeled topics. Let the total number of topics be $K$; the remaining $K - L$ topics are nonlabeled and do not include any labeled words. Nonlabeled topics are extracted from the review text but, for the sake of manageability, are not used as covariates for either dependent variable.
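The covariate construction can be sketched as follows. The function name `topic_covariates` is hypothetical, and the sketch assumes 0-based topic indices with the labeled topics numbered `0..L-1`, which is an indexing convention chosen here for illustration.

```python
import numpy as np

def topic_covariates(z_d, L):
    """Return log(N_dl + 1) for the L labeled topics of one review.

    z_d: 0-based topic assignments of the words in review d; topics
         0..L-1 are assumed to be the labeled ones.
    """
    z_d = np.asarray(z_d, dtype=int)
    counts = np.bincount(z_d, minlength=L)[:L]   # N_dl for l = 0..L-1
    return np.log(counts + 1.0)

# Hypothetical review with 7 words, K = 4 topics, L = 2 labeled topics.
z_d = [0, 0, 1, 3, 3, 3, 0]
x_topics = topic_covariates(z_d, L=2)   # log(3 + 1) and log(1 + 1)
```

Words assigned to nonlabeled topics (topic 3 in this toy review) are counted by the topic model but do not enter the covariate vector. Adding one before taking the logarithm keeps the covariate finite when a labeled topic receives no words.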

The total number of topics, $K$, is determined by model selection criteria, namely the deviance information criterion (DIC; Spiegelhalter et al., 2002) and the widely applicable information criterion (WAIC; Watanabe, 2010), as demonstrated in Section 4.5. In addition to the labeled topic variables, some control variables also serve as covariates. We use the following four status variables: purchase verification, top contributor, top reviewer, and vine voice. A status badge is displayed next to the user icon if the user qualifies for that status. A word count variable, which represents the number of words included in the customer review, is also considered.
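For reference, one common way to compute WAIC from posterior draws is sketched below. It assumes an $S \times D$ matrix of pointwise log-likelihoods (one row per posterior draw, one column per review) and reports the criterion on the deviance scale, with the effective number of parameters estimated by the posterior variance of the log-likelihood. The function name `waic` is hypothetical, and the scaling convention used in Section 4.5 may differ.

```python
import numpy as np

def waic(log_lik):
    """WAIC on the deviance scale from an S x D matrix of pointwise
    log-likelihoods (S posterior draws, D observations)."""
    log_lik = np.asarray(log_lik, dtype=float)
    S = log_lik.shape[0]
    m = log_lik.max(axis=0)
    # log pointwise predictive density, using a stable log-mean-exp over draws
    lppd = np.sum(m + np.log(np.exp(log_lik - m).sum(axis=0)) - np.log(S))
    # effective number of parameters: variance of the log-likelihood over draws
    p_waic = np.sum(log_lik.var(axis=0, ddof=1))
    return -2.0 * (lppd - p_waic)
```

Under this convention, smaller values indicate better expected predictive performance, so candidate values of $K$ would be compared by fitting the model for each and selecting the one with the lowest criterion.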

We assume that the satisfaction score, measured on a five-point scale, follows an ordered probit model and that the helpfulness count, which takes positive integer values, follows a Poisson regression model. First, let the satisfaction score of review $d$ be $y_{s,d}$, which follows the ordered probit model:

$$
\begin{aligned}
y_{s,d} &= r, \quad \text{if } \tau_{r-1} \le y^{*}_{s,d} < \tau_r, \\
y^{*}_{s,d} &= \sum_{l=1}^{L} \gamma_{s,l} \cdot \log(N_{dl} + 1) + \sum_{m=1}^{5} \delta_{s,m} \cdot x_{s,dm} + e_d; \quad e_d \sim N(0, 1).
\end{aligned}
\tag{4.2}
$$

In the above, the thresholds $\{\tau_r\}$ produce the discrete satisfaction scores from the latent continuous variable $y^{*}_{s,d}$, and the thresholds $\tau_0$ and $\tau_R$ ($R = 5$ in the Amazon dataset) are set to $-\infty$ and $\infty$, respectively. $x_{s,d}$ is a vector of the control variables: purchase verification, top contributor, top reviewer, vine voice, and word count. The error term $e_d$ is assumed to follow a standard normal distribution, and the model does not include an intercept term so that the $R - 1$ thresholds are identified.
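Because $e_d$ is standard normal, Equation (4.2) implies $P(y_{s,d} = r) = \Phi(\tau_r - \eta_d) - \Phi(\tau_{r-1} - \eta_d)$, where $\eta_d$ denotes the linear predictor without the error term and $\Phi$ is the standard normal distribution function. The sketch below evaluates these probabilities; the symbol $\eta_d$, the function names, and the threshold values are illustrative assumptions, not estimates from this study.

```python
import math

def norm_cdf(x):
    """Standard normal CDF; also valid for +/- infinity."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def ordered_probit_probs(eta, cutpoints):
    """P(y = r) for r = 1..R given the linear predictor eta.

    cutpoints: the R - 1 finite thresholds tau_1 < ... < tau_{R-1};
               tau_0 = -inf and tau_R = +inf are added here.
    """
    tau = [-math.inf] + list(cutpoints) + [math.inf]
    return [norm_cdf(tau[r] - eta) - norm_cdf(tau[r - 1] - eta)
            for r in range(1, len(tau))]

# Hypothetical values: R = 5 scores, hence four finite thresholds.
probs = ordered_probit_probs(eta=0.3, cutpoints=[-1.5, -0.5, 0.5, 1.5])
```

The five probabilities sum to one for any value of the linear predictor, and a larger predictor shifts mass toward higher satisfaction scores.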

Next, given the satisfaction rating score $y_{s,d}$, we define the response model of the helpfulness count $y_{h,d}$ for review $d$ using the Poisson regression model:

$$
\begin{aligned}
y_{h,d} &\sim \text{Poisson}(y^{*}_{h,d}), \\
y^{*}_{h,d} &= \sum_{l=1}^{L} \gamma_{h,l} \cdot \log(N_{dl} + 1) + \sum_{m=1}^{5} \delta_{h,m} \cdot x_{h,dm} + \delta_{h,6}\, y_{s,d},
\end{aligned}
\tag{4.3}
$$

where $x_{h,d}$ is the same vector of control variables as in the satisfaction probit model in Equation (4.2), and $y_{s,d}$ is included following findings in the literature (e.g., Ho-Dac, Carson, and Moore, 2013; Mauri and Minazzi, 2013; Ludwig et al., 2013), which demonstrate that positive and negative online customer reviews affect reader purchase intentions and expectations. We also explore the effect of the level of customer satisfaction on the helpfulness perceived by readers.
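The helpfulness model can be sketched as below. The code follows Equation (4.3) literally, using the linear expression directly as the Poisson rate, and therefore checks that the rate is positive before evaluating the log-probability. The function names, coefficients, and data values are hypothetical.

```python
import math

def helpfulness_rate(gamma_h, log_counts, delta_h, x_hd, y_sd):
    """Poisson rate y*_{h,d} as written in Equation (4.3).

    delta_h holds six coefficients: five for the controls x_hd and
    the sixth for the satisfaction score y_sd.
    """
    topic_part = sum(g * c for g, c in zip(gamma_h, log_counts))
    control_part = sum(d * x for d, x in zip(delta_h[:5], x_hd))
    return topic_part + control_part + delta_h[5] * y_sd

def poisson_logpmf(y, rate):
    """log P(Y = y) for Y ~ Poisson(rate); the rate must be positive."""
    if rate <= 0:
        raise ValueError("Poisson rate must be positive")
    return y * math.log(rate) - rate - math.lgamma(y + 1)

# Hypothetical review: two labeled topics shown for brevity, five controls
# (four status indicators and a word count), satisfaction score 4.
rate = helpfulness_rate(
    gamma_h=[0.4, 0.2], log_counts=[math.log(4), math.log(2)],
    delta_h=[0.1, 0.0, 0.3, 0.2, 0.01, 0.5],
    x_hd=[1, 0, 0, 1, 120], y_sd=4,
)
loglik = poisson_logpmf(3, rate)   # log-probability of 3 helpful votes
```

Passing the observed $y_{s,d}$ into the rate is what links the two response models: the satisfaction score is an outcome in Equation (4.2) and a covariate in Equation (4.3).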

Therefore, the satisfaction and helpfulness models in Equations (4.2) and (4.3) are sequentially connected through the observation $y_{s,d}$ to form the integrated PLS-LDA model.

