
In document: Consumer Behavior of eHealth Services (pages 74-79)

Chapter 5: Predicting Consumer Behavior

5.3 Methods

To attain the research objective, we followed several steps in our research methodology, as shown in Figure 5.1.

Figure 5.1 Research Methodology

5.3.1 Sampling and data collection

Data were collected between June and July 2016 through a field survey based on a structured questionnaire. The survey was conducted in Bheramara sub-district of Kushtia, a north-western district of Bangladesh, where PHC has been in service since 2012.

Predicting Consumer Behavior 63

Graduate School of Information Science and Electrical Engineering, Kyushu University

Close-ended questions were used to extract respondents' demographics, and a five-point Likert scale from 'extremely disagree' to 'extremely agree', with a neutral point at 3, was used to measure the cognitive variables. The questionnaire was prepared initially in English and later translated into Bengali (the local language of Bangladesh). A pilot study was conducted with 7 randomly selected rural patients aged 18 or older to assess the understandability of the questionnaire.

Their feedback was used to revise the questionnaire. To protect the respondents' right to privacy, they were briefed on the research purpose and asked whether they wished to participate in the survey and to allow us to use their responses in our scientific publications.

A total of 592 questionnaires were distributed randomly; however, after removing unsuitable cases due to missing fields and partially answered questionnaires, we retained 292 respondents as our effective sample. The sample was drawn by simple random sampling, which eliminates bias by giving all individuals an equal chance of being chosen [51]. Beleites et al. [123] suggested that a minimum of 75 to 100 samples per class is required for a good, though not perfect, classifier.

Figueroa et al. [124] examined a total of 568 supervised-learning classification models and found that models with sample sizes between 80 and 560 achieved optimum performance. We therefore consider 292 a reasonably optimal sample size for our study.

5.3.2 Feature selection

In predictive modeling, feature selection is required to minimize redundancy and computational effort while maximizing prediction accuracy by retaining the most relevant, non-redundant features [163]. Hall [164] proposed correlation-based feature selection as one of the most widely used and easy-to-explain methods for machine learning classification models. In this study, we selected 12 of the 14 candidate features based on their correlation coefficient (r) with the dependent variable and level of significance (p-value).
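As an illustration of this kind of correlation-based filtering, the sketch below keeps only features whose Pearson correlation with the binary outcome is statistically significant. This is not the study's actual code; the feature name 'attitude', the synthetic data, and the 0.05 threshold are assumptions for illustration only.

```python
# Illustrative sketch of correlation-based feature selection:
# keep features significantly correlated with the dependent variable.
import numpy as np
from scipy.stats import pearsonr

def select_features(X, y, feature_names, alpha=0.05):
    """Return (name, r, p) for features with p-value below alpha."""
    selected = []
    for j, name in enumerate(feature_names):
        r, p = pearsonr(X[:, j], y)
        if p < alpha:                      # keep only significant features
            selected.append((name, r, p))
    return selected

# Toy data: one informative feature, one pure-noise feature
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)           # binary 'eHealth use' outcome
informative = y + rng.normal(0, 0.3, 200)  # correlated with y
noise = rng.normal(0, 1, 200)              # unrelated to y
X = np.column_stack([informative, noise])

kept = select_features(X, y, ["attitude", "noise"])
print([name for name, r, p in kept])
```

In the study itself, 12 of 14 features passed this kind of filter; the threshold and exact procedure would follow the correlation-based feature selection module rather than this hand-rolled loop.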

5.3.3 Selection of algorithms

Since our dependent variable is eHealth use with a binary response ('Yes' or 'No'), we are dealing with a binary classification problem. Existing studies [165]–[168] suggest that logistic regression, boosted decision trees, support vector machines, and artificial neural networks perform well in binary classification. We therefore selected these four machine learning algorithms to predict eHealth usage among rural consumers in Bangladesh.

5.3.4 Cross-validation

Cross-validation is one of the most frequently used techniques in predictive modeling to reduce bias and over-fitting, which manifest as misclassification error. In this study, we applied 10-fold cross-validation to evaluate the validity of the results and to assess predictions on unseen data. In this method, each of the 10 subsets acts as an independent holdout test set for a model trained on the remaining subsets; each such pairing of training and test sets is called a 'fold'. Borra and Di Ciaccio [169] and Kohavi [170] suggested that 10-fold cross-validation is an optimal method to reduce bias and over-fitting. Figure 5.2 shows that increasing the number of subsets (folds) up to 10 reduces the misclassification error significantly; beyond 10, the error stabilizes or in some cases may even increase slightly. Many researchers therefore recommend 10-fold cross-validation to reduce misclassification error and over-fitting.
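The procedure described above, evaluating the four selected algorithms under 10-fold cross-validation, can be sketched as follows. This uses scikit-learn as a stand-in for the Azure ML Studio modules actually used in the study, with synthetic data of the same shape (292 samples, 12 features); the hyperparameters are illustrative defaults, not the study's settings.

```python
# Illustrative sketch: four binary classifiers evaluated with
# 10-fold cross-validation (synthetic data, scikit-learn stand-ins).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the survey data: 292 samples, 12 features
X, y = make_classification(n_samples=292, n_features=12, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Boosted Decision Tree": GradientBoostingClassifier(random_state=0),
    "Support Vector Machine": SVC(),
    "Artificial Neural Network": MLPClassifier(max_iter=2000, random_state=0),
}

scores = {}
for name, model in models.items():
    # cv=10: each of the 10 subsets serves once as the holdout test set
    scores[name] = cross_val_score(model, X, y, cv=10, scoring="accuracy").mean()
    print(f"{name}: {scores[name]:.3f}")
```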

Figure 5.2 Optimum subset for cross-validation

[Source: R. Kohavi, “A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection,” Proc. Int. Jt. Conf. Artif. Intell., vol. 14, no. 2, pp. 1137–1145, 1995.]

5.3.5 Performance evaluation

Existing studies [171], [172] identify five commonly used indexes, namely Accuracy, Precision, Recall, F-score, and AUC, for evaluating the performance of binary classifiers.

We therefore adopted these five indexes to evaluate and compare the models' performance in our study. The indexes can be calculated from the counts in Table 5.1 using the formulas below.

Table 5.1 Contingency table for performance evaluation

                     Actual Positive   Actual Negative
Predicted Positive   TP                FP
Predicted Negative   FN                TN

TP, true positive; FP, false positive; FN, false negative; TN, true negative


Accuracy measures the overall effectiveness of a classification model as the proportion of true results to total cases.

Accuracy = (TP + TN) / (TP + FP + TN + FN)

Precision measures the proportion of true results to all positive results.

Precision = TP / (TP + FP)

Recall measures the effectiveness of a classifier to identify positive results.

Recall = TP / (TP + FN)

F-score is computed as the harmonic mean of precision and recall and ranges between 0 and 1, where the ideal F-score is 1.

F-score = 2 × (Precision × Recall) / (Precision + Recall)

AUC measures the classifier's ability to avoid false classification. The ROC curve plots the true positive rate on the y-axis against the false positive rate on the x-axis, and AUC is the area under that curve. This metric is useful because it provides a single number for comparing models of different types; the closer the area is to 1, the better the model.

AUC = ½ {TP / (TP + FN) + TN / (TN + FP)}
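A minimal worked example of the five indexes, computed from made-up confusion-matrix counts (the numbers here are illustrative only, not results from the study):

```python
# Worked example: the five evaluation indexes from a confusion matrix.
def evaluate(tp, fp, fn, tn):
    accuracy  = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall    = tp / (tp + fn)
    f_score   = 2 * precision * recall / (precision + recall)
    # Single-threshold formula used in the text (mean of the two class-wise rates)
    auc       = 0.5 * (tp / (tp + fn) + tn / (tn + fp))
    return accuracy, precision, recall, f_score, auc

# Made-up counts: 80 true positives, 20 false positives,
# 10 false negatives, 90 true negatives
acc, prec, rec, f1, auc = evaluate(tp=80, fp=20, fn=10, tn=90)
print(f"Accuracy={acc:.3f} Precision={prec:.3f} Recall={rec:.3f} "
      f"F-score={f1:.3f} AUC={auc:.3f}")
```

With these counts, accuracy is 170/200 = 0.85, precision 80/100 = 0.80, and recall 80/90 ≈ 0.89, matching the formulas above term by term.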

We performed the experiment using Microsoft Azure Machine Learning Studio, a cloud-based computing platform that allows users to build, test, and deploy predictive analytics solutions [173]. Figure 5.3 shows the machine learning model used for this experiment.

Figure 5.3 Predictive Model in Microsoft Azure ML Studio

In this experiment, we first coded the raw data into numeric form and then applied a correlation-based feature selection module to retain the most relevant, non-redundant features in the model. Four supervised machine learning algorithms, namely logistic regression, boosted decision tree, support vector machine, and artificial neural network, were then applied to predict consumer purchase behavior using the cross-validation method.

Finally, the algorithms' performance was measured and compared using the five indexes: Accuracy, Precision, Recall, F-score, and AUC.
