Chapter 5: Predicting Consumer Behavior
5.3 Methods
To attain the research objective, we followed the research methodology shown in Figure 5.1.
Figure 5.1 Research Methodology

5.3.1 Sampling and data collection
Data were collected between June and July 2016 through a field survey based on a structured questionnaire. The survey was conducted in Bheramara sub-district of Kushtia, a north-western district of Bangladesh where PHC has been serving since 2012. Closed-ended questions were used to capture respondents' demographics, and a five-point Likert scale from 'extremely disagree' to 'extremely agree', with a neutral point at 3, was used to measure the cognitive variables. The questionnaire was prepared initially in English and later translated into Bengali (the local language of Bangladesh). A pilot study was conducted with seven randomly selected rural patients aged 18 or above to assess the understandability of the questionnaire.
Their feedback was used to revise the questionnaire. To protect respondents' right of privacy, we briefed them on the research purpose and asked whether they were willing to participate in the survey and to allow us to use their responses in our scientific publications.
A total of 592 questionnaires were distributed randomly; however, after removing unsuitable cases due to missing fields and partially answered questionnaires, 292 respondents remained as our effective sample. The sample was drawn by simple random sampling, which eliminates selection bias by giving all individuals an equal chance of being chosen [51]. Beleites et al. [123] suggested that a minimum of 75 to 100 samples per class is required for a good, though not perfect, classifier. Figueroa et al. [124] examined a total of 568 supervised-learning classification models and found that models with sample sizes between 80 and 560 achieved optimal performance. We therefore consider 292 a reasonably optimal sample size for our study.
5.3.2 Feature selection
In predictive modeling, feature selection is used to minimize redundancy and computational effort while maximizing prediction accuracy by keeping the most relevant, non-redundant features [163]. Hall [164] proposed correlation-based feature selection as one of the most widely used and easily explained methods for machine learning classification models. In this study, we selected 12 of the 14 features based on their correlation coefficient (r) with the dependent variable and the level of significance (p-value).
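As a sketch of this screening step, the snippet below ranks features by their correlation with the dependent variable and keeps those passing a significance threshold, assuming the coded responses are in a pandas DataFrame `df` with a binary target column `ehealth_use` (hypothetical names); the study itself used a feature selection module in Azure ML Studio (Figure 5.3).

```python
# Correlation-based feature screening: keep features whose Pearson
# correlation with the target is statistically significant, ranked
# by absolute correlation strength.
import pandas as pd
from scipy.stats import pearsonr

def select_features(df: pd.DataFrame, target: str, alpha: float = 0.05):
    selected = []
    for col in df.columns.drop(target):
        r, p = pearsonr(df[col], df[target])
        if p < alpha:                  # significance filter (p-value)
            selected.append((col, r, p))
    return sorted(selected, key=lambda t: abs(t[1]), reverse=True)
```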
5.3.3 Selection of algorithms
Since our dependent variable is eHealth use with a binary 'Yes'/'No' response, we are dealing with a binary classification problem. Existing studies [165]–[168] suggest that logistic regression, boosted decision tree, support vector machine, and artificial neural network algorithms perform well on binary classification. We therefore selected these four machine learning algorithms to predict eHealth usage among rural consumers in Bangladesh.
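For illustration, the four algorithms can be instantiated with their common scikit-learn counterparts, as sketched below; these are stand-ins chosen for exposition, since the experiment itself used the corresponding built-in modules of Azure ML Studio (Figure 5.3).

```python
# The four binary classifiers named above, instantiated with
# scikit-learn equivalents for illustration.
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier   # boosted decision trees
from sklearn.svm import SVC                                # support vector machine
from sklearn.neural_network import MLPClassifier           # artificial neural network

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Boosted Decision Tree": GradientBoostingClassifier(),
    "Support Vector Machine": SVC(probability=True),   # probabilities enable AUC
    "Artificial Neural Network": MLPClassifier(max_iter=1000),
}
```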
5.3.4 Cross-validation
Cross-validation is one of the most frequently used techniques in predictive modeling to reduce bias and over-fitting, which manifest as misclassification error on unseen data. In this study, we applied 10-fold cross-validation to evaluate the validity of the present results and to support predictions on unobserved new data. In this method, each of the 10 subsets acts in turn as an independent holdout test set for the model trained on the remaining subsets; each pairing of a training set and a test set is called a 'fold'. Borra et al. [169] and Kohavi [170] suggested that 10-fold cross-validation is an optimal method to reduce bias and over-fitting. Figure 5.2 shows that increasing the number of subsets (folds) up to 10 reduces the misclassification error significantly; beyond 10, the error stabilizes or in some cases may even increase slightly. Many researchers therefore recommend 10-fold cross-validation to reduce misclassification error and over-fitting.
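As a minimal illustration, 10-fold cross-validation can be run with scikit-learn as sketched below, reusing the `models` dictionary from the previous sketch and assuming a feature matrix `X` and binary labels `y` (placeholder names, not the actual study variables).

```python
# 10-fold cross-validation for each candidate classifier; every fold
# holds out one tenth of the data for testing and trains on the rest.
from sklearn.model_selection import cross_val_score

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
    print(f"{name}: mean accuracy = {scores.mean():.3f} (+/- {scores.std():.3f})")
```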
Figure 5.2 Optimum subset for cross-validation
[Source: R. Kohavi, “A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection,” Proc. Int. Jt. Conf. Artif. Intell., vol. 14, no. 2, pp. 1137–1145, 1995.]
5.3.5 Performance evaluation
Existing studies [171], [172] suggest five commonly used indexes, namely Accuracy, Precision, Recall, F-Score, and AUC, for evaluating the performance of binary classifiers. We therefore adopted these five indexes to evaluate and compare the models' performance in our study. The indexes are calculated from the contingency counts in Table 5.1 using the following formulas.
Table 5.1 Contingency table for performance evaluation

                     Actual Positive    Actual Negative
Predicted Positive   TP                 FP
Predicted Negative   FN                 TN

TP, true positive; FP, false positive; FN, false negative; TN, true negative
Accuracy measures the overall effectiveness of a classification model as the proportion of true results to total cases.
Accuracy = (TP + TN) / (TP + FP + TN + FN)
Precision measures the proportion of predicted positive results that are truly positive.
Precision = TP / (TP + FP)
Recall measures the effectiveness of a classifier in identifying positive results.
Recall = TP / (TP + FN)
F-score is the weighted harmonic mean of precision and recall, bounded between 0 and 1, where the ideal value is 1.
F-score = 2 × (Precision × Recall) / (Precision + Recall)
AUC measures the classifier's ability to avoid false classification. The receiver operating characteristic (ROC) curve plots the true positive rate on the y-axis against the false positive rate on the x-axis, and AUC is the area under that curve. This metric is useful because it provides a single number for comparing models of different types; the closer the area is to 1, the better the model. For a single classification threshold, it can be estimated as the mean of sensitivity and specificity:
AUC = ½ {TP / (TP + FN) + TN / (TN + FP)}
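For concreteness, the following plain-Python sketch computes the five indexes directly from the contingency counts of Table 5.1, mirroring the formulas above.

```python
# Compute the five evaluation indexes from a confusion matrix,
# following the formulas given in the text.
def evaluate(tp: int, fp: int, fn: int, tn: int) -> dict:
    accuracy  = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall    = tp / (tp + fn)
    f_score   = 2 * precision * recall / (precision + recall)
    # single-threshold AUC estimate: mean of sensitivity and specificity
    auc = 0.5 * (tp / (tp + fn) + tn / (tn + fp))
    return {"Accuracy": accuracy, "Precision": precision,
            "Recall": recall, "F-Score": f_score, "AUC": auc}
```

For example, `evaluate(tp=50, fp=10, fn=5, tn=35)` returns all five indexes for a hypothetical confusion matrix.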
We performed the experiment using Microsoft Azure Machine Learning Studio, a cloud-based platform for building, testing, and deploying predictive analytics solutions [173]. Figure 5.3 shows the machine learning model used in this experiment.
Figure 5.3 Predictive Model in Microsoft Azure ML Studio
In this experiment, we first coded the raw data into numeric form and then applied a correlation-based feature selection module to retain the most relevant, non-redundant features in the model. Four supervised machine learning algorithms, namely logistic regression, boosted decision tree, support vector machine, and artificial neural network, were applied with 10-fold cross-validation to predict consumer purchase behavior. Finally, the algorithms' performance was measured and compared using five indexes: Accuracy, Precision, Recall, F-Score, and AUC.