(Form No. 6) "For Course Doctorate"
Summary of Doctoral Dissertation
Major: Systems Engineering    Name: 駱 曦 (らく ぎ) [seal]
Dissertation title
Improvement of Automatic Chinese Text Classification by Combining Character-based and Word-based Approaches
(Japanese title: 中国語テキスト自動分類に関する研究, "A Study on Automatic Chinese Text Classification")
Automatic text classification (ATC) is the task of automatically assigning one or more appropriate categories to a document according to its content or topic. Traditionally, text classification was carried out by human experts, as it requires a certain level of vocabulary recognition and knowledge processing. With the rapid growth of digital texts and online information, text classification has become an important research area owing to the need to handle and organize text collections automatically.
Many standard machine learning techniques have been applied to automated text classification, and the k-Nearest Neighbor (kNN) algorithm and the Support Vector Machine (SVM) have been reported as the top-performing methods for English text classification. However, Chinese text classification has been studied less thoroughly than English, and Chinese text has its own characteristics.
Because there is no natural delimiter between Chinese words, segmentation is necessary before any other preprocessing. Numerous segmentation approaches have been proposed for Chinese text classification; they can be broadly divided into character-based and word-based approaches. Since there are no publicly available Chinese corpora, it is difficult to tell which approach performs better.
We first performed Chinese text classification using the character-based (N-gram) approach. Experimental results show that the combination of uni-grams and bi-grams (1+2-gram) was the most efficient way to represent a Chinese document.
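The 1+2-gram representation can be illustrated with a minimal sketch. The function names and the whitespace handling below are assumptions made for illustration, not the implementation used in the dissertation; punctuation is not treated specially.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all contiguous character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def one_plus_two_gram_features(text):
    """Count uni-gram and bi-gram features (the 1+2-gram representation)."""
    # Assumption: whitespace carries no information and is dropped.
    chars = "".join(text.split())
    return Counter(char_ngrams(chars, 1) + char_ngrams(chars, 2))


# Hypothetical six-character example document.
features = one_plus_two_gram_features("中文文本分类")
```

For this six-character string the sketch produces six uni-gram occurrences and five bi-gram occurrences, with no word segmentation required.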
We experimentally evaluated feature transformation techniques, namely normalizing absolute frequency to relative frequency and applying a power transformation. The results show a significant improvement in performance.
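The two transformations might be sketched as follows. The abstract does not state the exponent of the power transformation, so the value 0.5 (a square root) is purely an assumption, as are the function names.

```python
def relative_frequency(counts):
    """Normalize absolute frequencies so that they sum to 1."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {term: count / total for term, count in counts.items()}


def power_transform(freqs, exponent=0.5):
    """Raise each frequency to `exponent` (0.5 is an assumed value)."""
    return {term: freq ** exponent for term, freq in freqs.items()}


# Hypothetical absolute counts for one document.
counts = {"中": 1, "文": 2, "中文": 1}
rel = relative_frequency(counts)
transformed = power_transform(rel)
```

With an exponent below 1, the transformation compresses the gap between frequent and rare features, which is one common motivation for applying it to term frequencies.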
N-gram and word-segmentation feature extraction on a large corpus yields a large number of candidate features. In ATC, high dimensionality of the feature space can be problematic in terms of computation time and storage. Our experiments show that Principal Component Analysis (PCA) is an efficient and effective way to reduce the dimensionality.
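As a rough illustration of PCA-based dimensionality reduction, the sketch below centers a small document-by-feature matrix and projects it onto its leading principal components using a singular value decomposition. It assumes NumPy is available; the function name and the toy matrix are hypothetical and unrelated to the data used in the experiments.

```python
import numpy as np


def pca_reduce(X, n_components):
    """Project the rows of X onto the top principal components."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value.
    _, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    # Fraction of total variance captured by each component.
    explained = singular_values ** 2 / np.sum(singular_values ** 2)
    return X_centered @ components.T, explained[:n_components]


# Toy matrix: 4 documents x 3 features, all lying on one line.
X = [[2.0, 0.0, 1.0],
     [4.0, 0.0, 2.0],
     [6.0, 0.0, 3.0],
     [8.0, 0.0, 4.0]]
reduced, explained = pca_reduce(X, n_components=1)
```

Because the toy rows are perfectly correlated, a single component captures essentially all of the variance; real feature matrices would need more components.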
Continuation sheet: Yes ☑ No □
We then carried out several experiments based on the word-based approach. We proposed a novel feature selection method based on part-of-speech analysis. Based on the composition of Chinese texts, we use words' part-of-speech attributes to filter out many meaningless terms. The results show that nouns are the most important features for Chinese texts and that a suitable combination of parts of speech can lead to better classification performance.
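Part-of-speech filtering of this kind can be sketched as below. The tag set ("n" for noun, "v" for verb, "r" for pronoun, "u" for particle) and the hand-tagged sentence are hypothetical; the abstract does not name the tagger or tag set that was actually used.

```python
def filter_by_pos(tagged_tokens, keep_tags):
    """Keep only the words whose part-of-speech tag is in `keep_tags`."""
    return [word for word, tag in tagged_tokens if tag in keep_tags]


# Hypothetical segmented and tagged sentence (tags assigned by hand).
tagged = [("我们", "r"), ("研究", "v"), ("中文", "n"),
          ("文本", "n"), ("的", "u"), ("分类", "n")]

nouns_only = filter_by_pos(tagged, {"n"})
nouns_and_verbs = filter_by_pos(tagged, {"n", "v"})
```

Changing `keep_tags` is how different part-of-speech combinations could be compared as feature sets.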
Several sets of experiments were carried out to study the impact of automatic word segmentation errors on Chinese text classification. A comparison of four word-based approaches shows that performance dropped significantly when automatic word segmentation was used instead of manual word segmentation, which indicates that errors introduced by automatic segmentation have a clear impact on classification performance.
Furthermore, we proposed an effective way of combining the character-based (N-gram) and word-based approaches for Chinese text classification. Uni-gram and bi-gram features serve as the baseline model and are combined with word features of length three or more. We further introduced a weight coefficient that can be used to give higher weights to word features. Experimental results show that the proposed approach achieved the highest performance in our experiments.
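One possible reading of this combination is sketched below: uni-gram and bi-gram counts form the base feature set, and segmented words of three or more characters are added with a multiplicative weight. The function name, the "W:" prefix, the default weight of 2.0, and the way the weight is applied to counts are all assumptions; the abstract does not specify them.

```python
from collections import Counter


def combine_features(text, words, word_weight=2.0, min_word_len=3):
    """Merge 1+2-gram counts with weighted long-word counts."""
    chars = "".join(text.split())
    features = Counter()
    # Baseline: uni-gram and bi-gram character features.
    for n in (1, 2):
        for i in range(len(chars) - n + 1):
            features[chars[i:i + n]] += 1.0
    # Word features: only words of at least `min_word_len` characters,
    # each occurrence counted with the assumed weight coefficient.
    for word in words:
        if len(word) >= min_word_len:
            features["W:" + word] += word_weight
    return features


# Hypothetical document and its (assumed) word segmentation.
text = "计算机处理中文文本"
words = ["计算机", "处理", "中文", "文本"]
features = combine_features(text, words)
```

Words of one or two characters are skipped as word features because the uni-gram and bi-gram features already cover strings of those lengths.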
In future work, because a larger number of categories increases the difficulty of text classification, we will conduct a more extensive experimental evaluation using more texts in more categories. Another direction is to improve the classification performance on texts produced by an Optical Character Reader (OCR). Digitizing printed documents involves generating text with an OCR system, but OCR texts usually contain errors due to misrecognized characters. Therefore, it is necessary to investigate how to handle such texts effectively.