Summary - Full text, Thesis, The Graduate University for Advanced Studies (総合研究大学院大学) Academic Repository, A1888 full text

7 Conclusion

This chapter summarizes our contributions and discusses future work.

7.1 Discussion

Matching different instances of the same real-world entity is an important problem in the current era of data explosion. In this dissertation, we present a series of solutions to instance matching that target the issues of heterogeneity, ambiguity, and scalability. The solutions are designed for two different scenarios: supervised and non-supervised. They take the form of a framework, methods, and algorithms, and are implemented in different systems. This section discusses the achievements of the dissertation.

• We develop the core of our instance matching systems, the ScSLINT framework. ScSLINT is designed for scalability, extensibility, and portability. Its architecture is based on the general workflow of similarity-based instance matching systems. It provides interfaces for property alignment, blocking, similarity functions and their aggregation, and the determiner. ScSLINT is portable across different kinds of input data, such as relational databases and linked data repositories. ScSLINT is compatible with both classification-based and specification-based matching by adjusting the determiner. Furthermore, ScSLINT also supports the supervised matching scenario by modifying the specification creator. We evaluated ScSLINT on large real datasets, including a very large dataset linking the whole of DBpedia and Freebase. ScSLINT is far better than existing frameworks in terms of both memory and time consumption. Concretely, ScSLINT is at least 10 times faster than those frameworks, and it is the first framework evaluated on the full dataset of the OAEI 2011 instance matching challenge as well as on a trillion-scale dataset. We further propose two specification-based systems, ASL and ScLink, whose scalability is strongly supported by ScSLINT.

• In order to cover any repository with any domain and schema, we develop ASL, an automatic schema-independent instance matching system. ASL works independently of the schema because it aligns the properties using a simple but effective heuristic. ASL also adopts the principle of the stable marriage problem to find stable coreferences among the candidate instance pairs. To evaluate ASL, we constructed a diversified dataset from the links between DBpedia and Freebase, obtaining 246 subsets with different schemas. The experiments on this dataset demonstrated the schema-independent capability of ASL. Compared to recent state-of-the-art systems, ASL is significantly better in both matching performance and processing time. Another finding from the experiments is the usefulness of a blocking technique that uses only the first token. On real datasets, this technique is sufficient and does not lose a considerable amount of pair completeness in comparison to weighting and ranking methods.

• To address ambiguity, we propose the Modified-BM25 similarity metric for string values. Modified-BM25 combines measurements of different aspects of the given strings. First, it leverages the advantage of the Jaccard index for set similarity. Second, it applies the state-of-the-art BM25 weighting scheme. Third, unlike other metrics, Modified-BM25 takes the relative order of tokens into account in order to improve the disambiguation ability. These factors are combined not by the usual linear aggregation but by a probabilistic conjunction. Modified-BM25 is included in ScLink and is evaluated in the context of this system. The experimental results on many datasets, including large real repositories, showed drastic improvements when applying Modified-BM25, compared to using only other similarities such as TF-IDF cosine and Levenshtein.

• In order to facilitate scalable supervised instance matching, we propose the ScLink system. The originality of ScLink lies in three main points. First, while other supervised systems skip the learning of a blocking scheme, ScLink follows a more reasonable workflow, in which the blocking scheme is optimized to better reduce the number of candidates. Second, ScLink is equipped with the minBlock algorithm, which learns the optimal blocking scheme. Third, ScLink uses cLearn to find the specification that works best for the input repositories. cLearn utilizes a heuristic search algorithm based on the Apriori principle. The experiments on 15 datasets demonstrated the performance of the minBlock and cLearn algorithms. minBlock helps generate a compact set of candidates while losing almost no pair completeness. Meanwhile, cLearn finds specifications that significantly improve the F1 score over state-of-the-art algorithms. ScLink also consistently outperforms other systems in many comparisons.

• Other systems simply apply traditional classifiers to instance matching and ignore the ranking of instance pairs with respect to each source instance. In order to benefit from the generalization capability of machine learning classification while including the ranking nature of instance matching, we propose the R2M feature. R2M represents the ranking of instance pairs by probabilistic values. The model that generates R2M is trained by optimization algorithms. R2M is evaluated with different state-of-the-art classifiers, including linear regression, J48 decision tree, SVM, and random forest. R2M demonstrated an outstanding contribution to the generalization of all tested classifiers. The inclusion of R2M also significantly enhances the base classifiers. Moreover, the best classifier improved by R2M significantly outperforms ScLink and other systems.

• We developed different systems for various scenarios of instance matching. In summary, we recommend ASL for non-supervised matching tasks where schema independence is required. ScLink and a classifier with R2M are suitable for supervised tasks. The accuracy of a classifier with R2M is better than that of ScLink, but ScLink is more scalable. Therefore, the choice between ScLink and R2M depends on the acceptable computational cost and the expected accuracy.

• It is possible to combine specification-based and classification-based methods in the scenario of supervised matching (e.g., the combination of ScLink and R2M). Specification learning is responsible for finding the optimal similarity functions, and classifier learning can use the results of those functions as the input similarity vector. Such a combination could reduce the complexity of the classifier. However, it is potentially redundant and conflicting because both learning phases focus on the same objective. The training of classifiers already includes the selection of optimal similarity functions. Also, the usefulness of similarities is treated differently by each learning mechanism. In other words, the selection made in the first learning phase can affect the performance of the second phase. Therefore, putting both learning phases together requires further study of model combination.
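As an illustration of the general similarity-based workflow summarized in this section (blocking, similarity computation, and a determiner), the following is a minimal Python sketch. It is not the ScSLINT or ASL implementation. The function names, the toy labels, the plain Jaccard similarity, the threshold of 0.5, and the greedy one-to-one determiner are all assumptions chosen for illustration. The greedy determiner is only a simplified stand-in for a stable-marriage style selection, and reduction ratio is a standard blocking measure added here alongside pair completeness.

```python
"""Toy sketch of a similarity-based instance matching workflow.

All names, data, and parameter values are illustrative assumptions,
not the ScSLINT or ASL implementation.
"""


def first_token(label):
    """Blocking key: the first lowercase token of a label ('' if empty)."""
    tokens = label.lower().split()
    return tokens[0] if tokens else ""


def block_by_first_token(source, target):
    """Return candidate (source_id, target_id) pairs sharing a first token."""
    index = {}
    for tid, label in target.items():
        index.setdefault(first_token(label), []).append(tid)
    return [
        (sid, tid)
        for sid, label in source.items()
        for tid in index.get(first_token(label), [])
    ]


def jaccard(a, b):
    """Jaccard index over the lowercase token sets of two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def pair_completeness(candidates, true_matches):
    """Fraction of the true matches that survive blocking."""
    return len(set(candidates) & set(true_matches)) / len(true_matches)


def reduction_ratio(candidates, source, target):
    """Fraction of the full cross product removed by blocking."""
    return 1.0 - len(set(candidates)) / (len(source) * len(target))


def greedy_one_to_one(scored_pairs, threshold):
    """Determiner: accept pairs in descending score order, one per instance."""
    used_source, used_target, accepted = set(), set(), []
    for score, sid, tid in sorted(scored_pairs, reverse=True):
        if score < threshold:
            break  # remaining pairs score even lower
        if sid not in used_source and tid not in used_target:
            used_source.add(sid)
            used_target.add(tid)
            accepted.append((sid, tid))
    return sorted(accepted)


# Hypothetical toy repositories and ground truth.
source = {"s1": "Tokyo Tower", "s2": "Kyoto Station", "s3": "Osaka Castle"}
target = {
    "t1": "Tokyo Tower Japan",
    "t2": "Tokyo Skytree",
    "t3": "Kyoto Station Building",
    "t4": "Castle of Osaka",
}
true_matches = {("s1", "t1"), ("s2", "t3"), ("s3", "t4")}

candidates = block_by_first_token(source, target)
scored = [(jaccard(source[s], target[t]), s, t) for s, t in candidates]
matches = greedy_one_to_one(scored, threshold=0.5)

pc = pair_completeness(candidates, true_matches)
rr = reduction_ratio(candidates, source, target)
print("candidates:", candidates)
print("pair completeness:", pc, "reduction ratio:", rr)
print("matches:", matches)
```

On this contrived data, blocking keeps 3 of the 12 possible pairs and misses the one true match whose labels start with different tokens. The example only shows how the two blocking measures are computed; it says nothing about the real-data results reported in this chapter.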
