Assessment of Linked Data Quality - Thesis full text, SOKENDAI (The Graduate University for Advanced Studies) Academic Information Repository, No. 甲1878

The proposed framework is TLDRet. We describe it in Chapter 4.

In Manual Intervention-based Error Detection, human assessors are engaged to assess the data quality. We review these studies to compare them against the automatic error identification task. Particular-data-type-based Error Detection frameworks, in contrast, focus only on a specific type of error, such as errors in numerical data. We review them to motivate error detection that is independent of the error type. In Similar Data Source-based Error Detection, two or more data sources are required to identify possible errors. Ontology Enrichment-based Error Detection frameworks, in turn, enhance data quality. We review them because they support our methodology. Below we discuss each category in more detail.

(1.) Manual Intervention-based Error Detection: Studies of this kind inspect the Linked Dataset and then manually devise rules to identify errors [1, 59, 116]. Although these studies produce decent results, they require domain-level expertise. Domain experts are not easy to find, and even when they are found, the process remains costly.

Therefore, manual intervention-based error detection studies are difficult to adapt to diverse datasets and impractical for large datasets.

Below we briefly discuss two of them.

• CLDQA [1] is “Crowdsourcing Linked Data Quality Assessment”. In this research, the authors used crowdsourcing as a means of handling Linked Data quality assessment, which they considered too challenging to solve automatically. They analyzed the most common errors encountered in Linked Data sources and classified them into three categories: “Object Incorrectly/Incompletely Extracted” (triples holding an incorrect object value), “Data Type Incorrectly Extracted” (resources holding an incorrect instance type), and “Falsely Interlinked” (triples holding incorrect or non-existing links). Based on this classification, the authors distributed the tasks to two groups of assessors: (i) a contest targeting an expert crowd of researchers and Linked Data enthusiasts, complemented by (ii) paid microtasks published on Amazon Mechanical Turk. The authors concluded that crowdsourcing-enabled quality assessment is a promising and affordable way to enhance the quality of Linked Data.

• TELDQ [59] is “Test-driven Evaluation of Linked Data Quality”. In this research, the authors investigate a dataset and create patterns that identify possibly erroneous entries. Their approach requires some knowledge of the vocabularies, ontologies, and knowledge bases in use, which they argue helps to ensure a basic level of quality. The methodology is inspired by test-driven software development: it looks for “bad smells” in the data and uses them to assess data quality. The patterns, which are SPARQL query templates, are formed from these “bad smells”, and a template can be instantiated into different concrete patterns as required.
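The idea of instantiating a SPARQL query template can be illustrated with a minimal sketch. The placeholder syntax (Python's `string.Template`), the pattern itself, and the helper name `instantiate` are assumptions made for illustration; they are not the template language used in [59]. The two DBpedia property URIs serve only as example bindings.

```python
from string import Template

# Hypothetical "bad smell" pattern: a resource whose value for property p1
# compares to its value for property p2 in a way that should never hold.
COMPARISON_PATTERN = Template("""\
SELECT DISTINCT ?s WHERE {
  ?s <$p1> ?v1 .
  ?s <$p2> ?v2 .
  FILTER (?v1 $op ?v2)
}""")


def instantiate(pattern, **bindings):
    """Fill every placeholder of a pattern; raises KeyError if one is missing."""
    return pattern.substitute(**bindings)


# One concrete test case: a birth date that lies after the death date.
birth_after_death = instantiate(
    COMPARISON_PATTERN,
    p1="http://dbpedia.org/ontology/birthDate",
    p2="http://dbpedia.org/ontology/deathDate",
    op=">",
)
print(birth_after_death)
```

The same template could be bound to other property pairs, which is what makes a single pattern reusable across datasets that share a vocabulary.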

(2.) Particular-data-type-based Error Detection: Studies of this kind detect errors only in Linked Data of a particular data type [34, 111]. For example, Wienand et al. looked for errors in numerical Linked Data [111]. However, error detection restricted to one data type ignores the large amount of erroneous data of other data types.

Below we briefly discuss one of them.

• DINDD [111] is “Detecting Incorrect Numerical Data in DBpedia”. In this research, the authors proposed a method that finds incorrect numerical data. They applied unsupervised numerical outlier detection, using different methods such as the Interquartile Range (IQR) [108], Kernel Density Estimation [78], and various dispersion estimators, combined with different semantic grouping methods.
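As an illustration of IQR-based outlier detection, the following minimal sketch flags values that lie more than k times the IQR outside the first or third quartile. The function name `iqr_outliers`, the conventional factor k = 1.5, and the sample heights are assumptions chosen for the example; the sketch does not reproduce the semantic grouping used in [111].

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # three quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Illustrative heights in metres; the last value looks like a unit mix-up
# (centimetres stored where metres were expected).
heights = [1.98, 2.01, 2.03, 2.06, 2.08, 2.11, 203.0]
print(iqr_outliers(heights))  # [203.0]
```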

(3.) Similar Data Source-based Error Detection: Studies of this kind look for two or more data sources that describe the same Linked Data resource and then compare the data to identify errors [19, 58, 74]. However, similar data sources are not readily available for all kinds of Linked Data resources. Moreover, even when such data sources are available, cross-checking them is not easy. Therefore, similar data source-based error detection studies are difficult to adapt.

Below we describe one of them.

• LCRSCWDF [19] is “Learning Conflict Resolution Strategies for Cross-Language Wikipedia Data Fusion”. In this research, the authors analyzed the DBpedia dataset and resolved conflicts between data values. They noted that DBpedia 3.9 holds 2.46 billion facts, among which conflicting values exist. They therefore used DBpedia entries from 119 different languages and resolved conflicting values by applying an algorithm that considers the values given in the different language versions. The resolution algorithm, which is customizable for each relation, is supplied with some ground truth, so that by analyzing the ground truth it can decide which fact is more likely to be correct than the others.
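The notion of choosing a resolution strategy per relation from ground truth can be sketched as follows. This is a simplified illustration, not the learning procedure of [19]: the three candidate strategies, the function names, and the training values are all assumptions. For one relation, each strategy is scored by how often it reproduces the known correct value, and the best-scoring strategy is kept.

```python
from collections import Counter
from statistics import median


def majority(values):
    """Most frequent value; ties are broken by first occurrence."""
    return Counter(values).most_common(1)[0][0]


# Hypothetical candidate strategies for fusing conflicting numeric values.
STRATEGIES = {"majority": majority, "median": median, "max": max}


def learn_strategy(training):
    """training: list of (conflicting_values, true_value) pairs for one relation.

    Returns the name of the strategy that matches the ground truth most often.
    """
    def hits(name):
        resolve = STRATEGIES[name]
        return sum(resolve(values) == truth for values, truth in training)

    return max(STRATEGIES, key=hits)


# Illustrative ground truth for a relation whose newest (largest) value
# tends to be the correct one across language editions.
training = [([100, 100, 120], 120), ([50, 55, 55, 60], 60), ([10, 12], 12)]
best = learn_strategy(training)
print(best, STRATEGIES[best]([200, 210, 205]))
```

Because the strategy is selected separately for each relation, a relation where editions mostly agree could end up with majority voting while another ends up with a different rule.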

(4.) Ontology Enrichment-based Error Detection: Studies of this kind also need to check the data manually [61, 64, 79, 102]. In some cases, ontology enrichment can be done automatically, as in the work of Lehmann et al. [79]. In that research, the authors automatically assigned types to Linked Data resources that do not hold type information; however, their research focus was not error detection. Our proposed framework can be applied on top of their proposal because we use type (i.e., Class) information as input.

• TINRD [79] is “Type Inference on Noisy RDF Data”. This research does not focus primarily on error detection or quality assessment; rather, it provides an ontology enhancement technique. However, enhancement is only necessary when the data are not completely defined. In that sense, TINRD can identify incomplete data, so to some extent we can consider it a quality assessment framework. The authors propose a statistical type identification method. To identify the type, for each relation they check the incoming (subject) and outgoing (object) links, or URIs. From those links (subjects and objects) and the types they use (types of subjects and types of objects), the method calculates a probability for each type. Then, if a link (subject or object) does not hold type information but is connected through the relation, the method can assign it the most probable type.
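The statistical typing described above can be approximated with a small sketch. It assumes a simplification: per relation and position (subject or object), the share of typed resources carrying each type is used as a probability, and an untyped resource receives the type with the highest average probability over the relations it takes part in. The averaging rule, the function names, and the toy triples are assumptions and do not reproduce the weighting scheme of [79].

```python
from collections import Counter, defaultdict


def learn_type_distributions(triples, types):
    """Estimate P(type | relation, position) from resources with known types."""
    type_counts = defaultdict(Counter)
    totals = Counter()
    for s, p, o in triples:
        for resource, position in ((s, "subject"), (o, "object")):
            known = types.get(resource)
            if known:
                totals[(p, position)] += 1
                type_counts[(p, position)].update(known)
    return {
        key: {t: n / totals[key] for t, n in counts.items()}
        for key, counts in type_counts.items()
    }


def infer_type(resource, triples, distributions):
    """Return (most probable type, averaged probability), or None without evidence."""
    scores = Counter()
    evidence = 0
    for s, p, o in triples:
        for candidate, position in ((s, "subject"), (o, "object")):
            if candidate == resource and (p, position) in distributions:
                evidence += 1
                for t, prob in distributions[(p, position)].items():
                    scores[t] += prob
    if not evidence:
        return None
    best, total = scores.most_common(1)[0]
    return best, total / evidence


# Toy data: "Kyoto" has no type but appears as subject of "country".
triples = [
    ("Tokyo", "country", "Japan"),
    ("Osaka", "country", "Japan"),
    ("Honshu", "country", "Japan"),
    ("Kyoto", "country", "Japan"),
]
types = {"Tokyo": {"City"}, "Osaka": {"City"}, "Honshu": {"Island"}, "Japan": {"Country"}}
distributions = learn_type_distributions(triples, types)
print(infer_type("Kyoto", triples, distributions))
```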

2.3.2 Unsolved Issue

To assess Linked Data quality, in Section 1.2.2 we stated the problem of Wrong Entries in existing Linked Datasets. Below we explain the problem in light of the literature review.

(1.) automatic data quality assessment framework for Linked Data, irrespective of data type: here we explain the research problem stated above.

In our literature review, we did not find any Linked Data quality assessment framework that can automatically assess the data for any kind of error. For example, the work in [1, 59, 116] consists of manual intervention-based error detection frameworks. Although they produce decent results, they require domain-level expertise. Furthermore, the work in [19, 58, 74] is similar data source-based error detection, which requires finding two or more data sources for the same kind of data. However, similar data sources are not readily available for all kinds of Linked Data, and cross-checking them is not easy. We also hold that an error detection framework should not focus only on particular error types [34, 111]. Contemporary research therefore cannot assess a Linked Dataset for a wide range of possible errors, and we conclude that identifying Linked Data errors of various types is an open research issue.

Therefore, we should propose an error detection framework that can automatically assess a Linked Dataset for any kind of error, without requiring a second dataset of the same kind.

2.3.3 Our Proposal

For Linked Data quality assessment, we propose the framework ALDErrD [87].

(1.) automatic Linked Data error assessment

• proposal in brief: to address the unsolved issue, we propose an error detection framework that can automatically assess a Linked Dataset for any kind of error, without requiring a second dataset of the same kind.

• method in brief: in the proposed framework we assume that Linked Data of the same type should share the same kind of values. For example, the height of a “Basketball Player” (i.e., the type of the data) should be similar to the heights of other “Basketball Players”; we expect individuals who are Basketball Players to be relatively tall.

Erroneous Linked Data values are detected as data outliers.
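The intuition that a value should be judged against other resources of the same type can be sketched as follows. This is only an illustration under stated assumptions, not the ALDErrD implementation: the outlier test used here is the modified z-score based on the median absolute deviation (MAD) with the common cutoff of 3.5, and the “Jockey” type, the resource names, and the heights are invented for the example.

```python
from collections import defaultdict
from statistics import median


def flag_outliers_by_type(records, threshold=3.5):
    """records: iterable of (resource, type, value).

    Values are compared only with values of the same type, using the
    modified z-score 0.6745 * |x - median| / MAD.
    """
    groups = defaultdict(list)
    for resource, rdf_type, value in records:
        groups[rdf_type].append((resource, value))

    flagged = []
    for rdf_type, members in groups.items():
        values = [v for _, v in members]
        centre = median(values)
        mad = median(abs(v - centre) for v in values)
        if mad == 0:
            continue  # no spread in this group, so the score is undefined
        for resource, value in members:
            if 0.6745 * abs(value - centre) / mad > threshold:
                flagged.append((resource, rdf_type, value))
    return flagged


# Illustrative heights in metres. 1.60 is ordinary for a jockey but
# suspicious for a basketball player; 2.06 is the reverse.
records = [
    ("p1", "BasketballPlayer", 1.98), ("p2", "BasketballPlayer", 2.01),
    ("p3", "BasketballPlayer", 2.03), ("p4", "BasketballPlayer", 2.06),
    ("p5", "BasketballPlayer", 2.11), ("p6", "BasketballPlayer", 1.60),
    ("j1", "Jockey", 1.55), ("j2", "Jockey", 1.57), ("j3", "Jockey", 1.58),
    ("j4", "Jockey", 1.60), ("j5", "Jockey", 1.63), ("j6", "Jockey", 2.06),
]
for resource, rdf_type, value in flag_outliers_by_type(records):
    print(resource, rdf_type, value)
```

Grouping by type is what lets the same numeric value be acceptable in one group and suspicious in another, which a single global threshold could not express.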
