CWs describe important parts of a text and have various uses in text mining. In this Chapter, we proposed a framework that identifies CWs for a natural language query. We outlined how a large text corpus can be exploited to identify CWs, particularly for feature calculation and scalability handling, and we presented experimental results and their findings. Furthermore, we outlined how CWs can be exploited to devise keywords for BoTLRet and TLDRet by adapting a state-of-the-art entity linking framework. Thereby, we discussed how LiCord, together with the state-of-the-art entity linking framework, can automatically devise keywords for a natural language query.
ALDErrD: An Information Assessment Framework over Linked Data
We first introduce the problem (Section 6.1). Then we describe the basic idea of our proposed framework (Section 6.2). Next, we describe the proposed ALDErrD in detail (Section 6.3). We present the results of implementing our proposal through experiments and discussion (Section 6.4). Later, we explain how the proposed framework solves the problems that exist in contemporary systems (Section 6.5). Finally, in Section 6.6 we summarize the Chapter.
6.1 Introduction
Linked Data currently hold a vast amount of knowledge. However, Linked Data suffer from poor data quality, which brings the need to identify erroneous data. Because manual checking for erroneous data is impractical, automatic erroneous data detection is necessary.
According to the data publishing guidelines of Linked Data, data should use an (already defined) ontology, which populates type-annotated Linked Data. Usually, the data type annotation helps in understanding the data. However, in our observation, the data type annotation can also be used to identify erroneous data. Therefore, to automatically identify possible erroneous data over type-annotated Linked Data, we propose a framework that uses a novel nearest-neighbor based error detection technique. We conducted experiments with our framework on DBpedia, a type-annotated Linked Data dataset, and found that it detects errors better than a state-of-the-art framework.
6.1.1 Motivation
Usually, Linked Data are generated either manually or automatically. However, both generation procedures have flaws that produce erroneous Linked Data entries. The manual intervention-based procedure usually generates cleaner data; however, such data still contain erroneous entries because of human errors.
Moreover, such data may be generated from multiple sources, which sometimes differ from one another. Examples of such data are PubMed and DrugBank. On the other hand, the automatic procedure generates more data than the manual intervention-based procedure because it is easy and automatic. However, such data are more prone to errors. One reason is wrong content in the main data source from which the Linked Data are automatically extracted: if the main data source contains wrong entries, the generated Linked Data inherit them. An example is the wrong entries of DBpedia that are extracted because of wrong Wikipedia contents. Moreover, erroneous entries can also be introduced by a problematic data extraction process. Therefore, under either generation procedure, Linked Data are likely to contain errors [58, 87]. However, to use such
Linked Data effectively, data consumers commonly expect to retrieve high-quality data easily. This brings the need to identify erroneous data in the Linked Data.
Because manual checking for erroneous data is usually impractical, automatic erroneous data detection is necessary.
On the other hand, according to the best-practice data publishing guidelines [47] of Linked Data, data should use an (already defined) ontology, which populates type-annotated Linked Data (see Section 2.1.3 in Chapter 2 and the Linked Data part of Figure 1.1). In the figure, the Linked Data part holds a type annotation: for instance, “rc:cygri” is annotated as “Person” using “rdf:type”. In the real world, DBpedia is a type-annotated Linked Data dataset. Usually, the data type annotation helps in understanding the data. However, in our observation, the data type annotation can also be used to identify erroneous data. The intuition behind this assumption is that Linked Data resources of the same type should share the same kind of values. Therefore, if the data values of some Linked Data resources deviate from the usual pattern or trend of other resources of the same type, we consider them as erroneous data.
This assumption is fully consistent with data outlier detection, which is a common approach to erroneous data identification. Chapter 2 has already outlined some common outlier detection methods; we will utilize some of them in the current quality assessment.
However, it is worth mentioning that an outlier is not always an error; outlier-based detection instead gives an opportunity to check the data and find erroneous entries over the type-annotated Linked Data.
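The intuition above can be illustrated with a minimal sketch. It is not the ALDErrD algorithm itself, only a generic nearest-neighbor outlier score applied per type and property; the resource names (`ex:p1`, ...), the property names, the values, and the parameters `k` and `factor` are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from statistics import median

def knn_scores(values, k=2):
    """Score each value by its mean distance to its k nearest neighbours."""
    scores = []
    for i, v in enumerate(values):
        dists = sorted(abs(v - w) for j, w in enumerate(values) if j != i)
        nearest = dists[:k]
        scores.append(sum(nearest) / len(nearest) if nearest else 0.0)
    return scores

def flag_candidates(triples, types, k=2, factor=3.0):
    """Return (subject, property, value) triples whose value lies far from
    the values of other resources of the same type.

    triples: iterable of (subject, property, numeric value)
    types:   mapping subject -> type (the rdf:type annotation)
    factor:  assumed threshold; a score above factor * median score is flagged
    """
    # Group values by (type, property): same-type resources are compared
    # only against each other.
    groups = defaultdict(list)
    for s, p, v in triples:
        groups[(types[s], p)].append((s, v))

    flagged = []
    for (_type, p), items in groups.items():
        scores = knn_scores([v for _, v in items], k)
        threshold = factor * median(scores)
        for (s, v), score in zip(items, scores):
            if score > threshold:
                flagged.append((s, p, v))
    return flagged

# Hypothetical data: one height was entered in centimetres instead of metres.
types = {"ex:p1": "Person", "ex:p2": "Person", "ex:p3": "Person",
         "ex:p4": "Person", "ex:p5": "Person",
         "ex:c1": "City", "ex:c2": "City", "ex:c3": "City"}
triples = [("ex:p1", "height", 1.70), ("ex:p2", "height", 1.82),
           ("ex:p3", "height", 1.65), ("ex:p4", "height", 1.78),
           ("ex:p5", "height", 175.0),
           ("ex:c1", "population", 500000), ("ex:c2", "population", 520000),
           ("ex:c3", "population", 480000)]

print(flag_candidates(triples, types))
```

Here only the height of `ex:p5` is reported as a candidate, because its distance to its nearest same-type neighbours is far larger than the typical distance within the group. As noted above, such a candidate is not necessarily an error; it only marks a value worth checking.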
In the past, some studies have dealt with finding erroneous data in Linked Data. However, these studies have their own limitations. For example, some require domain-level expertise on the Linked Data [1, 59, 116]. Some require another similar data source [19, 58, 74], are not suitable for diverse datasets, or are impractical for large datasets. Other works target specific data types and ignore errors in the remaining data types [34, 111].
In this study, we are motivated to address the above-mentioned issues.
The previous two chapters (Chapters 3 and 4) have dealt with Linked Data information access, which is a major issue in the success of Linked Data. In this Chapter, we investigate another crucial issue, Linked Data quality assessment, that can make Linked Data a success [8]. We understand that users will be interested in accessing the assessed data, so that they can rely on them.
6.1.2 Contributions
We propose a framework to identify possible candidates of erroneous data over type-annotated Linked Data. The framework is named Auto Linked Data Error Detector (ALDErrD) [87], which automatically detects potential error patterns and predicts possible candidates of erroneous data. The main features of our proposed framework ALDErrD are the following:
(I.) It is free from manual intervention
(II.) It does not require domain-level expertise
(III.) It does not require other data sources of the same kind
(IV.) It is suitable for any type of data