Datasets - 本文 Thesis 総合研究大学院大学学術情報リポジトリ A1888本文

4.4 Experiment

4.4.3 Datasets

The DF246 and OAEI2011 datasets contain portions of DBpedia, Freebase, NYTimes, and Geonames. Because ASL is a schema-independent system, DF246 is chosen for its variety of schemas. In addition, OAEI2011 is used to compare ASL with other, non-learning-based systems. We make a few adjustments to the data from these sources because they contain some information that is unrelated to the instance matching problem or that reduces the data quality. For DBpedia, we use only the triples that are loaded into DBpedia’s SPARQL endpoint¹ and that describe ‘resource’ instances². For Freebase, we consider only the ‘topic’ instances³ and the triples with the ‘m’ identifier⁴. In addition, for DBpedia and Freebase, we merge the redirect instances into the instances they refer to. Next, we describe the datasets in detail.
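The subject-prefix filter and the redirect merging described above can be illustrated with a minimal sketch. It assumes triples are held in memory as `(subject, predicate, object)` string tuples and that redirects are given as a dictionary from a redirect URI to its reference URI. The function names and input formats are hypothetical and are not taken from the ASL source. Only a single redirect hop is resolved.

```python
# Prefix from footnote 2: subjects of DBpedia 'resource' instances.
DBPEDIA_RESOURCE_PREFIX = "http://dbpedia.org/resource/"


def keep_triples_with_prefix(triples, prefix=DBPEDIA_RESOURCE_PREFIX):
    """Keep only the triples whose subject URI starts with `prefix`."""
    return [(s, p, o) for (s, p, o) in triples if s.startswith(prefix)]


def merge_redirects(triples, redirects):
    """Rewrite each redirect subject to the instance it refers to.

    `redirects` maps a redirect URI to its reference URI (assumed format).
    Subjects that are not redirects are left unchanged.
    """
    return [(redirects.get(s, s), p, o) for (s, p, o) in triples]


# Hypothetical example data.
example_triples = [
    ("http://dbpedia.org/resource/Tokyo", "rdfs:label", "Tokyo"),
    ("http://dbpedia.org/resource/Tokyo_City", "rdfs:label", "Tokyo City"),
    ("http://dbpedia.org/ontology/City", "rdfs:label", "city"),
]
example_redirects = {
    "http://dbpedia.org/resource/Tokyo_City": "http://dbpedia.org/resource/Tokyo",
}

cleaned = merge_redirects(keep_triples_with_prefix(example_triples), example_redirects)
```

The same prefix filter would apply to Freebase by passing the prefix given in footnote 4.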

4.4.3.1 DF246 dataset

DF246 is created for evaluating the generality of ASL, including the accuracy and the scalability. This dataset contains 246 subsets combined from 246 source repositories and 35 target repositories. The creation process for this dataset consists of the two following steps.

Step 1: Select source repositories from DBpedia. To obtain many repositories with different schemas, we divide DBpedia into smaller parts based on the existing domain information of the instances. We split the repositories under the assumption that R_S is a single-domain repository. Since no clear definition of ‘domain’ exists, owing to the naturally hierarchical relations among concepts, our assumption amounts to requiring that R_S does not contain domains whose schemas differ greatly. For example, the schemas of university and college (educational institution domain) are highly similar, whereas those of ship and train (transportation domain) are very different. In practice, this assumption is not strict.

Dividing the source repository into separable domains is a feasible and inexpensive task. In DBpedia, an instance may belong to several domains that are related by a “parent-child” relation. Therefore, a domain, which consists of a set of instances, may have one parent and many children.

Since we split the repositories under the above assumption, we merge domain C₁ into C₂ if their schemas are similar. To this end, we conducted a schema conformance check for each sub-domain relation⁵. Domain C₁ is conformable to its parent C₂ if every property p satisfies the condition in Equation 4.7.

\[ |f(p, C_1) - f(p, C_2)| \le 0.5 \tag{4.7} \]

\[ f(p, C_k) = \frac{|\{x \mid \langle s, p, o \rangle \in x,\ x \in C_k\}|}{|C_k|} \]
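The conformance condition can be sketched in a few lines of Python. The sketch assumes that a domain is a list of instances and that each instance is a list of `(s, p, o)` triples. It also assumes that "every property" means every property occurring in either domain. The function names are illustrative and are not part of ASL.

```python
def property_frequency(prop, domain):
    """f(p, C): fraction of instances in `domain` that have at least one
    triple whose predicate is `prop`. An empty domain yields 0.0."""
    if not domain:
        return 0.0
    hits = sum(
        1 for instance in domain
        if any(p == prop for (_, p, _) in instance)
    )
    return hits / len(domain)


def is_conformable(child, parent, threshold=0.5):
    """Check the condition of Equation 4.7 for every property that occurs
    in either domain (the choice of property set is an assumption)."""
    props = {
        p
        for domain in (child, parent)
        for instance in domain
        for (_, p, _) in instance
    }
    return all(
        abs(property_frequency(p, child) - property_frequency(p, parent)) <= threshold
        for p in props
    )


# Hypothetical data: two 'ship' instances that both carry a 'tonnage' property.
ships = [
    [("s1", "name", "A"), ("s1", "tonnage", "100")],
    [("s2", "name", "B"), ("s2", "tonnage", "200")],
]
# A parent domain in which most instances lack 'tonnage'.
vehicles = ships + [[(f"v{i}", "name", f"V{i}")] for i in range(4)]
```

In this toy data, `tonnage` appears in every ship but in only two of the six vehicles, so the frequency gap exceeds 0.5 and `ships` would not be merged into `vehicles`.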

¹ http://wiki.dbpedia.org/DatasetsLoaded

² The triples whose RDF subjects start with http://dbpedia.org/resource/

³ The instances that contain a triple whose value is http://rdf.freebase.com/ns/common.topic

⁴ The triples whose RDF subjects start with http://freebase.com/ns/m

⁵ http://mappings.dbpedia.org/server/ontology/classes/

Table 4.2: Overview of DF246 dataset

        # Sources    # Targets      # Expected
Min           127        1,804             126
Max       602,293   25,625,291         597,566
Avg.     10,754.1  2,002,283.6         9,339.6

Using the checking result together with the original sub-domain specification, we derive the priority for each domain to classify the instances. That is, a domain is initially given higher priority than its parent in order to diversify the dataset. However, if a domain is totally conformable to its parent, the higher priority is given to the parent instead.
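One possible reading of this priority rule is sketched below. It assumes that the types of an instance lie on a single parent-child chain, that the hierarchy is a tree given as a child-to-parent dictionary, and that the rule is applied repeatedly when several consecutive ancestors are conformable. These are assumptions made for illustration, and the function name is hypothetical.

```python
def assign_domain(instance_types, parent_of, conformable):
    """Choose one domain for an instance.

    instance_types: set of domain names the instance belongs to
    parent_of:      dict mapping a domain to its parent (None for a root)
    conformable:    set of domains that are conformable to their parent
    Returns None when the instance has no type information.
    """
    def depth(domain):
        # Number of ancestors above `domain` in the hierarchy.
        n = 0
        while parent_of.get(domain) is not None:
            domain = parent_of[domain]
            n += 1
        return n

    if not instance_types:
        return None

    # A child initially has priority over its parent: start from the most
    # specific type (sorted first so that ties are broken deterministically).
    domain = max(sorted(instance_types), key=depth)

    # If the domain is conformable to its parent, the parent takes priority.
    while domain in conformable and parent_of.get(domain) is not None:
        domain = parent_of[domain]
    return domain


# Hypothetical hierarchy using the governor/politician example.
hierarchy = {"Governor": "Politician", "Politician": "Person", "Person": None}
chosen = assign_domain({"Person", "Politician", "Governor"}, hierarchy, {"Governor"})
```

With `Governor` marked as conformable to `Politician`, the instance is classified under `Politician` rather than under the more specific `Governor`.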

In DBpedia, we extract the domain information of instances by using the property rdf:type. Among the 398 domains that have instances in DBpedia 3.9 (English), 46 classes are conformable to their parents (e.g., governor and politician, bone and anatomical structure). Note that identical domains are counted as one (e.g., http://dbpedia.org/ontology/Place and http://schema.org/Place). Also, not every instance is classifiable, because the type information of 584,520 instances is missing. Finally, the remaining 352 domains are unconformable and result in 352 repositories.

Step 2: Select target repositories from Freebase. We first separate Freebase (2013/09/03) into different repositories by using the domain information of the instances, given by the property fb:type⁶. After that, we map each source repository to the one target repository that shares the most coreferent instances with it, as declared in the gold standard⁷. Finally, we remove the mappings that have fewer than 100 coreferent pairs.

The first task extracted 35 different repositories, the second task obtained 2,668,372 standard pairs, and the last task produced 246 subsets. The 246 repositories constructed from DBpedia, the 35 repositories constructed from Freebase, and the mappings between them are listed in Appendices A, B, and C, respectively.
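The mapping and filtering tasks of Step 2 can be sketched as follows. The sketch assumes that repositories are dictionaries from a repository name to a set of instance identifiers and that the gold standard is a list of `(source_id, target_id)` pairs. All names and input formats are hypothetical. Ties between targets are broken by whichever candidate is encountered first, which is an arbitrary choice.

```python
from collections import Counter


def map_sources_to_targets(source_repos, target_repos, gold_pairs, min_pairs=100):
    """Map each source repository to the target repository that shares the
    most coreferent pairs with it, then drop mappings with fewer than
    `min_pairs` pairs.

    Returns a dict: source name -> (target name, number of shared pairs).
    """
    def build_index(repos):
        # instance id -> names of the repositories that contain it
        index = {}
        for name, instances in repos.items():
            for inst in instances:
                index.setdefault(inst, []).append(name)
        return index

    source_index = build_index(source_repos)
    target_index = build_index(target_repos)

    # Count gold-standard pairs for every (source repo, target repo) combination.
    counts = Counter()
    for s_id, t_id in gold_pairs:
        for s_name in source_index.get(s_id, ()):
            for t_name in target_index.get(t_id, ()):
                counts[(s_name, t_name)] += 1

    # Keep, for each source, the target with the largest count.
    best = {}
    for (s_name, t_name), n in counts.items():
        if s_name not in best or n > best[s_name][1]:
            best[s_name] = (t_name, n)

    # Remove mappings below the threshold.
    return {s: (t, n) for s, (t, n) in best.items() if n >= min_pairs}


# Hypothetical toy data; a small threshold is used only to keep the example short.
toy_sources = {"Ship": {"s1", "s2", "s3"}}
toy_targets = {"boats": {"t1", "t2"}, "location": {"t3"}}
toy_gold = [("s1", "t1"), ("s2", "t2"), ("s3", "t3")]
toy_mapping = map_sources_to_targets(toy_sources, toy_targets, toy_gold, min_pairs=2)
```

Because a target instance may belong to several repositories, a single gold-standard pair can contribute to more than one (source, target) count in this sketch.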

We sort these subsets in increasing order of size. The size of a subset is the product of the number of instances in the source repository and the number in the target repository. Table 4.2 summarizes the number of instances in these subsets. In this table, #Sources and #Targets represent the number of instances in the source and target repositories. #Expected is the number of actual coreferent pairs. Note that each Freebase instance refers to a topic and can belong to multiple sibling domains (e.g., book, music, and location). Therefore, overlaps are found between almost every pair of the 35 extracted repositories.
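The size-based ordering described above amounts to a one-line sort key. In the small sketch below, the subset names and instance counts are hypothetical and do not correspond to actual DF246 subsets.

```python
def subset_size(n_source, n_target):
    """Size of a subset: number of source instances times number of target
    instances, i.e. the number of candidate pairs."""
    return n_source * n_target


# Hypothetical subsets (name, source count, target count).
example_subsets = [
    {"name": "A", "n_source": 500, "n_target": 2_000},
    {"name": "B", "n_source": 127, "n_target": 1_804},
    {"name": "C", "n_source": 1_000, "n_target": 100_000},
]

# Sort in increasing order of size.
ordered = sorted(
    example_subsets,
    key=lambda s: subset_size(s["n_source"], s["n_target"]),
)
```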

⁶ http://rdf.freebase.com/ns/type.object.type

⁷ http://downloads.dbpedia.org/3.9/links/freebase_links.nt.bz2

Table 4.3: Overview of OAEI2011 dataset

ID  Target    Domain        # Sources  # Expected  # Originals
D1  DBpedia   location          3,840       1,917        1,920
D2  DBpedia   organization      6,088       1,922        1,949
D3  DBpedia   people            9,958       4,964        4,977
D4  Freebase  location          3,840       1,920        1,920
D5  Freebase  organization      6,088       3,001        3,044
D6  Freebase  people            9,958       4,979        4,979
D7  Geonames  location          3,840       1,729        1,789

4.4.3.2 OAEI2011 dataset

OAEI2011 is used to compare ASL with other manual systems. The gold standard of this dataset was presented at the OAEI 2011 instance matching track [37]. The source repositories are three separate domains of NYTimes: location, organization, and people. The targets are the full DBpedia, Freebase, and Geonames. To construct the repositories, we use DBpedia 3.7 (English), Freebase 2013/09/03, and Geonames 2014/02. The numbers of instances are 4,183,361 for DBpedia, 40,358,162 for Freebase, and 8,514,201 for Geonames.

An important note for this dataset is that we detected some incorrect and missing pairs in the previous gold standard when comparing it with the current actual data. In the evaluation, we do not count the incorrect, missing, or otherwise erroneous pairs. Consequently, the numbers of expected pairs differ slightly from the numbers of original pairs published in 2011. We summarize the size of the OAEI2011 dataset in Table 4.3.

Compared to the more recent OAEI datasets, the OAEI2011 dataset is more suitable for our experiment because its benchmark objective is similar to ours. In 2012, OAEI provided a training and test split of the OAEI2011 dataset in order to evaluate learning-based systems. The OAEI datasets of 2013 and 2014 are small and designed for special challenges (e.g., multiple languages, string distortion, value deletion, and non-uniform representations). Since ASL focuses on real and large data, OAEI2011 is suitable for the experiment.

All datasets, including the repositories, the labels for expected pairs, and their detailed descriptions as well as the ASL source and the full experimental results, are available at http://ri-www.nii.ac.jp/ASL/.
