Experiment - Thesis full text, SOKENDAI (The Graduate University for Advanced Studies) Academic Repository, A1888 full text

3.3.1 Experiment target

We evaluate ScSLINT in terms of memory and time efficiency. For the first issue, we simply deploy ScSLINT on a desktop computer with ordinary computational power.

For the second issue, we measure the time for building the repositories' indexes and the detailed running time of each component. We also compare the runtime of ScSLINT with that of other state-of-the-art frameworks, including SILK [155] and LIMES [98]. Very large datasets are used for this experiment in order to verify scalability in practice.

3.3.2 Datasets

We use seven standard datasets selected for the OAEI 2011 instance matching challenge.

The repositories related to these datasets are NYTimes¹, DBpedia², Geonames³, and Freebase⁴. The source repository is NYTimes and the targets are the others. For these datasets, NYTimes is divided into three subsets according to three data domains: location (loc), organization (org), and people (peo). For DBpedia and Freebase, every data domain of NYTimes forms a dataset. For Geonames, only one dataset, for location, is created. Therefore, there are seven datasets in total. We also use another dataset connecting DBpedia and Freebase. A summary of all datasets is provided in Table 3.2. The only criterion for selecting the datasets is scale. The first seven datasets are billion scale and the last dataset is trillion scale.

Some previous systems reported experiments on the first seven datasets. However, they used simplified versions of the datasets rather than the original data. Concretely, for DBpedia and Freebase, only the classes related to the subsets of NYTimes were used, namely the people, location, and organization classes. In fact, the ground truth of these datasets shows that the expected coreferences are not limited to instances of the corresponding class. That is, other experiments used a simplified version of the data to reduce complexity, although recall inevitably drops. Some other experiments used even simpler datasets, in which the input contains only the coreferent instances.

¹ NYTimes version 2014/02

² DBpedia version 3.9

³ Geonames version 2014/02

⁴ Freebase version 2013/09/03

The size of the input repositories is not the only issue raised by scalability. As we discussed in Section 1.1, scalability also brings the problems of heterogeneity and ambiguity. However, in our experiment, we focus on the scalability of the frameworks.

3.3.3 Experiment settings

We configure the components of ScSLINT as follows.

• Property alignment. We do not map the properties manually, so that ScSLINT performs the property comparison task itself. We use the Jaccard index (Section 2.3.1.1) for this task. Two properties are considered equivalent if their Jaccard similarity is higher than 0.75.

• Blocking. We apply a simple token-based blocking technique. We also use the result of the property mappings for this component. Given the list of property mappings M, two instances are considered a candidate if they share at least the first token of the values described by any mapping in M.

• Similarity function generator. We apply two complex similarity measures for the string type, Levenshtein and TF-IDF cosine. That is, for each string mapping, two similarity functions are created. For other data types, we simply use exact matching.

• Specification creator. We use specification-based instance matching for this experiment. That is, the thresholds for the determiner and the similarity aggregator would normally have to be declared. However, because we do not focus on testing accuracy, we only declare the similarity aggregator. We choose average aggregation as the default for the experiment. The last two components work on the basis of this specification.
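The property alignment and blocking settings above can be illustrated with a minimal Python sketch. It is not the ScSLINT implementation: the representation of a repository as dictionaries, the function names, and the choice to compute the Jaccard index over sets of property values are all assumptions made for illustration (the actual definition is the one in Section 2.3.1.1).

```python
from collections import defaultdict


def jaccard(a: set, b: set) -> float:
    """Jaccard index |a ∩ b| / |a ∪ b|; defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def align_properties(source_values, target_values, threshold=0.75):
    """Return (source property, target property) pairs whose similarity exceeds the threshold.

    source_values / target_values map a property name to the set of values it
    takes in the repository (an assumed representation).
    """
    return [
        (sp, tp)
        for sp, sv in source_values.items()
        for tp, tv in target_values.items()
        if jaccard(sv, tv) > threshold
    ]


def first_token(value: str) -> str:
    """First whitespace-separated token, lower-cased; empty string if none."""
    tokens = value.lower().split()
    return tokens[0] if tokens else ""


def generate_candidates(source, target, mappings):
    """Pair instances that share the first token under at least one mapping.

    source / target map an instance id to a dict {property: string value}.
    """
    # Index target instances by (target property, first token).
    index = defaultdict(set)
    for tid, props in target.items():
        for _, tp in mappings:
            tok = first_token(props.get(tp, ""))
            if tok:
                index[(tp, tok)].add(tid)

    # Look up each source instance through every mapping.
    candidates = set()
    for sid, props in source.items():
        for sp, tp in mappings:
            tok = first_token(props.get(sp, ""))
            if tok:
                for tid in index.get((tp, tok), ()):
                    candidates.add((sid, tid))
    return candidates
```

Indexing the target side once and probing it with each source instance avoids enumerating all source-target pairs, which is the purpose of blocking.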

The computer used in the experiment is equipped with one Intel Core i7-4770K CPU and 16GB of memory. The CPU is a quad-core architecture in which each physical core provides two virtual cores. Therefore, we enable multi-threading with 8 threads for the similarity aggregator component.
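As a rough sketch of how average aggregation over generated similarity functions could be spread across 8 worker threads, consider the following Python code. The function names and the data layout are hypothetical, TF-IDF cosine is left out to keep the example short, and the normalization of the Levenshtein distance by the longer string length is an assumption rather than the definition used by ScSLINT.

```python
from concurrent.futures import ThreadPoolExecutor


def levenshtein(a: str, b: str) -> int:
    """Edit distance computed by dynamic programming with a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = cur
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Distance normalized to [0, 1] by the longer length (assumed normalization)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def exact_match(a: str, b: str) -> float:
    """Exact matching, as used for non-string data types."""
    return 1.0 if a == b else 0.0


def average_score(pair, functions):
    """Average of all similarity functions for one candidate pair.

    functions is a list of (source property, target property, function).
    """
    src, tgt = pair
    scores = [f(src.get(sp, ""), tgt.get(tp, "")) for sp, tp, f in functions]
    return sum(scores) / len(scores)


def aggregate(pairs, functions, workers=8):
    """Score every candidate pair, distributing the work over a thread pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda p: average_score(p, functions), pairs))
```

The sketch only shows the structure of the computation. In CPython, threads do not speed up pure-Python CPU-bound work, so a process pool or a natively compiled implementation would be needed to actually benefit from the 8 hardware threads.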

3.3.4 Results

We report the runtime of ScSLINT in two respects: the time for building the indexes and the time for the whole matching process.

3.3.4.1 Indexing

The runtimes for building the indexes are as follows.

• NYTimes location: 0.069 seconds.

• NYTimes organization: 1.006 seconds.

• NYTimes people: 1.908 seconds.

• Geonames: 1,362 seconds.

• DBpedia: 2,874 seconds.

• Freebase: 25,200 seconds.

In general, index construction is fast and its time is linear in the size of the repository. ScSLINT creates the index for Freebase in about seven hours (25,200 seconds) on an ordinary desktop computer. Notably, Freebase is 198GB while its indexes take 74GB in total; that is, the index structures are about 63% smaller than the original data. In short, ScSLINT offers an index that is fast to build, sufficient for instance matching tasks, and compact in size.

3.3.4.2 Matching process

The detailed runtime of the matching process is reported in Table 3.3. In this table, the second and third columns give the number of candidates and the number of similarity functions. These two factors are the source of complexity for the similarity aggregator component. The other columns give the runtime of the respective components. According to this table, the most expensive component is the similarity aggregator, because it has to execute many string similarity measures. Therefore, the candidate generator plays a very important role in reducing the number of candidates.

Table 3.3: Runtime of ScSLINT. (Unit: second)

| Dataset | #Candidates (×10⁶) | #Similarity functions | Property alignment | Blocking | Similarity aggregator |
|---|---|---|---|---|---|
| D1 | 32.2 | 12 | 37 | 7 | 70 |
| D2 | 38.2 | 25 | 43 | 8 | 268 |
| D3 | 61.7 | 17 | 46 | 11 | 404 |
| D4 | 46.9 | 22 | 46 | 11 | 251 |
| D5 | 222.7 | 23 | 14 | 111 | 641 |
| D6 | 357.4 | 16 | 14 | 268 | 1023 |
| D7 | 620.1 | 18 | 15 | 507 | 1578 |
| D8 | 45,286.4 | 12 | 52 | 11,807 | 345,495 |

The longest matching time for NYTimes data is 36 minutes, on D7, which requires 11.16×10⁹ comparisons. On the simplified version of this dataset, which is 2000 times smaller, LIMES and SILK, which are among the state-of-the-art frameworks, require almost 30 minutes and 9.5 hours, respectively [98]. Because no experiment on billion-scale datasets such as D1 to D7 had been reported, we tested the SILK [155] and LIMES [98] frameworks on these datasets. Both LIMES and SILK failed to finish the matching task on every dataset, even within a period ten times longer than the runtime of ScSLINT. In such cases, we terminated the processes after that period.

The result of ScSLINT on D8 is impressive. ScSLINT finishes this task, and is therefore not only the first framework tested on billion-scale datasets (D1 to D7) but also the first framework to successfully perform a trillion-scale matching task. The total time to finish the matching on D8 is 99 hours. ScSLINT can thus process data as large as D8 on an ordinary computer. That is, the framework demonstrates high scalability and could benefit further from being deployed with other big data processing frameworks or on a more powerful machine.
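The figures quoted in the text can be cross-checked against Table 3.3 with a short Python sketch. It assumes that the number of comparisons is the number of candidates multiplied by the number of similarity functions, which reproduces the 11.16×10⁹ reported for D7, and that the total matching time is the sum of the three listed component runtimes. The variable and function names are illustrative only.

```python
# Rows of Table 3.3: candidates (millions), similarity functions,
# then runtimes in seconds for property alignment, blocking, aggregator.
TABLE_3_3 = {
    "D1": (32.2, 12, 37, 7, 70),
    "D2": (38.2, 25, 43, 8, 268),
    "D3": (61.7, 17, 46, 11, 404),
    "D4": (46.9, 22, 46, 11, 251),
    "D5": (222.7, 23, 14, 111, 641),
    "D6": (357.4, 16, 14, 268, 1023),
    "D7": (620.1, 18, 15, 507, 1578),
    "D8": (45286.4, 12, 52, 11807, 345495),
}


def comparisons(dataset: str) -> float:
    """Assumed cost model: one comparison per candidate per similarity function."""
    candidates_millions, functions = TABLE_3_3[dataset][:2]
    return candidates_millions * 1e6 * functions


def total_seconds(dataset: str) -> int:
    """Sum of the three component runtimes listed in the table."""
    return sum(TABLE_3_3[dataset][2:])


for name in TABLE_3_3:
    print(f"{name}: {comparisons(name) / 1e9:.2f}e9 comparisons, "
          f"{total_seconds(name)} s ({total_seconds(name) / 3600:.2f} h)")
```

Under these assumptions, the three components sum to 2,100 seconds (35 minutes) on D7, close to the 36 minutes reported, and to 357,354 seconds (about 99 hours) on D8.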
