Analyses of Efficiency and Effectiveness - A Study on Keyword Search over Hybrid XML-Relational

6. Experiments

6.2 Analyses of Efficiency and Effectiveness

According to the pruning rule (Section 5.1.3), we judge the join-trees by calculating the LCAs of keyword-related XMLs during join-tree generation. Consequently, we test how many invalid join-trees can be deleted. Fig.6-3 (a) is a bar chart to show the propor-tion of pruned join-trees to total join-trees. It shows about 77% join-trees are pruned not to be executed when the query has 5 keywords. For the total testing set of these 80 queries, we can achieve 62.76% pruning ratio. Fig.6-3 (b) shows the details of the join-tree numbers. To do keyword search of the testing 80 queries, 1939 join-join-trees are generated, and 722 join-trees are executed to get the final results. 1217 join-trees are pruned, because they cannot conduce to any result. It can be asserted that the join-tree pruning method makes keyword search on XML-RDB be more efficient.

(a) Total

Join-Trees

Total Pruned Join-Trees

Total Executed Join-Trees

The Number

of Keyword Pruned/Total 149 8 141 2(20 queries) 5.37%

292 120 172 3(20 queries) 41.10%

444 273 171 4(20 queries) 61.49%

1054 815 239 5(20 queries) 77.32%

1939 1217 722 2~5

(80 queries) 62.76%

(b)

Fig.6-5 The ratio of the pruned join-trees in the total join-trees

Fig.6-6 The average time of join-tree generation

Fig.6-7 The execution time of queries

Next, we tested the time of join-tree generation. The time is counted from when the keywords are inputted to when all valid join-trees are outputted. We averaged the times of those queries with the same number of join-trees. In Fig.6-6, the number in the abscissa is that of total join-trees including pruned join-trees and executed join-trees.

Fig.6-6 shows when the number of join-trees increase, the average time of join-tree generation is also increased. The worst result is that the generation time is over 3 seconds when 105 join-trees are generated. However it costs a little time to judge

0 0.5 1 1.5 2 2.5 3 3.5

4 9 13 15 18 24 27 40 64 70 105

Time (sec.)

The number of join-trees for each query Average time of join-tree generation (sec.)

0 100 200 300 400

2222222222333333333344444444445555555555

Time (sec.)

The number of keywords in queries Execution time of queries (sec.)

numerous join-trees, join-tree generation is efficient for the entire process of keyword search on XML-RDB to avoid executing the invalid join-trees.

Finally, in Fig.6-7 we showed the execution time of all queries that are the 80 2~5-keywords queries. The time is counted from when all valid join-trees are inputted into Query Generator to when all valid join-trees are executed to get all results. Here, Fig.6-7 excludes the time of showing all results in User Interface. Most of those queries can execute all join-trees in 5 seconds on average. But there are 3 queries that cost more than one minute. The reason of these accidental cases is to do XRjoin using plain repetition that computed more than one million LCAs. There are researches of some efficient algorithms to save the XML computing time, which are like the stack-based algorithm of XRANK [11] and so on.

z Evaluation of XRjoin execution

For different join-trees, different XRjoins will be executed. As a part of join-trees, the execution time of each XRjoin type was picked up and calculated. Fig.6-8 shows the experimental results of average execution times of XRjoin type 1 (XRjoin 1), XRjoin type 2 (XRjoin 2), XRjoin type 3 (XRjoin 3) and EXRjoin (XRjoin 4). In Fig.6-8, the abscissa is the number of the keywords that are included in the XRjoin. For example, XRjoin 4 is shown as light blue lines, which contains 2 or 3 or 4 keywords hitting the hybrid entity and the relational entity.

When 80 queries are input into the hybrid XML-RDB system, we computed the av-erage execution time of all XRjoins shown in Fig.6-8 (a). Then we classified the two groups of the worst situation and the normal situation mentioned above.

Fig.6-8 (b) shows the average execution time of XRjoins under the worst situation.

As expected, the times of XRjoin 1, XRjoin 2 and XRjoin 3 are inexpensive. XRjoin 4 costs over 13 seconds, because the experiment includes four cases to calculate abundant LCAs over 20 seconds. In the worst case, we observed that XRjoin 4 calculated 1,167,399 LCAs over 3 minutes.

(a) Total of Query Group 1 and Group 2 (80 queries)

(b) Query Group 1 for the worst-case test (40 queries)

0 2 4 6 8 10

2 3 4 5

Average Execution Time (sec.)

Keyword Number

Average Execution Time for Each XRjoin (Total)

XRjoin 1 XRjoin 2 XRjoin 3 XRjoin 4

0 2 4 6 8 10 12 14

2 3 4 5

Average Execution Time (sec.)

Keyword Number

Average Execution Time for Each XRjoin (Query Group 1)

XRjoin 1 XRjoin 2 XRjoin 3 XRjoin 4

0 2 4 6 8 10

2 3 4 5

Average Execution Time (sec.)

Keyword Number

Average Execution Time for Each XRjoin (Query Group 2)

XRjoin 1 XRjoin 2 XRjoin 3 XRjoin 4

Fig.6-8 (c) shows the average execution time of XRjoins under the normal situation.

As expected, the times of all XRjoins are small. XRjoin 1, XRjoin 2, XRjoin 3 and XRjoin 4 with keywords less than 4 are executed within 1 second, and XRjoin 4 with 4 keywords costs 2.635 seconds.

z Evaluation of hybrid results

To evaluate the quality of hybrid results, we presented the results of keyword search on our XML-RDB system using an interface of a web browser. Fig.6-9 shows an example when Query = {sanjay, link} is given and the join-tree (Conference->Paper<-Authors) is selected. The original data about “link” is a session title of conference VLDB 2004, and “sanjay” is the first name of an author.

As shown in Fig.6-9, the results are derived from the join-tree (in the red box) which does XRjoin twice and natural join once (in the blue box). These results are listed by order of the sum of two sub-trees’ scores that are computed when XRjoin works.

The answer of No.1 in Fig.6-9 means that an author named “sanjay Agrawal” has written a paper (the id is underlined by blue line) in the conference VLDB 2004, which has a session hit by “link”.

The answer of No.2 in Fig.6-9 means that an author named “sanjay E. Sama” has written a paper (the id is underlined by blue line) in the conference VLDB 2004, which has a session hit by “link”. These answers cannot be obtained by existing techniques.

Fig.6-10 shows another example when Query = {security, privacy, model} is given and the join-tree (Conference->Paper->Citation<-Paper) is selected. The original data about “security” and “privacy” hit the XML sub-trees of a conference VLDB 1978, and

“model” is the title of a paper, which cited the paper of the conference VLDB 1978 or was cited by the paper of VLDB 1978. In Fig.6-10, the red arrows show the relationships between papers.

The answer of No.1 in Fig.6-10 means the paper including “model” was cited by a paper in the session of VLDB 1978 hit by “security” and “privacy” at the same time.

Fig.6-11 shows the last example when Query = {machine, database, kitsuregawa} is given and the join-tree (Conference->Paper<-Authors) is selected. Different from Query

= {sanjay, link}, this case is three entities Conference, Paper, Authors includes three keywords respectively. The original data about “machine” hit the XML sub-trees of a conference. The keyword “database” is the title of a paper. The keyword “kitsuregawa”, is the last name of an author. The results in Fig.6-11 consist of two kinds of XML sub-trees and the relational data of papers.

The answer of No.1 in Fig.6-11 means that an author named “Masaru Kitsuregawa”

has written a paper (the id is underlined by blue line) in the conference ICDE 1990. The paper by “Kitsuregawa” and the paper including “database” are in the same session of ICDE 1990 hit by “machine”.

Fig.6-9 Hybrid results of Query = {link, sanjay}

Fig.6-10 Hybrid results of Query = {security, privacy, model}

Fig.6-11 Hybrid results of Query = {machine, database, kitsuregawa}

ドキュメント内 A Study on Keyword Search over Hybrid XML-Relational Databases (ページ 84-94)