• 検索結果がありません。

Algorithm for Computing Skyline ObjectSet on Numerical Databases

N/A
N/A
Protected

Academic year: 2021

シェア "Algorithm for Computing Skyline ObjectSet on Numerical Databases"

Copied!
4
0
0

読み込み中.... (全文を見る)

全文

(1)

Algorithm for Computing Skyline ObjectSet on Numerical Databases

Md. Anisuzzaman Siddique

Taufik Djatna

Yasuhiko Morimoto

Hiroshima University

1-7-1 Kagamiyama, Higashi-Hiroshima, 739-8521, Japan

{d074370@, taufikdjatna@, morimoto@mis.}hiroshima-u.ac.jp, +81-82-424-5579

1

Introduction

The skyline query has recently attracted considerable at-tention due to its importance in many applications such as multi-criteria decision making, data mining, and user pref-erence queries [7]. Given a set of attributes, and a database

D, an object p is said to be in skyline of D if there is no

object q in D such that q is better than p in all dimension. If there exist such a q, then we say that p is dominated by q or

q dominates p.

Figure 1 shows a typical example of conventional sky-line. The table in the figure is a list of hotels, each of which contains two numerical attributes distance and price, for on-line booking. In the list, the best choice to a client comes from the skyline i.e. any of {h1, h3, h4} (See Figure 1

(b)). A number of efficient algorithms for computing sky-line have been reported in the literature [1, 2, 5, 6, 3].

Though the skyline queries give us important informa-tion in the database, they are insufficient if a customer is interested for more than one object at a time in a single query. For example, if a customer has to reserve several different hotels, she or he has to find a best set of hotels. This paper proposes new approach to create skyline for objectsets that can give significant information for the set optimization problems.

Motivated Example

Assume that an organizing committee is going to or-ganize a conference/workshop and for all participants they need to reserve three hotels around the conference venue as we discuss in Figure 1. If the organizer select three hotels

{h1, h3, h4} using conventional skyline method might

be a wrong decision. Because participants don’t want to stay at hotel h1, which is long far from the conference

venue and the organizer also don’t want to reserve hotel

h4 because of high price. So it is clear that conventional

skyline doesn’t provide sufficient information for group choice. To solve this problem we propose new function called skyline objectset query.

ID Price Dist h1 3 8 h2 5 4 h3 4 3 h4 9 2 h5 7 3

Figure 1. Skyline Example

Figure 2 (a) is a list of 3-objectsets, in which all of the combinations of three hotels are listed. The “h123” is a set

of{h1,h2,h3}. “Dis” and “Pri” of “h123” is the sum of the

“distance” and “price”, resp., of hotels in the set. If we are looking for best three hotels in a certain criterion, the skyline objectset with the size of 3 contains the preferable answer. As we can see in Figure 2 (b), we have five alterna-tives{h123, h135, h235, h234, h345} in the database.

To the best of our knowledge, the problem of skyline objectset has not been investigated by any of the previous related work.

The contributions of this paper are as follows: i) We in-troduce new concept skyline objectset and ii) develop an ef-ficient algorithm for computing the skyline objectset. And, iii) we performed intensive experiments.

2

Skyline ObjectSet Problem

Given a database having n records. A user specifies the size of objectset s and a subspace having k attributes among attributes in the database. In the rest of this paper, we con-sider the projected database D having k attributes and n records without losing generality of the problem.

   

(2)

ID Pri Dis ID Pri Dis h123 12 15 h145 19 13 h124 17 14 h234 18 9 h125 15 15 h235 16 10 h134 16 13 h245 21 9 h135 14 14 h345 20 8

Figure 2. ObjectSet Skyline

Definition 1 (s-ObjectSet) A s-objectset is a set of s

records among records in D.

We assume s is a relatively small number such that 2≤ s ≤ 10 in this paper though we can compute s-objectsets for much larger s. Let S be a set of all s-objectsets in D. Note that the number of s-objectsets in D is nCs = (n−s)!s!n! ,

which we denote the number |S|. Now we consider a database of S on the k dimensional space of D. Each record of the database is an s-objectset in D and the value of each attribute (each dimension) is the sum of the s records in the objectset.

Definition 2 (Dominating objectset) An objectset p ∈ S

is said to dominate another objectset q ∈ S if p is same or better than q on all attributes and better than on at lease one attribute.

Definition 3 (Skyline ObjectSet) An objectset p ∈ S is

said to be skyline objectset if p is not dominated by any other objectset in S.

The problem of skyline objectset aims to find those sky-line objectsets and help users to take decision within the list of objectsets.

3

Algorithm for Skyline ObjectSet

After computing all the records in S, the problem of sky-line objectset can be solved by conventional skysky-line query algorithms. However, since there are nCs = (n−s)!s!n!

records in S, it is not affordable to compute all the records in S. In this paper, we consider an algorithm for finding all skyline objectsets in S on the convex hull of S efficiently without computing all the records in S.

3.1

Touching Oracle

Each record in S can be represented as a k-dimensional point x = (x1, x2, ..., xk) where coordinate value of the

Figure 3. Convex Hull and Touching Oracle

Table 1. Inner Product with Tangent Lines a 1, a) 2, a) 1,2, a) h1 -3 -8 -80 h2 -5 -4 -68 h3 -4 -3 -53 h4 -9 -2 -86 h5 -7 -3 -77

point are values of k attributes of the record. Therefore, the database S is a set of |S| points in the k-dimensional space. Points on the convex hull of the set of |S| points can be computed efficiently by, so called, touching oracle function.

Touching oracle function finds the tangent point of the convex hull of the set of|S| points and a hyperplane whose normal vector is Θ.

Example

Consider the examples of Figure 1 and Figure 2 again. There are 10 points in |S| if the size of an objectset is three. Since there are two attributes in the database, those 10 points are in two-dimensional space as in Figure 2 (b). The dotted polygon in Figure 3 is the convex hull of the 10 points. In the example, there are five records in the origi-nal databases as in Figure 1. Each of the five record is also represented as a two-dimensional point, which we call an atomic point and denote a.

Assume there is a line whose normal vector is Θ1 =

(−1, 0) in the two-dimensional space. In order to find the tangent point with the line and the convex hull without pre-computing all points in S, we compute inner products of the normal vector and each atomic points, (Θ1, a) as in the first

column of Table 1. We choose the top three inner products, i.e.,{h1, h2, h3}. Those top three inner products composes

the tangent point (12, 15), which is the 3-objectset, h123.

Similarly, for a line with Θ2 = (0,−1), we can find

3-objectset{h3, h4, h5} is the top three in the inner products,

(Θ2, a). It is the tangent point (20, 8).

Note that once we specify a normal vector, we can find

(3)

the tangent point with convex hull, which is a skyline s-objectset, in O(n) by the touching oracle function.

3.2

Convex Hull Search

First of all, we compute initial k tangent points that can be computed by touching oracle with initial k vectors Θ = 1, θ2, ..., θk) where θi =−1 if i = x, otherwise θi = 0

for each x = 1, ..., k.

After computing the initial k tangent points, we have the initial facet (k-dimensional hyperplane). Next, we compute the normal vector of the facet. In the example above, we have initial two tangent points p1 = (12, 15) and p2 =

(20, 8) with initial two vectors (−1, 0) and (0, −1). Using the facet containing the two initial points, we can compute the normal vector of the facet as Θ1,2 = ((12−20), −(15−

8)). Using the normal vector, we can find the tangent point

h235, which is (16, 10). The new tangent point divide the

initial facet into two facets. We recursively compute tan-gent point for each facet until there is no new point outside every facets.

We can apply this recursive operation for higher k-dimensional space as we performed in [4]. In the k-dimensional case, new tangent point, which is found by the touching oracle, divides the initial facet into k facets.

In higher dimensional case, the normal vector of facet can be computed as follows:

3D case: Assume we have three points P 1 = (p11, p12, p13), P 2 = (p21, p22, p23) and P 3 =

(p31, p32, p33). If we imagine three points from three edges

in the plane then we can take two of the edges and calculate cross product between them. Suppose the edges vectors are

V 1 = (v11, v12, v13) = (p21, p22, p23)− (p11, p12, p13)

and V 2 = (v21, v22, v23) = (p31, p32, p33)

(p11, p12, p13). Then the normal is defined as the

expan-sion of the following symbolic determinant.

V 1⊗ V 2 =    e1 e2 e3 v11 v12 v13 v21 v22 v23   

Where e1, e2 and e3 are the elementary vectors (1,0,0),

(0,1,0) and (0,0,1) respectively.

4D case: Again assume that we have four points P 1 = (p11, p12, p13, p14), P 2 = (p21, p22, p23, p24), P 3 =

(p31, p32, p33, p34) and P 4 = (p41, p42, p43, p44).

Us-ing similar concept of 3D case we can compute three vectors V 1 = (v11, v12, v13, v14) = P 2 − P 1,

V 2 = (v21, v22, v23, v24) = P 3 − P 1 and V 3 =

(v31, v32, v33, v34) = P 4− P 1. Then the normal vector

can be defined as the expansion of the following symbolic

determinant V 1⊗ V 2 ⊗ V 3 =     e1 e2 e3 e4 v11 v12 v13 v14 v21 v22 v23 v24 v31 v32 v33 v34     where the e1, e2, e3 and e4 are the elementary vectors

(1,0,0,0), (0,1,0,0), (0,0,1,0) and (0,0,0,1).

kD case: The same concept can be expand for

k-dimensional case. Assume we have P 1 = (p11, p12, ..., p1k), P 2 = (p21, p22, ..., p2k), · · ·,

P k = (pk1, pk2, ..., pkk). Now, we can calculate (k− 1)

vectors like V 1, V 2, · · ·, Vk−1. Then, the normal can be defined as the expansion of the following determinant

V 1⊗ · · · ⊗ V (k − 1) =     e1 · · · ek v11 · · · v1k · · · · · · · · · v(k− 1)1 · · · v(k − 1)k     .

4

Experiments

We have conducted a series of experiments to evaluate the performance of our technique using both real data sets and synthetic data sets. Due to the space limitation, we only present the peformance for synthetic datasets in this extended abstract.

We restrict dimensionality within five, because from the user point of view we believe that higher dimension will in-crease the decision complexity and the result of skyline be-comes useless. The proposed algorithm was implemented using Java. All the experiments were conducted on a PC with an Intel(R) Core2 Duo, 2 GHz CPU and 3 GB main memory, running on Microsoft Windows XP operating sys-tems.

In Figure 4, we show the number of the skyline object-set with the influence of objectobject-set size on different dataobject-set. Three different datasets with cardinality 10k, 50k and 100k were used in this experiment. We vary objectset size be-tween 2 to 10. The results show that no. of skyline objectset maintain a positive correlation with objectset size i.e. when the objectset size increase, the number of returned skyline objectset also increase as well.

Figure 5 shows that 2D, 3D and 4D cases response time for the synthetic dataset with 100k. It has been seen that when we increase the objectset size the response time on each dimension also increases. The results maintain a pos-itive correlation with different objectset size. It is fluctuate for 5D case but also maintains a positive correlation.

In order to evaluate the effect of cardinality, we use small as well as large datasets with cardinality 1k, 10k, 50k and 100k. For this experiment, we fixed the objectset size with

(4)

Figure 4. Num. of Sky. varying ObjectSet Size

Figure 5. Time varying ObjectSet Size

10. Figure 6 shows that for small cardinality dataset, the response time for all dimensions also small and for larger cardinality datasets, the response time increases. For 2D, 3D and 4D cases, difference of the response time is small and increases gradually but for 5D case, it is large compared with others.

5

Conclusion

All existing database algorithms for skyline computation have deficiencies regarding to find group of objects at a

Figure 6. Time varying Dataset Size

time. In this paper, we proposed an algorithm to compute skyline for group of objects. We demonstrated that object-set skyline can be extended for high dimensional space and also for large objectset size. Experimental results showed that the proposed algorithm can find skyline objectset effi-ciently. In conclusion, we can say that the objectset skyline gracefully extends traditional skyline queries.

References

[1] S. Borzsonyi, D. Kossmann, and K. Stocker. The skyline op-erator. In Proceedings of ICDE, pages 421–430, 2001. [2] D. Kossmann, F. Ramsak, and S. Rost. Shooting stars in the

sky: An online algorithm for skyline queries. In Proceedings

of VLDB Conference, pages 275–286, 2002.

[3] C. Li, B. Ooi, A. Tung, and S. Wang. Dada: A data cube for dominant relationship analysis. In Proceedings of ACM

SIGMOD Conference, pages 659–670, 2006.

[4] Y. Morimoto, T. Fukuda, H. Matsuzawa, K. Yoda, and T. Tokuyama. Algorithms for mining association rules for binary segmentations of huge categorical databases. In

Pro-ceedings of VLDB Conference, pages 380–391, 1998.

[5] D. Papadias, Y. Tao, G. Fu, and B. Seeger. An optimal and progressive algorithm for skyline queries. In Proceedings of

ACM SIGMOD Conference, pages 467–478, 2003.

[6] K.-L. Tan, P.-K. Eng, and B. C. Ooi. Efficient progressive skyline computation. In Proceedings of VLDB Conference, pages 301–310, 2001.

[7] T. Xia, D. Zhang, and Y. Tao. On sky lining with flexible dominance relation. In Proceedings of ICDE, pages 1397– 1399, 2008.

Figure 1 shows a typical example of conventional sky- sky-line. The table in the figure is a list of hotels, each of which contains two numerical attributes distance and price, for  on-line booking
Table 1. Inner Product with Tangent Lines a (Θ 1 , a) (Θ 2 , a) (Θ 1,2 , a) h 1 -3 -8 -80 h 2 -5 -4 -68 h 3 -4 -3 -53 h 4 -9 -2 -86 h 5 -7 -3 -77
Figure 6. Time varying Dataset Size

参照

関連したドキュメント

Further, form (4.13) will be basic for the study of the stability of the zero solution in the case when this problem can be solved by means of terms of the first and second powers

In fact, the only points on H 1 (C) with two preimages on C are the two critical points of the mating. Finally, note that per our construction all deformations of C by H can

Furthermore, computing the energy efficiency of all servers by the proposed algorithm and Hadoop MapReduce scheduling according to the objective function in our model, we will get

In Section 3, we study the determining number of trees, providing a linear time algorithm for computing minimum determining sets.. We also show that there exist trees for which

In this paper, motivated by Yamada’s hybrid steepest-descent and Lehdili and Moudafi’s algorithms, a generalized hybrid steepest-descent algorithm for computing the solutions of

A wave bifurcation is a supercritical Hopf bifurcation from a stable steady constant solution to a stable periodic and nonconstant solution.. The bifurcating solution in the case

In this paper we show how to obtain a result closely analogous to the McAlister theorem for a certain class of inverse semigroups with zero, based on the idea of a Brandt

In this paper, this problem will be solved for the case N = 2, for tested convex sets of class C 4 and testing convex sets of class C 2 , as stated in Theorem 2.2 below. From now on,