Algorithm for Computing Skyline ObjectSet on Numerical Databases

(1)

Algorithm for Computing Skyline ObjectSet on Numerical Databases

Md. Anisuzzaman Siddique

∗

Tauﬁk Djatna

Yasuhiko Morimoto

Hiroshima University

1-7-1 Kagamiyama, Higashi-Hiroshima, 739-8521, Japan

{d074370@, tauﬁkdjatna@, morimoto@mis.}hiroshima-u.ac.jp, +81-82-424-5579

1 Introduction

The skyline query has recently attracted considerable at-tention due to its importance in many applications such as multi-criteria decision making, data mining, and user pref-erence queries [7]. Given a set of attributes, and a database

D, an object p is said to be in skyline of D if there is no

object q in D such that q is better than p in all dimension. If there exist such a q, then we say that p is dominated by q or

q dominates p.

Figure 1 shows a typical example of conventional sky-line. The table in the ﬁgure is a list of hotels, each of which contains two numerical attributes distance and price, for on-line booking. In the list, the best choice to a client comes from the skyline i.e. any of {h1, h3, h4} (See Figure 1

(b)). A number of efﬁcient algorithms for computing sky-line have been reported in the literature [1, 2, 5, 6, 3].

Though the skyline queries give us important informa-tion in the database, they are insufficient if a customer is interested for more than one object at a time in a single query. For example, if a customer has to reserve several different hotels, she or he has to find a best set of hotels. This paper proposes new approach to create skyline for objectsets that can give significant information for the set optimization problems.

Motivated Example

Assume that an organizing committee is going to or-ganize a conference/workshop and for all participants they need to reserve three hotels around the conference venue as we discuss in Figure 1. If the organizer select three hotels

{h1, h3, h4} using conventional skyline method might

be a wrong decision. Because participants don’t want to stay at hotel h1, which is long far from the conference

venue and the organizer also don’t want to reserve hotel

h4 because of high price. So it is clear that conventional

skyline doesn’t provide sufﬁcient information for group choice. To solve this problem we propose new function called skyline objectset query.

ID Price Dist h1 3 8 h2 5 4 h3 4 3 h4 9 2 h₅ 7 3

Figure 1. Skyline Example

Figure 2 (a) is a list of 3-objectsets, in which all of the combinations of three hotels are listed. The “h123” is a set

of{h1,h2,h3}. “Dis” and “Pri” of “h123” is the sum of the

“distance” and “price”, resp., of hotels in the set. If we are looking for best three hotels in a certain criterion, the skyline objectset with the size of 3 contains the preferable answer. As we can see in Figure 2 (b), we have ﬁve alterna-tives{h123, h135, h235, h234, h345} in the database.

To the best of our knowledge, the problem of skyline objectset has not been investigated by any of the previous related work.

The contributions of this paper are as follows: i) We in-troduce new concept skyline objectset and ii) develop an ef-ﬁcient algorithm for computing the skyline objectset. And, iii) we performed intensive experiments.

2 Skyline ObjectSet Problem

Given a database having n records. A user speciﬁes the size of objectset s and a subspace having k attributes among attributes in the database. In the rest of this paper, we con-sider the projected database D having k attributes and n records without losing generality of the problem.

(2)

ID Pri Dis ID Pri Dis h₁₂₃ 12 15 h₁₄₅ 19 13 h124 17 14 h234 18 9 h125 15 15 h235 16 10 h134 16 13 h245 21 9 h₁₃₅ 14 14 h₃₄₅ 20 8

Figure 2. ObjectSet Skyline

Deﬁnition 1 (s-ObjectSet) A s-objectset is a set of s

records among records in D.

We assume s is a relatively small number such that 2≤ s ≤ 10 in this paper though we can compute s-objectsets for much larger s. Let S be a set of all s-objectsets in D. Note that the number of s-objectsets in D is nCs = _(n−s)!s!n! ,

which we denote the number |S|. Now we consider a database of S on the k dimensional space of D. Each record of the database is an s-objectset in D and the value of each attribute (each dimension) is the sum of the s records in the objectset.

Deﬁnition 2 (Dominating objectset) An objectset p ∈ S

is said to dominate another objectset q ∈ S if p is same or better than q on all attributes and better than on at lease one attribute.

Deﬁnition 3 (Skyline ObjectSet) An objectset p ∈ S is

said to be skyline objectset if p is not dominated by any other objectset in S.

The problem of skyline objectset aims to ﬁnd those sky-line objectsets and help users to take decision within the list of objectsets.

3 Algorithm for Skyline ObjectSet

After computing all the records in S, the problem of sky-line objectset can be solved by conventional skysky-line query algorithms. However, since there are nCs = _(n−s)!s!n!

records in S, it is not affordable to compute all the records in S. In this paper, we consider an algorithm for ﬁnding all skyline objectsets in S on the convex hull of S efﬁciently without computing all the records in S.

3.1 Touching Oracle

Each record in S can be represented as a k-dimensional point x = (x1, x2, ..., xk) where coordinate value of the

Figure 3. Convex Hull and Touching Oracle

Table 1. Inner Product with Tangent Lines a (Θ₁, a) (Θ₂, a) (Θ_1,2, a) h₁ -3 -8 -80 h₂ -5 -4 -68 h₃ -4 -3 -53 h₄ -9 -2 -86 h₅ -7 -3 -77

point are values of k attributes of the record. Therefore, the database S is a set of |S| points in the k-dimensional space. Points on the convex hull of the set of |S| points can be computed efﬁciently by, so called, touching oracle function.

Touching oracle function ﬁnds the tangent point of the convex hull of the set of|S| points and a hyperplane whose normal vector is Θ.

Example

Consider the examples of Figure 1 and Figure 2 again. There are 10 points in |S| if the size of an objectset is three. Since there are two attributes in the database, those 10 points are in two-dimensional space as in Figure 2 (b). The dotted polygon in Figure 3 is the convex hull of the 10 points. In the example, there are ﬁve records in the origi-nal databases as in Figure 1. Each of the ﬁve record is also represented as a two-dimensional point, which we call an atomic point and denote a.

Assume there is a line whose normal vector is Θ1 =

(−1, 0) in the two-dimensional space. In order to ﬁnd the tangent point with the line and the convex hull without pre-computing all points in S, we compute inner products of the normal vector and each atomic points, (Θ1, a) as in the ﬁrst

column of Table 1. We choose the top three inner products, i.e.,{h1, h2, h3}. Those top three inner products composes

the tangent point (12, 15), which is the 3-objectset, h123.

Similarly, for a line with Θ2 = (0,−1), we can ﬁnd

3-objectset{h3, h4, h5} is the top three in the inner products,

(Θ2, a). It is the tangent point (20, 8).

Note that once we specify a normal vector, we can ﬁnd

(3)

the tangent point with convex hull, which is a skyline s-objectset, in O(n) by the touching oracle function.

3.2 Convex Hull Search

First of all, we compute initial k tangent points that can be computed by touching oracle with initial k vectors Θ = (θ1, θ2, ..., θk) where θi =−1 if i = x, otherwise θi = 0

for each x = 1, ..., k.

After computing the initial k tangent points, we have the initial facet (k-dimensional hyperplane). Next, we compute the normal vector of the facet. In the example above, we have initial two tangent points p1 = (12, 15) and p2 =

(20, 8) with initial two vectors (−1, 0) and (0, −1). Using the facet containing the two initial points, we can compute the normal vector of the facet as Θ1,2 = ((12−20), −(15−

8)). Using the normal vector, we can ﬁnd the tangent point

h235, which is (16, 10). The new tangent point divide the

initial facet into two facets. We recursively compute tan-gent point for each facet until there is no new point outside every facets.

We can apply this recursive operation for higher k-dimensional space as we performed in [4]. In the k-dimensional case, new tangent point, which is found by the touching oracle, divides the initial facet into k facets.

In higher dimensional case, the normal vector of facet can be computed as follows:

3D case: Assume we have three points P 1 = (p11, p12, p13), P 2 = (p21, p22, p23) and P 3 =

(p31, p32, p33). If we imagine three points from three edges

in the plane then we can take two of the edges and calculate cross product between them. Suppose the edges vectors are

V 1 = (v11, v12, v13) = (p21, p22, p23)− (p11, p12, p13)

and V 2 = (v21, v22, v23) = (p31, p32, p33) −

(p11, p12, p13). Then the normal is deﬁned as the

expan-sion of the following symbolic determinant.

V 1⊗ V 2 = e1 e2 e3 v11 v12 v13 v21 v22 v23

Where e1, e2 and e3 are the elementary vectors (1,0,0),

(0,1,0) and (0,0,1) respectively.

4D case: Again assume that we have four points P 1 = (p11, p12, p13, p14), P 2 = (p21, p22, p23, p24), P 3 =

(p31, p32, p33, p34) and P 4 = (p41, p42, p43, p44).

Us-ing similar concept of 3D case we can compute three vectors V 1 = (v11, v12, v13, v14) = P 2 − P 1,

V 2 = (v21, v22, v23, v24) = P 3 − P 1 and V 3 =

(v31, v32, v33, v34) = P 4− P 1. Then the normal vector

can be deﬁned as the expansion of the following symbolic

determinant V 1⊗ V 2 ⊗ V 3 = e1 e2 e3 e4 v11 v12 v13 v14 v21 v22 v23 v24 v31 v32 v33 v34 where the e1, e2, e3 and e4 are the elementary vectors

(1,0,0,0), (0,1,0,0), (0,0,1,0) and (0,0,0,1).

kD case: The same concept can be expand for

k-dimensional case. Assume we have P 1 = (p11, p12, ..., p1k), P 2 = (p21, p22, ..., p2k), · · ·,

P k = (pk1, pk2, ..., pkk). Now, we can calculate (k− 1)

vectors like V 1, V 2, · · ·, V_k−1. Then, the normal can be deﬁned as the expansion of the following determinant

V 1⊗ · · · ⊗ V (k − 1) = e1 · · · ek v11 · · · v1k · · · · · · · · · v(k− 1)1 · · · v(k − 1)k .

4 Experiments

We have conducted a series of experiments to evaluate the performance of our technique using both real data sets and synthetic data sets. Due to the space limitation, we only present the peformance for synthetic datasets in this extended abstract.

We restrict dimensionality within ﬁve, because from the user point of view we believe that higher dimension will in-crease the decision complexity and the result of skyline be-comes useless. The proposed algorithm was implemented using Java. All the experiments were conducted on a PC with an Intel(R) Core2 Duo, 2 GHz CPU and 3 GB main memory, running on Microsoft Windows XP operating sys-tems.

In Figure 4, we show the number of the skyline object-set with the inﬂuence of objectobject-set size on different dataobject-set. Three different datasets with cardinality 10k, 50k and 100k were used in this experiment. We vary objectset size be-tween 2 to 10. The results show that no. of skyline objectset maintain a positive correlation with objectset size i.e. when the objectset size increase, the number of returned skyline objectset also increase as well.

Figure 5 shows that 2D, 3D and 4D cases response time for the synthetic dataset with 100k. It has been seen that when we increase the objectset size the response time on each dimension also increases. The results maintain a pos-itive correlation with different objectset size. It is ﬂuctuate for 5D case but also maintains a positive correlation.

In order to evaluate the effect of cardinality, we use small as well as large datasets with cardinality 1k, 10k, 50k and 100k. For this experiment, we ﬁxed the objectset size with

(4)

Figure 4. Num. of Sky. varying ObjectSet Size

Figure 5. Time varying ObjectSet Size

10. Figure 6 shows that for small cardinality dataset, the response time for all dimensions also small and for larger cardinality datasets, the response time increases. For 2D, 3D and 4D cases, difference of the response time is small and increases gradually but for 5D case, it is large compared with others.

5 Conclusion

All existing database algorithms for skyline computation have deﬁciencies regarding to ﬁnd group of objects at a

Figure 6. Time varying Dataset Size

time. In this paper, we proposed an algorithm to compute skyline for group of objects. We demonstrated that object-set skyline can be extended for high dimensional space and also for large objectset size. Experimental results showed that the proposed algorithm can ﬁnd skyline objectset efﬁ-ciently. In conclusion, we can say that the objectset skyline gracefully extends traditional skyline queries.

References

[1] S. Borzsonyi, D. Kossmann, and K. Stocker. The skyline op-erator. In Proceedings of ICDE, pages 421–430, 2001. [2] D. Kossmann, F. Ramsak, and S. Rost. Shooting stars in the

sky: An online algorithm for skyline queries. In Proceedings

of VLDB Conference, pages 275–286, 2002.

[3] C. Li, B. Ooi, A. Tung, and S. Wang. Dada: A data cube for dominant relationship analysis. In Proceedings of ACM

SIGMOD Conference, pages 659–670, 2006.

[4] Y. Morimoto, T. Fukuda, H. Matsuzawa, K. Yoda, and T. Tokuyama. Algorithms for mining association rules for binary segmentations of huge categorical databases. In

Pro-ceedings of VLDB Conference, pages 380–391, 1998.

[5] D. Papadias, Y. Tao, G. Fu, and B. Seeger. An optimal and progressive algorithm for skyline queries. In Proceedings of

ACM SIGMOD Conference, pages 467–478, 2003.

[6] K.-L. Tan, P.-K. Eng, and B. C. Ooi. Efﬁcient progressive skyline computation. In Proceedings of VLDB Conference, pages 301–310, 2001.

[7] T. Xia, D. Zhang, and Y. Tao. On sky lining with ﬂexible dominance relation. In Proceedings of ICDE, pages 1397– 1399, 2008.