Discovery of Closed Hyperclique Patterns in a Sequence of Graphs

(1)

1

International Workshop on Data-Mining and Statistical Science (DMSS2008)

Discovery of Closed Hyperclique Patterns in

a Sequence of Graphs

Tomonobu Ozaki

Organization of Advanced Science and Technology, Kobe University

tozaki@ cs.kobe-u.ac.jp, http://www25.cs.kobe-u.ac.jp/˜tozaki/

Takenao Ohkawa

Graduate School of Engineering, Kobe University

ohkawa@ kobe-u.ac.jp, http://www25.cs.kobe-u.ac.jp/˜ohkawa/

keywords: graph sequence, graph mining, correlation mining, closed pattern Summary

In this paper, as one of the pattern mining in dynamic graphs, we focus on the problem of finding frequent, closed and correlated patterns of graph sequences in a long sequence of graphs. To solve this problem efficiently, an algorithm named CHPSS is proposed which effectively utilizes the generality ordering and the properties of cor-relation and closedness. Through the preliminary experiments with synthetic and real datasets, the effectiveness of CHPSS is confirmed.

1. Introduction

Graph-structured data is widely used in many applica-tion domains such as bioinformatics, social networks, and so on. While several entity relations can be represented in graphs, all relations are not always static. For example, metabolic pathways are changed as the evolution of or-ganisms, and the roles of an user in a social network will change over time. These kinds of the dynamics of graphs attracts much attention and several research for helping the understanding of dynamic aspects of changing graphs have been reported recently [Borgwardt 06, You 08, Sun 07, Inokuchi 08].

In this paper, as one of the pattern mining in dynamic graphs, we tackle the problem of finding recurrent patterns in the form of graph sequence with wild cards from a long graph sequence. More concretely, to avoid the generation of a huge number of meaningless patterns, we discuss an algorithm for mining frequent, closed and mutually corre-lated patterns in a graph sequence.

2. Preliminaries

A graph g = (Vg, Eg, lg)on a set of labelsL consists of a vertex set Vg, an edge set Eg, and a labeling function lg: Vg∪ Eg→ L which assigns a label to each vertex or edge. As a pattern to be extracted, we consider a sequence of graphs with wild cards in the form of

P =g0,∗, ···, ∗, g1,∗, ···, ∗, g2,···, g_n−1∗, ···, ∗, gn (n≥ 1) where gi(i = 0,···,n) and ‘∗’ denote a graph and a wild card, respectively. |P | denotes the length of P (including wild cards). Given a pattern P , regardless of whether P [i] is a graph or a wild card, P [i] denotes the ith element in P (P [g0] = 1and P [gn] =|P |). pos(P,g)

rep-resents the position of g in P , i.e. pos(P, g) = i for g such that P [i] = g. We give the upper bound ω of the num-ber of repeat of wild card. More formally, pos(P, gi)− pos(P, g_i−1)≤ ω (i = 1,···,n) must hold. The set of all patterns satisfying this condition is denoted asGω. A pat-tern obtained by the concatenation of PB after PAis de-noted as PA· PB.

For graphs x = (Vx, Ex, lx)and y = (Vy, Ey, ly), x y denotes the existence of an injective function f : Vx→ Vy such that (1)∀u ∈ Vx[lx(u) = ly(f (u))]and (2)∀(u,v) ∈ Ex[(f (u), f (v))∈ Ey∧ lx(u, v) = ly(f (u), f (v))]. If x yholds, then we said x is more general than y. A pattern P is more general than a pattern Q, denoted as P Q, if ∃j s.t. ∀(i = 1,···,|P |)[P [i] = ∗ ∨ P [i] Q[j + i − 1]] is true.

LetD = d1, d2,···,dL be a sequence of graphs. For a

pattern P andD, the support σD(P )and the h-confidence θ_D(P )of P inD are formally defined as follows:

σ_D(P ) =|O_D(P )|/|D| where O_D(P ) ={j + |P | − 1 | 1 ≤ j ≤ |D|, P D[j],D[j + 1],···,D[j + |P | − 1]}

(2)

2 DMSS2008

θ_D(P ) = σ_D(P )/M_D(P ) where

M_D(P ) = max

i=1,···,|P |, P [i]=∗(σD(P [i]))

θ_D(P )is a natural extension of the h-confidence for an itemset[Xiong 06]. It corresponds to the minimum condi-tional probability of the appearance of a pattern P given a graph P [i]. Thus, this measure can be considered to repre-sent the mutual dependence among graphs w.r.t. a pattern. GivenD and a user defined threshold σ > 0, a pattern P is said to be frequent if σD(P )≥ σ holds. As similar, given a threshold θ≥ 0, a pattern P is said to be correlated if θ_D(P )≥ θ holds. We consider frequent and correlated patterns are interesting and significant.

Given two patterns Q and P ( Q), P can be regarded as redundant if P and Q have the same support and the same h-confidence. To avoid such redundant patterns, we consider a closed pattern which is a maximal pattern w.r.t. within the set of patterns having the same support and the same h-confidence. More formally, a pattern P is closed if ∃Q = P ∈ Gωs.t. σD(P ) = σD(Q), θD(P ) = θ_D(Q)and P Q.

Based on the above-mentioned preparations, we define our data mining problem. Given a sequence of graphsD and three thresholds ω≥ 0, σ(0 < σ ≤ 1), and θ(0 ≤ θ ≤ 1), then our purpose is to find all the frequent, closed and correlated patterns of graph sequences P∈ Gω fromD. For the sake of simplicity, a frequent, closed and corre-lated graph sequence is referred to as frequent closed hy-perclique graph sequences (FCCGSs in short).

3. Discovery of Frequent Closed Hyperclique

Graph Sequences

3·1 Properties of FCCGS

Two lemmas on support and h-confidence are introduced. [Lemma 1] Given two patterns P1and P2, if P1 P2

holds, then σ_D(P1)≥ σ_D(P2)also holds.

Proof Derived from the definition of support. 2

[Lemma 2] For two patterns P1= PA· PB and P2=

PA· PB, s.t. PB PB, the following inequality holds: up(P1, PA) = σD(P1)/M_D(PA)

≥ σD(P2)/M_D(P2) = θ_D(P2)

Proof σD(P1)≥ σ_D(P2)holds by lemma1 and P1

P2. Because of P2= PA· P_B , the inequality M_D(PA)≤

M_D(P2)holds. Thus, up(P1, PA)≥ θD(P2). 2

The properties on closedness in FCCGS are discussed. [Lemma 3] For two patterns P and Q, if OD(P ) = O_D(Q), θ_D(P ) = θ_D(Q), and P Q hold, denoted as P≡OQ, then σD(P) = σD(Q)and θD(P) = θD(Q) hold for P= P· α and Q= Q· α.

Proof From OD(P ) = OD(Q), σD(P) = σD(Q) holds. M_D(P ) = M_D(Q) and M_D(P) = M_D(Q) are derived by σ_D(P) = σ_D(Q)and θ_D(P ) = θ_D(Q). Thus,

θ_D(P) = θ_D(Q) 2

By this lemma, if there exist patterns P and Q s.t. P ≡O Q, no pattern in the form of P· α needs to be considered. [Lemma 4] Given two patterns P = Pp· Psand Q = Qp· Ps, if OD(P ) = OD(Q), MD(Pp) = MD(Qp), and Pp Qp hold, denoted as ≡ (P,Pp, Q), then σD(P) = σ_D(Q)and θ_D(P) = θ_D(Q)hold for P= Pp· Ps and Q= Qp· Pswhere Ps Ps.

Proof By the conditions, OD(P) = OD(Q)is ob-tained. This guarantees σ_D(P) = σ_D(Q). M_D(Pp) = M_D(Qp) derives MD(P) = MD(Q). Thus, θD(P) =

θ_D(Q)holds. 2

Similar to the case of lemma3, if patterns P and Q sat-isfy the above conditions, we can avoid the consideration of patterns P= Pp· Ps.

3·2 An algorithm CHPSS for mining FCCGSs An algorithm CHPSS for mining all FCCGSs is pro-posed. It can be regarded as an extension of the algo-rithm HSG[Ozaki 08] for finding frequent and correlated sets of subgraphs in a graph database. Similar to HSG, an effective search with some powerful pruning is real-ized in CHPSS by constructing the search space as pre-fix trees of graphs and by considering the relationship be-tween the generality ordering and the properties of FC-CGS. We show CHPSS in Figure 1 and explain it in detail. In addition to D, ω, σ and θ, CHPSS takes a set of frequent closed subgraphs F which can be obtained by the conventional graph miners (with some postprocess-ing). Then, a prefix tree P T is constructed to store all ele-ments inF and a wild card. P T = (V,E,r) is a tree where V is a vertex set, E⊆ V × V is an edge set, and r is a root. Each node v∈ V except r contains a graph or a wild card. The element in v is denoted as g(v). An edge (u, v)∈ E represents the generality relation between g(u) and g(v). For u= v ∈ V , if g(u) g(v) and ∃s ∈ V \ {u,v} s.t. g(u) g(s) g(v), then an edge (u,v) is in E. E also includes edges (r, v) for v if g(v) =∗ or ∃s ∈ V \ {v} s.t. g(s) g(v).

In CHPSS, the search is performed by the double traver-sals of two prefix trees Ta and Tbin TraS and TraG. Ini-tially, the traversal is invoked with Ta= Tb= r(line 3 in CHPSS). During the traversal in Ta with a pattern GS by TraS, each time a node va is visited, we consider a pat-tern P = GS· g(va) (line 2 in TraS). By invoking the traversal in Tb by TraG for this pattern (line 3 in TraS), all patterns in the form of P· g(vb) will be considered

(3)

Discovery of Closed Hyperclique Patterns in a Sequence of Graphs 3

Table 1 Statistics of datasets

|D| Va Ea |V | |E|

G1 1000 12.0 21.1 20 20

G2 5000 12.8 24.5 20 20

Email 795 14.8 12.5 10 1

|D|: # of transactions. Va,Ea: average number of vertices and edges per graph. |V |, |E|: # of distinct labels of vertices and edges. (line 2 in TraG). During the execution of TraG, two new prefix trees N Ta and N Tb will be constructed. N Ta is used for mining the patterns in the form of P= P· P s.t. |P| ≥ 2 by the new double traversals with P , NTa and r (line 4 in TraS). N Tb is used for considering the patterns GS· g(va) · α where g(va) g(va)by the re-cursive call of TraS (line 5 in TraS).

In the procedure TraG, several prunings are realized. The first pruning (line 4) is based on lemma1. Let pat= GS· α be a pattern s.t. pat pat. We need not to enu-merate pat, because σD(pat)≤ σD(pat) < σ. The sec-ond pruning (line 5) is based on θ_D(pat) < θby lemma2. The third pruning (line 6) comes from lemma4. There ex-ists a pattern which is more specific than pat and whose support and h-confidence are the same as those of pat. The admissibility of the fourth and fifth prunings (line 7) is guaranteed by lemma2 and lemma3, respectively. By not adding pat to N Ta, the patterns pat· α can be avoided. The sixth and seventh prunings (line 9) are based on lemma2 and lemma4. In this case, no pattern GS· g(va), g(v_b) s.t. g(vb) g(v_b)need to be considered. Note that, the check of the closedness (line 6 in TraS) as well as the conditions for the pruning(3), (5) and (7) can be realized by the similar technique for computing the ‘trans-action matching’ in the closed pattern mining [Chi 05].

Although no formal proof is given, because of the com-plete and non-duplicate enumeration by the double traver-sal and the admissible prunings based on the lemmas, CH-PSS can find all FCCGSs without duplications.

4. Experiments

Preliminary experiments are conducted on a PC (Intel(R) Core2Quad 2.4GHz with 4GB memory). A prototype of CHPSS is implemented in Java. Although not mentioned due to the space limitation, some other prunings based on the ‘occurrence matching’[Chi 05] are incorporated.

In the experiments, two synthetic datasets G1 and G2

generated by [Cheng 06] are used. In addition, a dataset Emailis prepared by extracting the email traffics based on a certain condition from ENRON email data[Klimt 04]. The statistics of datasets are summarized in Table 1.

The experimental results are shown in Table 2. As shown

in Table 2, FCCGSs are obtained within a reasonable com-putation time in all cases. It can be confirmed that both of the execution time and the number of candidate patterns increase as σ and θ decrease. In addition, similar results are confirmed when the wild card is introduced.

Despite of the decrease of σ, the obtained FCCGS hardly changes. On the other hand, the number of obtained pat-terns increases with the decrease of θ. The reason seems to be that low θ permits patterns to contain graphs having high support value. While it depends on the character of datasets strongly, the influence of θ might be larger than σ in the determination of FCCGS.

Compared with the naive algorithm which employs the pruning (1), (4) and (5), both of the execution time and the number of candidate patterns decrease greatly. These results are strong evidences to show the effectiveness of the pruning techniques.

5. Conclusion

In this paper, we consider the data mining problem of finding frequent closed hyperclique graph sequences in a sequence of graphs and propose an algorithm CHPSS.

For future work, (1)further experiments with large-scale real datasets and (2)detailed comparisons with related frame-works (e.g.[Inokuchi 08, You 08]) are necessary.

♦ References ♦

[Borgwardt 06] Borgwardt, K. M., Kriegel, H.-P., and Wacker-sreuther, P.: Pattern Mining in Frequent Dynamic Subgraphs, in Proc.

of the 6th International Conference on Data Mining, pp.818–822

(2006)

[Cheng 06] Cheng, J., Ke, Y., and Ng., W.: GraphGen: A graph syn-thetic generator, http://www.cse.ust.hk/graphgen/ (2006)

[Chi 05] Chi, Y., Xia, Y., Yang, Y., and Muntz, R. R.: Mining Closed and Maximal Frequent Subtrees from Databases of Labeled Rooted Trees, IEEE Transactions on Knowledge and Data Engineering, Vol. 17, No. 2, pp. 190–202 (2005)

[Inokuchi 08] Inokuchi, A. and Washio, T.: A Fast Method to Enu-merate Frequent Changing Patterns from Grahp Sequences, in

SIG-DMSM-A703-05, pp. 27–34 (2008) (in Japanese)

[Klimt 04] Klimt, B. and Yang. Y.: The Enron Corpus: A New Dataset for Email Classification Research, in Proc. of the 15th

Eu-ropean Conference on Machine Learning, pp. 217–226 (2004)

[Ozaki 08] Ozaki, T. and Ohkawa, T.: Mining Correlated Subgraphs in Graph Databases, in Proc. of the 12th Pacific-Asia Conference on

Knowledge Discovery and Data Mining, pp. 272–283 (2008)

[Sun 07] Sun, J., Faloutsos, C., Papadimitriou, S., and Yu, P. S.: GraphScope: parameter-free mining of large time-evolving graphs, in

Proc. of the 13th ACM SIGKDD International Conference on Knowl-edge Discovery and Data Mining, pp. 687–696 (2007)

[Xiong 06] Xiong, H., Tan, P.-N., and Kumar, V.: Hyperclique Pattern Discovery, Data Mining and Knowledge Discovery, Vol. 13, No. 2, pp. 219–242 (2006)

[You 08] You, hun C., Holder, L. B., and Cook, D. J.: Graph-based Temporal Mining of Metabolic Pathways with Microarray Data, in

Proc. of the 8th International Workshop on Data Mining in Bioinfor-matics, pp. 61–70 (2008)

(4)

4 DMSS2008

Algorithm CHPSS(F)

1: Construct a prefix tree P T fromF 2: r := root node of P T

3: TraS(, r, r)

Procedure TraS(GS, Ta, Tb) //preorder traversal in Ta 1: for each va∈ Ta’s children

2: N Ta:= new node; N Tb:= new node; P := GS· g(va)

3: TraG(GS, va, Tb, N Ta, N Tb) //specialize P by addition of a graph in Tb 4: TraS(P , N Ta, r) //search on new prefix tree

5: TraS(GS, va, N Tb) //recursive call for the traversal in Ta 6: if (P ∈ Gω ∧ P is closed) then output(P )

Procedure TraG(GS, va, Tb, N Ta, N Tb) //preorder traversal in Tb

1: for each vb∈ Tb’s child //check the candidate pat = GS· g(va), g(vb) 2: pat:= GS· g(va)· g(vb) //create a new pattern

3: if (∀(i = 0,···,ω)pat[|pat| − i] = ∗) then continue //violate the condition of wild card 4: if (σD(pat) < σ) then continue //pruning (1)

5: if (GS= ∧ up(pat,GS) < θ) then continue //pruning (2) 6: if (∃Q s.t. ≡ (pat,GS,Q)) then continue //pruning(3)

7: if ( θD(pat)≥ θ ∧ ∃Q s.t. pat ≡OQ) then //pruning(4) and (5)

8: C:= new node; g(C) := g(vb); add C to N Ta’s children; N Ta:= C //Update of N Ta 9: if ( up(pat, GS· g(va)) < θ ∨ ∃Q s.t. ≡ (pat,GS · g(va), Q)) then //pruning(6) and (7) 10: add vb(with all descendants) to N Tb’s children

11: else

12: C:= new node; g(C) := g(vb); add C to N Tb’s children; N Tb:= C //Update of N Tb 13: TraG(GS, va, vb, N Ta, N Tb) //recursive call for the travesal in Tb

Fig. 1 Pseudo code of CHPSS Table 2 Experimetal Results

σ ω P Time Cand. P Time Cand. P Time Cand.

G1 θ = 0.7 θ = 0.5 θ = 0.3 0.1 0 0 0.2 (61.8) 1.9 (55.5) 16 0.3 (66.8) 2.7 (57.4) 326 0.5 (47.0) 9.9 (57.6) 2 0 0.6 (65.9) 5.6 (55.6) 42 0.7 (41.8) 9.6 (46.9) 991 2.1 (27.3) 69.6 (51.5) 0.05 0 0 0.4 (46.0) 5.8 (35.7) 16 0.5 (51.8) 9.0 (46.3) 326 0.8 (31.2) 19.6 (32.4) 2 0 1.0 (37.5) 17.5 (35.9) 42 1.5 (33.2) 28.6 (39.6) 992 3.4 (14.8) 98.8 (21.9) 0.01 0 0 1.8 (16.6) 171.1 (28.3) 16 2.3 (17.8) 239.8 (34.0) 326 3.4 (9.9) 311.4 (16.0) 2 0 6.0 (14.1) 514.7 (28.5) 42 8.6 (12.8) 719.2 (27.7) 992 13.2 (4.3) 973.4 (7.0) G2 θ = 0.7 θ = 0.5 θ = 0.3 0.1 0 0 1.2 (61.4) 2.1 (51.6) 39 1.7 (54.2) 4.0 (51.9) 334 2.8 (34.5) 10.7 (45.1) 2 0 3.6 (58.2) 6.2 (51.4) 131 6.7 (28.8) 17.4 (36.1) 1005 16.8 (23.4) 72.5 (38.3) 0.05 0 0 2.1 (34.8) 9.9 (37.5) 39 3.2 (35.7) 15.7 (40.5) 334 5.2 (26.3) 25.8 (29.5) 2 0 7.2 (33.4) 29.7 (37.5) 131 12.6 (20.7) 52.9 (26.0) 1006 26.5 (14.4) 118.0 (18.9) 0.01 0 0 13.1 (13.6) 250.9 (24.9) 39 21.0 (15.4) 384.8 (27.3) 334 26.6 (8.4) 474.6 (13.8) 2 0 39.5 (10.9) 752.9 (24.9) 131 65.3 (7.3) 1160.7 (16.4) 1006 93.8 (3.4) 1466.2 (5.9) Email θ = 0.9 θ = 0.7 θ = 0.5 0.1 0 1 0.3 (55.8) 0.9 (45.3) 32 0.4 (33.4) 1.8 (47.6) 1095 2.5 (9.2) 16.8 (27.1) 1 2 0.5 (52.6) 1.8 (44.3) 106 0.9 (12.7) 5.3 (32.4) 25607 74.9 (4.5) 636.2 (22.1) 0.05 0 1 0.4 (54.4) 2.6 (36.2) 32 0.6 (27.8) 4.5 (38.9) 1097 2.8 ( 6.8) 21.0 (13.8) 1 2 0.6 (35.4) 4.8 (35.2) 106 1.1 ( 8.9) 10.1 (23.1) 25609 75.1 ( 3.0) 661.5 ( 9.3) 0.025 0 1 1.3 (39.5) 7.7 (28.6) 32 1.4 (19.0) 12.9 (32.8) 1099 3.8 (3.7) 32.6 (7.3) 1 2 2.3 (39.6) 13.8 (27.4) 106 3.1 (8.7) 25.1 (18.6) 25611 77.0 (1.2) 682.8 (3.4)

P: number of obtained FCCGS. Time: execution time afterF(a set of frequent closed subgraphs) is given (in second).

Cand.: number of candidate patterns generated during the search process (in thousand).

Number in parentheses : the ratio (in percentage) to the naive algorithm in which only the pruning (1), (4) and (5) are applied.