Feasibility of Graph Sequence Mining based on Admissibility Constraints

(1)

1

International Workshop on Data-Mining and Statistical Science (DMSS2008)

Feasibility of Graph Sequence Mining

based on Admissibility Constraints

Akihiro Inokuchi

The Institute of Scientific and Industrial Research, Osaka University

[email protected], http://www.ar.sanken.osaka-u.ac.jp/˜inokuchi/

Takashi Washio

The Institute of Scientific and Industrial Research, Osaka University

[email protected], http://www.ar.sanken.osaka-u.ac.jp/˜washio/washprjp.html

keywords: graph mining, graph sequence, frequent transformation subsequence Summary

In recent years, the mining of frequent subgraphs from labeled graph data has been extensively studied. How-ever, to our best knowledge, almost no methods have been proposed to find frequent subsequences of graphs from a set of graph sequences where the numbers of vertices and edges increase or decrease. In this paper, we define a novel class of graph subsequences by introducing axiomatic rules of graph transformation, their admissibility constraints and a union graph. Then we propose an efficient approach named “GTRACE” to enumerate frequent transformation subsequences (FTSs) of graphs from a given set of graph sequences. Its fundamental performance has been evaluated by using artificial datasets, and its practicality has been confirmed through the experiments using real world datasets.

1. Introduction

In recent years, the studies of data mining have estab-lished many approaches to find characteristic patterns from various structured data. Sequential Pattern Mining such as AprioriSome [Agrawal 95] and PrefixSpan [Pei 01] is to efficiently find a complete set of subsequences which appear more frequently than a threshold in a given set of itemset sequences. These approaches are limited to mine total order relations among itemsets each of which repre-sents co-occurrence of events. In contrast, WINEPI and MINEPI [Mannila 95], which are more advanced Sequen-tial Pattern Mining, are to find frequent episodes repre-senting partial order relations among event occurrences. However, the relations are still limited to orders and do not cover any semantic and/or topological matters.

Graph Mining, which efficiently mines all subgraphs appearing more frequently than a given threshold from a set of graphs, focuses the topological relations among events. AGM [Inokuchi 00], gSpan [Yan 02], and Gas-ton [Nijssen 04] mine frequent subgraphs levelwisely start-ing from those of size 1 by usstart-ing the anti-monotonic prop-erty on the support values. Gaston is known to be the fastest algorithm. Its efficiency is from the use of the sparsity of graphs in most of real-world applications. Al-though major algorithms of Graph Mining are practically quite efficient, they need much computation time to mine complex frequent subgraphs due to the NP-completeness

of subgraph isomorphism matching [Garey 79]. Accord-ingly, these conventional methods are not well applicable to more complex graphs such as graph sequences.

However, there are many real-world applications suit-able to model objects by using graph sequences. For ex-ample, a human network is represented by a graph where each human and each relationship between two humans correspond to a vertex and an edge, respectively. If a person joins or leaves a human community, the numbers of vertices and edges in the graph increase or decrease. Similarly, a gene network consisting of genes and their interactions produces a graph sequence in their evolution-ary history. Most of these graphs in the real world are known to have long tail distributions of vertex degrees where most of the vertices have a few edges [Barab´asi 99], and thus they are sparse. We focus the topological changes in the sequences of sparse graphs, because in many cases their changes come from some underlying mechanisms re-stricted by the topology such as distances (dissimilarities) between vertices and information traveling on the graphs.

2. Graph Sequence

Figure 1 (a) shows an example of an observed graph sequence. An observed graph sequence is represented as d =g(1) _g(2)_···g(n)_{, where the superscript integer of} each g(j)_{represents the ordered step of the observation.} We introduce two practical Assumptions. One is that “the

(2)

2 DMSS2008

Fig. 1 Examples of a graph sequence and a mined graph subsequence.

change is gradual”, i.e., only a small part of the structure changes and the other part remains unchanged between two successive graphs g(j)and g(j+1). The other is “the sparsity of every graph g(j)”. In the aforementioned hu-man network and the gene network, these assumptions well hold, since most of the changes of the vertices are progressive over the steps, and the vertices are not very densely coupled one another at every instant.

A compact representation of graph sequences should be introduced in order to reduce both computational cost and spatial cost to mine the frequent transformation sub-sequences (FTSs). The direct representation of the graph sequence is not compact, because many parts of a graph unchanged over multiple steps is redundant in the repre-sentation. Accordingly, we propose the following novel graph grammatical framework adapted to easily and com-pactly describe a graph sequence.

3. GTRACE

We propose a novel representation of graph sequences by using six transformation rules (TRs) of insertion, dele-tion, and relabeling of vertices and edges to compactly represent the graph sequences by focusing differences be-tween two successive graphs in each graph sequence ac-cording to Assumption of the gradual changes. A given set of graph sequences are compiled to a set of sequences of TRs, called transformation sequences. In addition, we propose an algorithm to mine a complete set of frequent subsequences, called “frequent transformation subsequences (FTSs)”, embedded in the set of sequences of TRs. Be-cause the transformation rules follow their associated ad-missibility constraints, our algorithm can efficiently mine a complete set of FTSs from a set of sequences of TRs by combining the constraints with the conventional Sequen-tial Pattern Mining method. Furthermore, we propose an algorithm to mine FTSs corresponding to “relevant” graph subsequences based on “union graphs” which represent re-lations among relevant vertices in graph sequences. This is to focus on graph subsequences consisting of mutually relevant vertices and edges only, since our interest is lim-ited to the relations among relevant persons, events, phe-nomena and so on in many practical cases. Our proposed approach based on these algorithms is called “GTRACE (Graph TRAnsformation sequenCE mining)”.

Fig. 2 Conversion from a graph sequence to a graph.

Table 1 Parameters of test data and their default values. Definition Default values probability of vertex and edge insertions

in transformation sequences p = 80% average number of unique IDs

in transformation sequences |Vavg| = 7

probability of vertex and edge insertions

in embedded FTSs p= 50% average number of unique IDs in embedded FTSs |Vavg | = 5

number of vertex labels |Lv| = 10

number of edge labels |Le| = 1

number of embedded FTSs N = 10 number of transformation sequences |DB| = 1,000 minimum support threshold σ= 30%

4. Experiment

GTRACE was implemented by C++. HP xw4400 with Windows XP was used for this experiments where Intel Core 2 6700 2.66 GHz and 2 GB of main memory are installed. The performance of GTRACE is evaluated for both artificial and real world graph sequence data. Be-cause any other established approach comparable with our frequent graph subsequence mining is not available, we set up a mining task which is functionally similar within the conventional Graph Mining for the comparison. Each graph sequence d= g(1)···g(n) ∈ DB is converted to the following graph g as shown in Figure 2. Each unique ID u, if it appears as a vertex having a label l in g(j), is represented by a vertex having a label l:P (Presence) in the j-th column of g as shown in Figure 2 (b), otherwise

u is represented by a vertex having a label A (Absence).

If there is an edge between two vertices in g(j), an edge is placed between the corresponding vertices in the j-th col-umn of g. Finally, all vertices corresponding to an identi-cal ID u are mutually connected by edges to form a clique. This is to represent the identity of vertices in the graph. Each clique is indicated by a rectangle in Figure 2 (b). The Graph Mining on this data extracts the information of the presence/absence of each vertex with their associated edges over the sequential steps. We applied Gaston [Ni-jssen 04] to this task, since it is known to be the fastest algorithm of Graph Mining.

4·1 Artificial Datasets

We compared the performance of GTRACE with Gas-ton by using artificial datasets generated by the parame-ters listed in Table 1. First, we grew each transformation

(3)

Feasibility of Graph Sequence Mining based on Admissibility Constraints 3

Table 2 Results for|Vavg|, p, |Lv|, and σ.

|Vavg|, p 6 10 35 95% 80% 10% avg. length 12.1 22.2 82.1 14.3 15.7 30.2 GT comp. time 0.27 0.70 224 0.38 0.42 0.59 # of FTSs 15 84 2622 44 50 133 comp./FTS 0.018 0.0084 0.086 0.0085 0.0085 0.0045 Ga comp. time 8.05 – – 119.0 416.5 – # of FTSs 600 – – 7289 23594 – comp./FTS 0.013 – – 0.016 0.018 – |Lv|, σ 1 9 10 11 0.3% 35% avg. length 15.7 15.7 15.7 15.7 15.7 15.7 GT comp. time 0.56 0.38 0.42 0.39 14.7 0.13 # of FTSs 189 47 50 53 160658 36 comp./FTS 0.0030 0.0080 0.0085 0.0074 9.1E-5 3.2E-3 Ga comp. time – – 416.5 250.2 – 124.89 # of FTSs – – 23594 13737 – 5292 comp./FTS – – 0.018 0.018 – 2.3E-2 avg. length: average length of seq(d), GT: GTRACE, Ga: Gaston,

comp. time: computation time, comp./FTSs: comp. time per FTS, # of FTSs: the number of mined FTSs or frequent subgraphs

sequence seq(d) up to having |Vavg| unique IDs in the

av-erage by inserting vertices and edges with probability p at every step. Accordingly, if p is small or|Vavg| is large, the

finally produced sequence, i.e., the length of the generated transformation sequence, is long. This process is contin-ued until the sequence becomes relevant by the increase of the edges while keeping the sparsity of each graph. Typ-ically, the average probability of edge existence between two vertices in a graph was 13% under the default val-ues of the parameters. This value remains low over all test data. Similarly, we generated N relevant FTSs hav-ing|V_avg | unique IDs in the average under the probability

p. Then, we generated DB where each transformation sequence was overlaid by each relevant FTSs under the probability1/N. The overlay was conducted in the graph sequence domain to ensure the topologically proper over-lay. Each relevant transformation sequence contains|Lv|

vertex labels and|Le| edge labels, respectively.

Table 2 shows the computation times, the numbers of derived FTSs, and the average computation times to derive a FTS under various values of|Vavg|, p, |Lv|, and a

min-imum support threshold σ, respectively, while the other parameters are set at their default values. The support value of a FTS is defined as the fraction of total graph se-quences similar to the conventional frequent pattern min-ing. Though we designed the mining task of Gaston to make similar to the task of GTRACE as much as we can, the numbers of FTSs in their solutions are very different. Accordingly, we also use the computation time per FTS for the comparison. The results indicated by “–” in the ta-ble were not obtained due to intractata-ble computation time over one hour. The upper half of the table indicates that the computation time of both GTRACE and Gaston are expo-nential to the average length of seq(d) provided by the set-tings of|Vavg| and p. The main reason of the computation

time increase along the average length is considered to be the increase of the numbers of FTSs in both cases, because the computation times per FTS do not significantly vary over the conditions. However, the far better efficiency of GTRACE than Gaston’s application is confirmed in terms

Table 3 Results for Enron dataset.

# of persons|V | 20 40 60 80 90 avg. len. (5 days) 9.7 24.9 35.5 65.5 84.3 GT5 comp. time 0.031 0.093 0.13 1.20 36.1 # of FTSs 3 10 18 109 420 comp./FTS 0.010 0.0093 0.007 0.011 0.086 Ga5 comp. time 3.2 6.6 10.3 48.8 –

# of FTSs 84 310 481 1463 – comp./FTS 0.038 0.021 0.022 0.033 – avg. len. (6 days) 11.8 29.4 42.0 77.3 100.1 GT6 comp. time 0.031 0.093 0.14 1.5 303.5 # of FTSs 6 27 48 278 1734 comp./FTS 0.0052 0.0034 0.0029 0.0053 0.18 Ga6 comp. time 59.9 – – – –

# of FTSs 500 – – – – comp./FTS 0.12 – – – – min. sup. σ 15% 16% 18% 20% 30% GT5 comp. time 6359 4167 1504 668 7.7 # of FTSs 42961 30250 16019 7483 498 comp./FTS 0.15 0.14 0.094 0.089 0.015 Default: minimum support σ=50%, the number of vertex labels |Lv| = 3,

the number of persons|V | = 50, GT5: GTRACE (5 days graphs), GT6: GTRACE (6 days graphs), Ga5: Gaston (5 days graphs)

Table 4 Results for MIT dataset.

# of persons|V | 5 7 10 15 20 avg. len. (5 days) 6.5 11.3 19.5 40.5 62.2 GT5 comp. time 0.016 0.031 0.047 9.02 1477.5

# of FTSs 5 18 100 1479 11725 comp./FTS 0.0032 0.0017 0.00047 0.0061 0.13 Ga5 comp. time 0.14 32.03 – – –

# of FTSs 90 5596 – – –

comp./FTS 0.0016 0.0057 – – – avg. len. (6 days) 7.5 13.1 22.6 47.7 73.0 GT6 comp. time 0.016 0.032 0.109 104.0 –

# of FTSs 9 23 195 4196 – comp./FTS 0.0018 0.0014 0.00056 0.025 – Ga6 comp. time 2.92 – – – –

# of FTSs 435 – – – –

comp./FTS 0.0067 – – – – min. sup. σ 5% 10% 15% 20% 40% GT5 comp. time 63.25 8.08 2.41 0.89 0.093

# of FTSs 724094 47863 11201 3841 263 comp./FTS 8.7E-5 1.7E-4 2.1E-4 2.3E-4 3.5E-4 Default: minimum support σ=50%, the number of vertex labels |Lv| = 3,

the number of persons|V | = 10, GT5: GTRACE (5 days graphs),

GT6: GTRACE (6 days graphs), Ga5: Gaston (5 days graphs)

of both the computation times per FTS and the number of focused FTSs.

The lower left part in Table 2 shows the effect of the number of labels on the efficiency. When |Lv| is small,

many subgraphs included in the graph g as depicted in Figure 2 (b) are isomorphic each other, and thus the com-putation time of Gaston is enormous. In contrast, that of GTRACE remains small since|Lv| does not affect the

length of seq(d). The lower right part in Table 2 shows that GTRACE is tractable even under the low minimum support threshold because it involves only the Graph Min-ing of small union subgraphs to ensure the relevance of FTSs and the pattern growth based Sequential Pattern Ming for the FTS minMing, whereas Gaston’s application in-cludes the Graph Mining of large graphs as depicted in Figure 2 (b). The good scalability of GTRACE is indi-cated in every part of Table 2 since its computation times per FTS remain almost same or is even significantly re-duced in some conditions.

4·2 Real World Datasets

To assess the practicality of GTRACE, we apply it to two real world datasets. One is Enron Email Dataset [En-ron] and the other is phone call histories from Reality Min-ing Project at MIT [MIT]. In both data, we assigned a unique ID to each person participating the communica-tion, assigned an edge to a pair of persons if they

(4)

commu-4 DMSS2008

nicate via either email or phone call in a day, and obtained a daily graph g(j)_{. We further categorized the persons into} 3 labels according to their daily vertex degrees as high, moderate and low degrees. The person having a high de-gree label is considered to be a hub person in the organi-zation. We obtained a set of weekly graph sequence data, i.e., DB. The total numbers of weeks, i.e., number of se-quences, are 200 in Enron Email Dataset and 40 in MIT

Dataset. We randomly sampled|V |(= 1 ∼ 90) persons in

Enron Email Dataset and|V |(= 1 ∼ 20) persons in MIT

Dataset to form each DB.

Table 3 and Table 4 show the computation times (comp. time), the numbers of enumerated FTSs (# of FTSs), and computation times per FTS (comp./FTS) resulted under various numbers of unique IDs (persons)|V | and the

min-imum support σ for both data while the other

parame-ters are set at default values indicated at the bottoms of the tables. GT5 and Ga5 indicate that each sequence d in DBconsists of five graph steps g(j)_{from Monday to} Fri-day. Similarly, GT6 and Ga6 indicate the case of six graph steps g(j)_{from Monday to Saturday. The average length} (avg. len.) stands for the average number of TRs in trans-formation sequences. In case of the required computation times over one hour, the results are indicated by “–”.

Upper parts in these tables show superior scalability of GTRACE with regard to the computation time and com-putation time per FTS. In contrast, Gaston’s application is not tractable, because many persons communicate with someone only on a few days in the data. This produces many unique IDs labeled as A (Absence) in each rele-vant graph sequence, and the cliques made of these absent unique IDs produce vast number of spurious frequent sub-graphs. The computation times required in MIT Dataset are far larger in spite of its smaller numbers of persons. This is because the daily graphs in MIT Dataset are denser than those of Enron Email Dataset, since MIT Dataset is from phone calls within a tightly coupled community. Lower parts of the tables show the practical scalability of GTRACE with regard to the minimum support threshold.

5. Discussion and Conclusion

Recently, Borgwardt et. al proposed a method to mine frequent patterns from a set of graph sequences where only the number of edges increase or decrease, but the number of vertices does not, and edges and vertices are not rela-beled [Borgwardt 06]. Since the existence of an edge be-tween two vertices in a graph sequence can be represented by a binary string and the binary string can be treated as a label of an edge in a graph, the problem to mine the

fre-quent patterns in their graph sequences is tractable by the conventional Graph Mining. However, the method can not be applied to graph sequences where the number of edges in the graph increase or decrease such as human commu-nities and gene networks as mentioned in Section 1.

The study of Temporal Logic is a relevant topic to de-scribe time sequence events [Venema 01]. Some recent data mining study has introduced this framework to ana-lyze time sequence data [Ho 07]. The temporal logic is to describe the relations among times and time intervals when events hold. Our TRs can be seen as a set of simple temporal operators. The introduction of advanced tem-poral logics may enhance the expressiveness of the graph subsequences. Another possibility to extend the expres-siveness is the use of Graph Grammar [Jeltsch 90]. Graph Grammar is a grammar which transforms a graph to an-other graph. However, in both cases, the computational tractability is a big issue.

In this paper, we proposed a novel approach to mine a complete set of relevant FTSs from given graph sequences. We developed a graph sequence mining program GTRACE, and confirmed its efficiency and practicability through com-putational experiments using artificial and real world datasets.

♦ References ♦

[Agrawal 95] R. Agrawal, R. Srikant. Mining Sequential Patterns. Proc. of Int’l Conf. on Data Engineering, pp. 3–14, (1995) [Barab´asi 99] A. L. Barab´asi and R. Albert. Emergence of scaling in

random networks. Science, 286, pp. 509–512, (1999)

[Borgwardt 06] K M. Borgwardt, et. al. Pattern Mining in Frequent Dynamic Subgraphs. Proc. of Int’l Conf. on Data Mining, pp. 818– 822. (2006)

[Enron] Enron Email Dataset, http://www.cs.cmu.edu/˜enron/ [Garey 79] M. Garey and D. Johnson. Computers and Intractability:

A Guide to the Theory of NP-Completeness, W.H. Freeman. (1979) [Ho 07] T.B. Ho, et al. Exploiting Temporal Relations in Mining

Hep-atitis Data, New Generation Computing, Vol. 25, pp. 247–262. (2007) [Inokuchi 00] A. Inokuchi, T. Washio, and H. Motoda. An Apriori-based Algorithm for Mining Frequent Substructures from Graph Data. Proc. of European Conf. on Principles of Data Mining and Knowledge Discovery, pp. 13–23. (2000)

[Jeltsch 90] E. Jeltsch and H. Kreowski. Grammatical Inference Based on Hyperedge Replacement. Graph-Grammars. Lecture Notes in Computer Science 532, pp. 461–474. (1990)

[Mannila 95] H. Mannila, et. al. Discovering Frequent Episodes in Sequences. Proc. of Int’l Conf. on Knowledge Discovery and Data Mining, pp. 210–215. (1995)

[MIT] MIT Media Lab: Reality Mining,

http://reality.media.mit.edu/

[Nijssen 04] S. Nijssen and J.N. Kok. A Quickstart in Frequent Struc-ture Mining can Make a Difference. Proc. of Int’l Conf. on Knowl-edge Discovery and Data Mining, pp. 647–652. (2004)

[Pei 01] J. Pei, et. al. PrefixSpan: Mining Sequential Patterns by Prefix-Projected Growth, Proc. of Int’l Conf. on Data Engineering, pp. 2–6. (2001)

[Venema 01] Y. Venema. Temporal Logic, Goble, Lou, ed., The Blackwell Guide to Philosophical Logic. Blackwell (2001) [Yan 02] X. Yan & J. Han. gSpan: Graph-Based Substructure Pattern

Mining. Proc. of Int’l Conf. on Data Mining, pp. 721–724. (2002)