情報知識ネットワーク特論 Data Mining 5:
Tree & Graph Mining algorithms
有村 博紀 Hiroki Arimura
北海道大学大学院 情報科学研究科情報理工学専攻 Division of CS, Grad. School of Inf. Sci. & Tech. Hokkaido University email: {arim,kida}@ist.hokudai.ac.jp
Slide http://www-ikn.ist.hokudai.ac.jp/ikn-tokuron/
. Updated: 01 DEC 2015
今日の内容
5
回:
木とグラフのマイニング/ Mining of Trees and Graphs
l 半構造データ / Semi-structured data
l 頻出順序木マイニング /Frequent Ordered Tree Mining
• ラベルつき順序木(ordered trees)
• 最右拡張手法(Rightmost expansion)
• アルゴリズム:FREQT
l 頻出無順序木マイニング/ Unordered tree mining l グラフマイニング / Graph mining
ポイント
l 木の列挙 / Enumeration of Trees
半構造マイニングとは何か?
WHAT IS SEMI-STRUCTURED DATA MINING? (OR GRAPH &
TREE MINING)
PAKDD2008, 21 May 2008, Hiroki Arimura, Hokkaido University
7
Semi-structured Data Mining
l have emerged in 90’s with the rapid progress of computer and network technologies.
l Web pages, collections of XML documents.
l Biological databases (Entrez, Genbank, PubMED) l Sematic Web / Networks of Ubiquitous Knowledge
l Demands for efficient methods to discover knowledge from large semi-structured data l However, Semi-structured data are ....
l Huge, Heterogeneous collections of Weakly-structured data
l Traditional data mining technology cannot work!
l Data Mining for Sequence, Trees, and Graphs
<people>
<person age=“25” id=“608”>
<name>John</name>
<email>[email protected]</email>
</person>
<person id=“609”>
<name>Mary</name>
<tel>555-4567</tel>
</person>
</people>
email people
@age @id
60 8
name
#text
#text person
name
@id
60 9
tel
#text
#text person
2 5
n 555-4567
Mary
PAKDD2008, 21 May 2008, Hiroki Arimura, Hokkaido University
DEWS2003 浅井達哉 「半構造データマイニングに関する研究動向」 8
与えられたラベルつき木やグラフの集合から,
特徴的な部分グラフ(パターン)を発見すること
半構造データマイニングとは?
n
特徴的なパターンl 頻出パターン
多くの文書に共通して出現するパターン l 最適パターン
2種類の文書集合を,うまく区別するパターン
B P
#text FONT #text #text
@color @face #text
blue Times
Minsup = 5(%)
PAKDD2008, 21 May 2008, Hiroki Arimura, Hokkaido University
9
半構造データマイニングの応用
n
スキーマ発見n
情報抽出n
文書分類n
検索の効率化n
データ圧縮n
複雑な構造をもつデータからの知識発見l 化学物質
l 遺伝子ネットワーク l ネットワークログ
l インターネットのリンク構造 l 電子回路
半構造マイニングの歴史 HISTORY OF SEMI-
STRUCTURED DATA MINING
PAKDD2008, 21 May 2008, Hiroki Arimura, Hokkaido University
12
半構造データマイニングのさきがけ
n Holder, Cook, Djoko [KDD’94]
l グラフマイニングシステムSUBDUE
l すべての出現を取り除いて得られるグラフの大きさが 最小となるような部分グラフを発見(MDL原理)
l グラフの圧縮に応用
n Nestrov, Abiteboul, Motwani [SIGMOD’98]
l 論理プログラムを用いたグラフマイニング
l 入力グラフをできるだけ広く覆うパターンを計算 l 半構造データからのスキーマ発見に応用
PAKDD2008, 21 May 2008, Hiroki Arimura, Hokkaido University
DEWS2003 浅井達哉 「半構造データマイニングに関する研究動向」 13
n Wang & Liu [KDD’97; TKDE, 2000] , Cong, Yi, Liu, Wang [SDM’02]
l 木を経路の集合とみなして,頻出パス列を発見
l 各経路をアイテムとみなしたApriori風のアルゴリズム
n Miyahara, et al. [PAKDD’01 , PAKDD’02]
l グラフ変数つき順序木・無順序木パターンを発見 l 高速なマッチング手法+単純な生成テスト法
木構造データからの頻出パターン発見手法
X
Y
PAKDD2008, 21 May 2008, Hiroki Arimura, Hokkaido University
14
グラフ構造データからの頻出パターン発見手法
n Inokuchi, Washio, Motoda [PKDD’00]
l AGM: “An Apriori-based Algorithm for Mining Frequent Substructures from Graph Data”
n Kuramochi & Karypis [ICDM’01, ICDM’02]
l FSG: “Frequent Subgraph Discovery”
n Vanetik, Gudes, Shimony [ICDM’02]
l “Computing frequent graph patterns from Semi-structured data”
n Yan & Han [ICDM’02]
l gSpan: “gSpan: Graph-based substructure pattern mining”
H
H H X X
C C
PAKDD2008, 21 May 2008, Hiroki Arimura, Hokkaido University
DEWS2003 浅井達哉 「半構造データマイニングに関する研究動向」 15
頻出パス列発見手法 [Wang and Liu (KDD’97)]
email people
@age @id
608
name
#text
#text person
name
@id
609
tel
#text
#text person
25
John
555-4567
Mary
n 順序木をパスの列に分解
n 各パスをアイテムとみなして,Apriori風に頻出パス列を発 見
PAKDD2008, 21 May 2008, Hiroki Arimura, Hokkaido University
16
頻出パス列発見手法 [Wang and Liu (KDD’97)]
email people
@age @id
608
name
#text
#text person
name
@id
609
tel
#text
#text person
25
John
555-4567
Mary
person person person person person people people people people people people
@id person
name
#text person
頻出パス列 を発見 Apriori
@id name
#text person
再 構 造 化
n 順序木をパスの列に分解
n 各パスをアイテムとみなして,Apriori風に頻出パス列を発 見
パスのワイルドカードを扱えるよう
拡張 [Cong, Liu, Yi, Wang (SDM2002)]
PAKDD2008, 21 May 2008, Hiroki Arimura, Hokkaido University
DEWS2003 浅井達哉 「半構造データマイニングに関する研究動向」 17
頻出パス列発見手法 [Wang and Liu (KDD’97)]
この手法の問題点
l
兄弟が同一ラベルをもつ場合,木の枝分かれ構 造が失われるl Apriori
風に解候補の探索を行うため,効率が悪い
すべての頻出順序木を厳密に発見する 計算効率のよいアルゴリズムが必要
PAKDD2008, 21 May 2008, Hiroki Arimura, Hokkaido University
18
n FREQT [Asai et al. (SDM’02, PKDD’02)]
l 最右拡張を用いた効率のよい順序木の枚挙 l 最右葉出現の漸増的な更新
n Treeminer [Zaki (SIGKDD’02)]
l 効率のよい順序木の枚挙 l 我々の研究とは独立
頻出順序木パターンを発見する 効率のよいアルゴリズム
木構造データからの頻出パターン発見手法
• Efficient Substructure Discovery from Large Semi-structured Data, Tatsuya Asai, Kenji Abe, Shinji Kawasoe, Hiroki Arimura, Hiroshi Sakamoto, Setsuo Arikawa, Proc. Second SIAM International Conference on Data Mining 2002 (SDM‘02), 158-174, SIAM, 2002.
• Optimized Substructure Discovery for Semi-structured Data, Kenji Abe, Shinji Kawasoe, Tatsuya Asai, Hiroki Arimura, Setsuo Arikawa, Proc. 6th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD-2002), LNAI 2431, Springer, 1-14, August 2002.
• Mohammed J. Zaki, Efficiently mining frequent trees in a forest, KDD'02, ACM, 71-80, 2002. (Also appeared in: IEEE Trans. Knowl. Data Eng. 17(8), 1021-1035, 2005)
PAKDD2008, 21 May 2008, Hiroki Arimura, Hokkaido University
20
Efficient Algorithms for Mining
Frequent Ordered Trees from Semi- structured Data
Hiroki Arimura (Graduate School of IST, Hokkaido University, Japan)
Joint work with
Takeaki Uno (National Institute of Informatics) Tatsuya Asai* (Fujitsu Labs. Co. LTD.),
Shin-ichi Nakano (Gunma University)
Partly Supported by: Grant-in-aid for Scientific Researchs on Specially Promoted Research "Semi- structured Mining" and Priority Area “Information Explosion” (2007)
* Presently, working for Fujitsu Lab. Co. LTD.
PAKDD2008, 21 May 2008, Hiroki Arimura, Hokkaido University
Our Research Goal
n Knowledge discovery methods that
supports human’s discovery from large semi-structured data
n Key: Efficient semi-structured data
mining algorithm in theory and practice
l Having theoretical performance guarantee l Fast and Lightweight in practice
User
Semi-structured Data
Data Mining Engine
PAKDD2008, 21 May 2008, Hiroki Arimura, Hokkaido University
22
History of Semi-structured Data Mining
~1995 1996 1997 1998 1999 2000
2001
2002
2003
Algorithm for finding subgraphs by MDL principle Subdue [Holder et al. (KDD’94)]
Finding frequent paths [Wang and Liu (KDD’97)]
Finding Semi-structured Schema
[Nestrov, Abiteboul et al. (SIGMOD’98)]
Finding frequent subgraphs
AGM [Inokuchi, Wahio, Motoda (PKDD’00, MLJ. 2003)]
Finding frequent / optimal ordered trees
FREQT [Ours (SDM’02)],Treeminer [Zaki (KDD’02)]
Finding frequent subgraphs
gSpan [Yan and Han (ICDM’02)], VG [Venetik, et al.(ICDM’02)]
Finding frequent un-ordered trees
UNOT [Ours (SDM’03)],NK [Nijssen, Kok (MGTS’03)]
FSG [Kuramochi et al. (ICDM’01)
Frequent Maximal/Closed Trees & Graphs mining [Yan&Han '03; Termier et al.'04] and many algorithms in 2000s
PAKDD2008, 21 May 2008, Hiroki Arimura, Hokkaido University
23
Tree Mining
Association Rule, Frequent Itemsets
Simple
Efficient algorithms
Expressive
Computationally Hard General graphs
(AGM, FSG, gSpan)
Trees Development of
Efficient tree mining algorithms
Expressive patterns & Efficient algorithms
PAKDD2008, invited talk, 21 May 2008, Hiroki Arimura, Hokkaido University
24
The first question
when we strarted in 2001
How to define
a Tree Mining problem?
PAKDD2008, 21 May 2008, Hiroki Arimura, Hokkaido University
Itemset Mining Revisited
l
Frequent Itemset Mining
l Finding all "frequent" sets of elements (items) appearing no more than σ times in a given transaction data base [Agrawal, Srikant, VLDB'94]
l Most popular data mining problem and basis for others
1 2 3 4 5
t1 ○ ○
t2 ○ ○
t3 ○ ○ ○ ○
t4 ○ ○ ○
t5 ○ ○ ○
1 2 3 4 5
t1 ○ ○
t2 ○ ○
t3 ○ ○ ○ ○
t4 ○ ○ ○
t5 ○ ○ ○
database minsupσ= 2
The itemset lattice (2Σ, ⊆)
Frequent sets
Frequent sets
∅,
1, 2, 3, 4, 12, 13, 14, 23, 24, 124
X = {2, 4} appears three times, thus
frequent
people: [ person: {
age: “25”, id: “608”,
name: John, email: [email protected] }, person: {
id: “609”, name: Mary, tel: 555-4567 } ]
PAKDD2008, 21 May 2008, Hiroki Arimura, Hokkaido University
26
半構造データの例
/ Examples
<people>
<person age=“25” id=“608”>
<name>John</name>
<email>[email protected]</
email>
</person>
<person id=“609”>
<name>Mary</name>
<tel>555-4567</tel>
</person>
</people>
email people
@age @id
60 8
name
#text
#text person
name
@id
60 9
tel
#text
#text person
2 5
Joh
n 555-45
Mar 67
y
l HTML & XML documents l JSON texts
l Feature structures in NLP l Attribute graphs in Computer
Graphics
JSON text **
** JSON: Javascript Object Notation, www.json.org/
XML document *
* XML: Extended Markup Language, http://www.w3.org/TR/xml11/
people: [ person: {
age: “25”, idj: “608”,
name: John, email:[email protected] }, person: {
id: “609”, name: Mary, tel: 555-4567 } ]
PAKDD2008, 21 May 2008, Hiroki Arimura, Hokkaido University
DEWS2003 27
半構造データのモデル
n ラベルつき順序木 / Labeled Ordered Trees
l Each node has a label, which corresponds to:
• Markup tag
• Attribute & value
• Text string
l The children of a node are ordered from left to right by the sibling
relation
l Each node can have unbound number of children (unranked)
n Other class of semi-structured data
l ラベルつき無順序木 / Labeled ordered trees
l ラベルつきグラフ / Vertex-labeled and edge-labeled graphs
<people>
<person age=“25” id=“608”>
<name>John</name>
<email>[email protected]</
email>
</person>
<person id=“609”>
<name>Mary</name>
<tel>555-4567</tel>
</person>
</people>
email people
@age @id
60 8
name
#text
#text person
name
@id
60 9
tel
#text
#text person
2 5
Joh
n 555-45
Mar 67
y
n 半構造データの例 / Examples
l HTML & XML documents l JSON texts
l Feature structures in NLP l Attribute graphs in Computer
Graphics
PAKDD2008, 21 May 2008, Hiroki Arimura, Hokkaido University
29
n φ is 1-to-1.
n φ preserves parent-child relation.
n φ preserves (indirect) sibling relation.
n φ preserves labels.
Matching between trees
Pattern tree T matches a data tree D
(T occurs in D )
There is a matching
function φ from T into D .
r
C B A
B A
C B
data tree D
pattern tree T
1
2
3 4 5
6
7
8
9 10
11
A
C B
A
C
B A
C B
A
C
B A
C B
matching function φ
PAKDD2008, 21 May 2008, Hiroki Arimura, Hokkaido University
30
Frequency of a pattern tree
Root occurrence list OccD(T) = {2, 8}
• A root occurrence of pattern T:
• The node to which the root of T maps by a matching function
• The frequency fr(T) = #root occurrences of T in D
r
C
B A
C B
A
A C
B
B
P
1P
2A
C B
T D
1
2
3 4 5
6
7
8
9 10
11
PAKDD2008, 21 May 2008, Hiroki Arimura, Hokkaido University
31
Frequent Tree Mining Problem
n Given: a colection S of labeled ordered trees and a minimum frequency threshold s
n Task: Discover all frequent ordered trees in S (with frequency no less than s) without
duplicates
A minimum frequency threshold (min-sup) s = 50%
PAKDD2008, invited talk, 21 May 2008, Hiroki Arimura, Hokkaido University
32
The second question
How to efficiently solve
the Tree Mining problem?
33 PAKDD2008, 21
May 2008, Hiroki Arimura, Hokkaido University
Theoretical performance guanrantee
33
n Output-polynomial (OUT-POLY) l Total time = poly(N, M)
n polynomial-time enumeration, or amortized polynomial-delay (POLY- ENUM)
l Amotized delay is poly(Input), or l Total time = M·poly(N)
n polynomial-delay (POLY-DELAY) l Maximum of delay is poly(Input)
How to measure the time complexity of mining algorithms?
l Exponentially many answers!
l As enumeration algorithms
+ polynomial-space (POLY-SPACE)
Output size M
Delay D
Input
Input size M
Total Time T Algorithm
Outputs
Time
PAKDD2008, 21 May 2008, Hiroki Arimura, Hokkaido University
Itemset Mining Revisited
Backtrack algorithm [1997-1998]
• Depth-first search (DFS)
• Vertical data layout
1 2 3 4 5
t1 ○ ○
t2 ○ ○
t3 ○ ○ ○ ○
t4 ○ ○ ○
t5 ○ ○ ○
database
Apriori algorithm [1994]
• Breadth-first search (BFS)
• Horizontal data layout
• 2nd generation
• In-core algorithm
• Space efficient
• 1st generation
• External memory algorithm
PAKDD2008, 21 May 2008, Hiroki Arimura, Hokkaido University
35
Frequent Ordered Tree Mining
n How to enumerate all structured patterns without duplicates
n Rightmost expansion technique
n Efficient DFS Algorithm
l Freqt [Asai, Arimura, '02]
PAKDD2008, 21 May 2008, Hiroki Arimura, Hokkaido University
36
Efficent Algorithms
n Depth-first mining algorithm for frequent ordered trees
l Freqt [Asai, Arimura et al., SIAM DM'02]
l TreeMiner [Zaki, KDD'02]
n Performance guarantee
l Polynomial time enumeration per pattern l Small memory footprint
n Key technique: Rightmost expansion
PAKDD2008, 21 May 2008, Hiroki Arimura, Hokkaido University
37
Key: How to enumerate Ordered Trees without Duplicates?
n A naive algorithm
l Starting from a one-node tree
l Grow a pattern tree by adding a new node one by one
n Drawback
l Exponentially many different ways to generate a same pattern tree
l Explicit duplication test needed
PAKDD2008, 21 May 2008, Hiroki Arimura, Hokkaido University
38
演習
n 問題 1: サイズがk=0,1,2,3,4の順序木(ラベルなし)を すべて書き出せ.
n Problem 1: Print all (un-labeled) ordered trees of size k, k up to 4 (in the increasing order of their sizes)
n 問題2:入力kに対して,サイズがkの順序木(ラベルな し)をすべて書き出せをすべて出力するプログラムをか け.
n Problem 2 (Difficult!) Give a computer program that, given a number k >= 0, prints (enumerates) all
(unlabeled) ordered trees without duplicates.
PAKDD2008, 21 May 2008, Hiroki Arimura, Hokkaido University
39
Itemset mining revisited: BFS vs. DFS
Apriori algorithm [1994]
• Breadth-first search (BFS)
• Horizontal data layout
• 1st generation
• External memory algorithm
• 2nd generation
• Space efficient in- core algorithm
Backtrack algorithm [1997-2000]
• Depth-first search (DFS)
• Vertical data layout
• MaxMiner, Eclat, FP-growth etc.
1 2 3 4 5
t1 ○ ○
t2 ○ ○
t3 ○ ○ ○ ○
t4 ○ ○ ○
t5 ○ ○ ○
1 2 3 4 5
t1 ○ ○
t2 ○ ○
t3 ○ ○ ○ ○
t4 ○ ○ ○
t5 ○ ○ ○
database
How to
implement DFS for ordered tree
patterns?
PAKDD2008, 21 May 2008, Hiroki Arimura, Hokkaido University
40
An idea: Encode a tree as a sequence
n Encode an ordered tree T
l by its depth-label sequence ((d(v1), l(v1)) , … , (d(vk), l(vk)) in the preorder traversal of T
ordered tree T
A
B B
B A
C
C
0 1 2 3 depth
id 1 2 3 4 5 6 7 dep,lab 0A 1B 2A 3C 2B 1B 2C
code for T
PAKDD2008, 21 May 2008, Hiroki Arimura, Hokkaido University
41
Rightmost Expansion Technique for Ordered Trees
n Starting from a one node tree
n Grow the tree by
l Attaching a new node v
l to the rightmost branch
l to be the
youngest sibling
1
T
k-1l
p k
PAKDD2008, 21 May 2008, Hiroki Arimura, Hokkaido University
42
A family tree for frequent ordered trees
⊥
A
B
A A
A B
B A
B B
B
B A
B
B B
B B B B
B A B
A A
B
A B
infrequent
infrequent frequent
frequent
PAKDD2008, 21 May 2008, Hiroki Arimura, Hokkaido University
43
DFS for frequent ordered trees
n Enumarate all frequent ordered trees by backtracking n A family tree based on the rightmost expansion
⊥
A
B
A A
A B
B A
B B
B
B A
B
B B
B B B B
B A B
A A
B
A B
infrequent
infrequent frequent
frequent
PAKDD2008, 21 May 2008, Hiroki Arimura, Hokkaido University
44
How to incrementally compute the frequency of each pattern?
n Use RMO (rightmost-leaf occurrence)
Data tree D A
T
CRMO(T)
={5,12,14,19,22}
A
C
T’
B The rightmost expansion
RMO(T’)={6,17}
B r
C B A
C B
C C B
B A C
A A C
B
A C C A C
1
2
3 4
5 6
7 8 9
10 11 12
13 14
15
16 17
18 19 20 21
22
B RMO(T) B
RMO(T’)
Pattern
浅井達哉「半構造データマイニングに関する研究動向」 45 DEWS2003 浅井達哉「半構造データマイニングに関する研究動向」
FREQT の性能評価
n
Scalability
q Dataset: citeseers
q Minimum support: σ=3.0(%) fixed
q Increasing the data size from 0.3MB to 5.6MB.
0 0.5 1 1.5 2
0 50000 100000 150000 200000
Number of nodes in a data tree
Runtime (sec)
178,285ノードに対して 1.39秒
# of nodes Runtime
(sec)
最適パターン発見アルゴリズム
OPTT
[Abe et al. (PKDD ’ 02)]
n 最適パターン発見では、以下をみたすパターンを見つける
q positiveに頻出,negativeに非頻出
n positiveにもnegativeにも同程度で出現するパターンは 見つけない
n 木やグラフの分類に応用
Positive data Negative data
Pattern P
matched
unmatched
浅井達哉「半構造データマイニングに関する研究動向」 47 DEWS2003 浅井達哉「半構造データマイニングに関する研究動向」
最小化
• 分類誤差
• 情報エントロピー
• Gini 指標
データマイニング Optimized data mining (IBM DM group,1996 - 2000) 計算学習理論 Agnostic PAC learning (1990s)
統計学 Vapnik&Chervonenkis theory (1970s)
f(p): 不均衡関数
p: パターンが出現 する正例の割合
良いパターン 8 positives 2 negatives 悪いパターン
9 positives 9 negatives
50% 100%
0%
最適パターン発見問題
応用例:
FREQT
ウェブページからの頻出部分構造発見
<a href=“_”>
<font color=“#6F6F6F”> #text_1 </font>
</a>
<p> #text_2 <b>
#text_3 <!-- CITE-->
<font color=“green”> #text_4 </font>
#text_5 </b>
#text_6 <br /><br />
<font color=“#999999”> #text_7 <i> #text_8 </i>
#text_9 </font>
</p>
実験1:CGI生成された ウェブページからの反復 構造抽出
実験2:映画データベースからの 構造抽出
データ:IMDB (アクション映画×50) データサイズ:1.05 MB/102,493 ノード パターンサイズ:1218ノード(サポート50%) 深さ優先探索版FREQT
• スキーマ発見に有効
• DataGuide [Widom, Garcia-Molina et al. (VLDB’97)]
浅井達哉「半構造データマイニングに関する研究動向」 49 DEWS2003 浅井達哉「半構造データマイニングに関する研究動向」
応用例:
OPTT
XML
データからの最適パターン発見正例:アクション映画15タイトル 負例:ファミリー映画15タイトル
見つかった最適パターンの例
出現数:正例 10,負例 0
<movie>
<certification>
<certif> sweden:15 </certif>
</certification>
</movie>
出現数:正例 1,負例 12
<movie>
<title />
<genre> animation <genre>
</movie>
アルゴリズム
OPTT
• 文書分類に有効
Treeminer [Zaki (SIGKDD ’ 02)]
n 効率のよい頻出順序木パターン発見 アルゴリズム
n
FREQT
と類似q 順序木の枚挙法は,
FREQT
と同じ最右拡張q 木のマッチングは,
FREQT
と異なるn Treeminerでは,親子関係にギャップを許した マッチングを採用
n 我々の研究とは独立
浅井達哉「半構造データマイニングに関する研究動向」 51 DEWS2003 浅井達哉「半構造データマイニングに関する研究動向」
応用例:
ウェブログからのアクセスパターン発見
あるユーザのウェブアクセスログ
<userSession name=”ppp0-69.ank2.isbank.net.tr” …>
<uedge source=“5938” target=“16470” utime=“7:53:46”/>
<uedge source=“16470” target=“24754” utime=“7:56:13”/>
<uedge source=“16470” target=“24755” utime=“7:56:36”/>
<uedge source=“24755” target=“47387” utime=“7:57:14”/>
<uedge source=“24755” target=“47397” utime=“7:57:28”/>
<uedge source=“16470” target=“24756” utime=“7:58:30”/>
5938 16470 24754 24755
47387 47397
24756
ウェブ閲覧木 実験:
1009件のウェブアクセスログから
頻出パターンを発見 Path/sair_listesi.html
Path/poems/akgun_akova/index.html
Akova/picture.html Akova/contents.html Akova/biyografi.html
見つかった パターンの例
PAKDD2008, 21 May 2008, Hiroki Arimura, Hokkaido University
52
PAKDD2008, 21 May 2008, Hiroki Arimura, Hokkaido University
53
Frequent Unordered Tree Mining
n How to enumerate all structured patterns under isomorphism
n Canonical form technique
n Efficient DFS Algorithm
l Unot [Asai, Arimura, DS'03]
l NK [Nijssen, Kok, MGTS’03]
PAKDD2008, 21 May 2008, Hiroki Arimura, Hokkaido University
54
Previous Works:
Frequent Pattern Discovery for Graphs
Algorithms for discovering all the
frequent substructures from graphs
n AGM [Inokuchi, Washio, Motoda (PKDD’00)]
l Computing adjacent matrices for frequent subgraphs.
n FSG [Kuramochi, Karypis (ICDM’01)]
l Computing adjacent matrices for frequent subgraphs.
n [Vanetik, Gudes, Shimony (ICDM’02)]
l Joining frequent paths.
n gSpan [Yan, Han (ICDM’02)]
l Computing DFS trees for frequent subgraphs.
PAKDD2008, 21 May 2008, Hiroki Arimura, Hokkaido University
55
Efficient DFS mining algorithm for frequent unordered trees
n UNOT: an efficient algorithm that finds all the frequent labeled unordered trees in a given database.
l Efficient enumeration of unordered trees in O(1) time per tree without duplicate
l Incremental computation
A
T1 A T3 T4
B
B BB B B A A C C
C C
A A B
B BB B B A A C C C C
A A B
B BB B B AA
C C C
C A
T2 A B
B BB B
B AA C C
C
>
CPAKDD2008, 21 May 2008, Hiroki Arimura, Hokkaido University
56
Difficulties in Mining Unordered Trees
1. How to define unique representation
2. How to generate all the patterns without duplications
3. How to compute the occurrences
PAKDD2008, 21 May 2008, Hiroki Arimura, Hokkaido University
57
How to define the canonical representation ? Leftmost-biased Tree
n The code for an ordered tree T : the depth-label sequence in priorder traversal of T
n Code(T) = ((depth(v
1), label(v
1)) , … , (depth(v
k), label(v
k))
T
AB B
B A
C
C
Code(T) = (0A,1B,2A,3C,2B,1B,2C)
n The ordered tree T with
lexicographically maximum
code
PAKDD2008, 21 May 2008, Hiroki Arimura, Hokkaido University
58
Leftmost-biased Tree
T
1 AT
3T
4B B
B A
C
C
A
B B
B A
C C
A
B B
B A C C
n The ordered tree T with
lexicographically maximum code
T
2 AB B
B A C
C
(0A,1B,2A,3C,2B,1B,2C) (0A,1B,2B,2A,3C,1B,2C) (0A,1B,2C,1B,2A,3C,2B) (0A,1B,2C,1B,2B,2A,3C)
>
PAKDD2008, 21 May 2008, Hiroki Arimura, Hokkaido University
59
How to avoid duplications ? Rightmost Expansion
n We can prove that the rightmost expansion only generates
leftmost-biased trees only
Rightmost expansion
SAME
PAKDD2008, 21 May 2008, Hiroki Arimura, Hokkaido University
60
Frequent Unordered Tree Mining
n How to enumerate all structured patterns under isomorphism
n Canonical form technique
n Efficient DFS Algorithm
l Unot [Asai, Arimura, DS'03]
l NK [Nijssen, Kok, MGTS’03]
PAKDD2008, 21 May 2008, Hiroki Arimura, Hokkaido University
61
グラフ構造データからの
頻出パターン発見
PAKDD2008, 21 May 2008, Hiroki Arimura, Hokkaido University
62
AGM
[Inokuchi, Washio, Motoda (PKDD’00)]n AGM = Apriori-based Graph Mining
n
隣接行列表現を用いて,Apriori
風に頻出部分グラ フを発見するアルゴリズム
n
パターンの生成法l
隣接行列の正準形を効率よく計算Cl Cl
Cl C C H
Cl C Cl C Cl H Cl 0 1 0 0 0 0 C 1 0 1 2 0 0 Cl 0 1 0 0 0 0 C 0 2 0 0 1 1 Cl 0 0 0 1 0 0 H 0 0 0 1 0 0
Code = 101020000100010
隣接行列の正準形 行と列を入れ替えて Codeが最小となる ような隣接行列
トリクロロエチレン
PAKDD2008, 21 May 2008, Hiroki Arimura, Hokkaido University
DEWS2003 浅井達哉 「半構造データマイニングに関する研究動向」 63
FSG
[Kuramochi & Karypis (ICDM’01)]n FSG = Frequent SubGraph discovery
n
隣接行列表現を用いて,Apriori
風に頻出部分グラ フを発見するアルゴリズムv1 v2 v3 v4 v1 0 1 0 0 v2 1 0 1 2 v3 0 1 0 0 v4 0 2 0 0
A
A B
v2 A
v1
v3 v4
v1 v4 v3 v2 v1 0 0 0 1 v4 0 0 0 2 v3 0 0 0 1 v2 1 2 1 0 各ノードを,ラベルと ランクの組ごとに分割
PAKDD2008, 21 May 2008, Hiroki Arimura, Hokkaido University
64
[Vanetik, Gudes, Shimony (ICDM’02)]
n
頻出パスにbijective sum
とsplice
の操作を繰り返し,Apriori
風に頻出部分グラフを発見するアルゴリズムBijective sum Splice
PAKDD2008, 21 May 2008, Hiroki Arimura, Hokkaido University
DEWS2003 浅井達哉 「半構造データマイニングに関する研究動向」 65
gSpan
[Yan & Han (ICDM’02)]n gSpan = graph-based Substructure pattern mining
n
頻出部分グラフ発見アルゴリズムn
グラフのDFS
木を効率よく生成PAKDD2008, 21 May 2008, Hiroki Arimura, Hokkaido University
66
応用例:
化学物質データからの知識発見
化学物質の発ガン性を判定するルールを発見
データ
300種類の化学物質
(うち185種類が発ガン性あり) H X
C C
C C
C C
H
発ガン性あり
H
H H
X X
C C
発ガン性あり 見つかったルール
今日の内容
5
回:
木とグラフのマイニング/ Mining of Trees and Graphs
l 半構造データ / Semi-structured data
l 頻出順序木マイニング /Frequent Ordered Tree Mining
• ラベルつき順序木(ordered trees)
• 最右拡張手法(Rightmost expansion)
• アルゴリズム:FREQT
l 頻出無順序木マイニング/ Unordered tree mining l グラフマイニング / Graph mining
ポイント
l 木の列挙 / Enumeration of Trees