有村博紀北海道大学

(1)

1 2011/09/30 列挙学校（再放送）

有村博紀，北大

有村博紀北海道大学

大学院情報科学研究科

教材 PPT:

http://www-ikn.ist.hokudai.ac.jp/~arim/

rekkyo/rekkyo2011.ppt

列挙学校：第４コマ

パターンマイニングにおける列挙アルゴリズム

改訂版ver12: 2011/09/30

(2)

2 第１回2008 年9月12日

GCOE公開講座

自己紹介



氏名：　有村博紀（ありむらひろき）

 九州大学理学部物理学科を卒業 (1988 年卒 )

 北海道大学大学院情報科学研究科 (2004 ～ )

 妻あり子供二人ありの 46 才

 2004 年から北海道在住



主な研究経歴

 機械学習

 データマイニングアルゴリズム

 情報検索アルゴリズム

 大規模離散構造データ

（系列，木，グラフ，．．）を効率良く扱いたい

知識発見

複雑・小規模

単純・大規模

(3)

3 2011/09/30 列挙学校（再放送）

今回の講義：目次



４コマ目の目的：

データマイニングを題材に

 列挙はどんなことに使えそうか？

 応用分野では，列挙はどのように使われているか？

 データマイニング (DM) と機械学習における列挙に関連した話題を提供

 DM の歴史も少し紹介

http://www.sigkdd.org/kdd/1995/poster/kdd1995-poster-thumb.jpg

(4)

4 2011/09/30 列挙学校（再放送）

パート 1: データマイニングと頻出集合発見

 データマイニング

 頻出集合マイニング

パート２ : 構造データのマイニング

 木とグラフのマイニング

 列挙によるパターンのマイニング

 関連した話題

(5)

5 2011/09/30 列挙学校（再放送）

列挙学校：第４コマ

パート１ : データマイニングと頻出集合発見有村博紀

北海道大学大学院情報科学研究科コンピュータサイエンス専攻 email: {arim,kida}@ist.hokudai.ac.jp

http://www-ikn.ist.hokudai.ac.jp/~arim

．

(6)

データマイニングとは

 大量のデータから人間にとって有用なパターンや規則を半自動的にとりだす方法の研究

 1990 年代半ばから研究が盛んになった

• Apriori algorithm

[Agrawal, Srikant, VLDB1994]

 潜在的には古くからある研究の集大成．

• ただし，大量データに対する効率的計算に重点

 機械学習・数理統計学・

データベース技術の境界分野

2011/09/05, 富士通研勉強会，北海道大学, 有村博紀

6

知識 Knowledge

H XXHCC H

H H X XC C

半構造データ Semi-structured Data

半構造パターン発見 Pattern Discovery

2. 知識を

パターンや規則として発見

2. 知識を

1. 対象領域

の理解・データの前処理

1. 対象領域

の理解・データの前処理

3. 発見し

た知識の利用

3. 発見し

た知識の利用

(7)

8 2011/09/30 列挙学校（再放送）

Backgrounds

データマイニングのプロセスの全体

 1. 対象領域の理解

 2. データ集合の前処理

 3. パターンの発見（狭義のデータマイニング）

 4. 得られたパターンの解析

 5. 解析結果の利用

(8)

9 2011/09/30 列挙学校（再放送）

Finding/Discovering hidden knowledge from massive data

私のデータマイニングのイメージ

<vessels >

<shipping >

<iranian >

<port >

<iran >

<the gulf >

<strike >

<attack >

<silk worm missile>

<u.s.>

<wheat>

<dallers>

<sea men>

<strike >

<ships>

<gulf >

キーワード検索情報検索

直接目で見る

データマイニング

(9)

10 2011/09/30 列挙学校（再放送）

有用な規則・パターン・知識

HH H XXCC H H

H

XXCC 大規模データ

データマイニング

データマイニングの研究動向

不完全なデータから，

未知の規則を学習する

 SVM [Vapnik ‘96],

 Boosting [Shapire & Kearns

‘96]

C4.5 [Quinlan ‘96]

予測学習・自動分類

トランザクションデータから共通して出現する規則性を発見する

頻出パターン発見 [Agrawal et a. ‘94]

最適化マイニング [森下 ’ 96, ’98, ‘00]

パターン発見

非定型構造データから特徴的な部分構造を規則性を発見する

グラフマイニング

[Washio & Motoda ‘00], [Zaki ’02], [Uno, Asai, Arimura, ’02, ‘03]

構造マイニング

高次元大規模データから不確実な現象を予測・モデル化する

 ベイジアンネットワーク [Pearl ’90s]

 HMM [Asai], MCMC, ベイズ推定・MDL・ AIC

確率モデリング

テキストマイニング

自然言語テキスト情報抽出

意味マイニング

ストリームマイニング

センサー監視近似統計処理

新しいタイプのデータマイニング

データを類似したものどうしグルーピングする．

 大規模・不完全なデータからの高速クラスタリング

 K-means, CLARANS, DBSCAN

クラスタリング

(10)

10 Algorithms in Data Mining

• C4.5 -

決定木

• k-Means -

クラスタリング

• SVM ^-

計算学習

• Apriori -

頻出集合

• EM ^-

統計学習

• PageRank ^-

スペクトル

• AdaBoost ^-

計算学習

• kNN ^-

統計学習

• Naive Bayes ^-

統計学習

• CART ^-

計算学習

11 情報知識ネットワーク特論 DM1

In an effort to identify some of the most influential algorithms that have been widely used in the data mining community, the IEEE International Conference on Data Mining (ICDM) identified the top 10 algorithms in data mining for

presentation at ICDM '06 in Hong Kong.

X. Wu, V. Kumar, J. R. Quinlan, J. Ghosh, Q. Yang, H. Motoda, G. J. McLachlan, A. Ng, B. Liu, P.

S. Yu, Z.-H. Zhou, M. Steinbach, D. J. Hand and D. Steinberg: Top 10 Algorithms in Data Mining, Knowledge and Information Systems, 14(2008), 1: 1-37.

(11)

クラウド

「集中」

 大量のデータ

 多数のＣＰＵ

 高速なネットワーク

 膨大な計算

これからのデータマイニング

「ポストペタスケール時代」

「分散」

 さまざまなデバイス

 多様な人間活動と応用

 多様で非均一な時空間

 不完全で複雑なデータと情報

12 2011/09/30 列挙学校（再放送）

「ビッグデータ」

(IBM

^TM

:-)

• 超大規模で

• 不均質かつ不完全な

• 非構造データ

どうやってあつかうか？

「ビッグデータ」

(IBM

^TM

:-)

• 超大規模で

• 不均質かつ不完全な

• 非構造データ

どうやってあつかうか？

人間

(12)

情報に関する最近の話題情報に関する最近の話題

ワトソン君

• IBM リサーチ (2011/02/16)

• クイズ番組で人間に勝利！

• １００万冊の本を読んで回答

• 人工知能と自然言語，アルゴリズム，検索の技術で

クラウド計算

世界中の情報を計算！

From geek.com: Google server firm

http://www.geek.com/articles/chips/up-next-for-google- enterprise-wars-2009078/

From http://www-

06.ibm.com/ibm/jp/lead/ideasfromibm/watson/

米国の人気クイズ番組「ジョパディ！」のワトソン君クイズ王に挑戦中

2011/09/30 列挙学校（再放送）有村博紀，北大

(13)

14

データマイニング

知識 Knowledge

H H H XXCC H

H H X XC C

2.

知識を

2.

知識を

知識を有機的につなげる知識を有機的に

つなげる

3.

新たな

知識の構築

3.

新たな

知識の構築

 データマイニング



大量のデータから有用なパターンや規則を半自動的にとりだす方法の研究

 1990

年代半ばに顕在化．定型データを対象とする

 非構造／半構造データ

 多様な構造をもつ膨大なデータ

 新しいデータマイニング技術が必要 !

文科省科研費特定「発見科学」(H10- H12), 「情報学」(H13-H17), 基盤研究 B(H12-H14, H15-H17)等

文科省科研費特別推進研究（平成１７年～

１９年）「知識基盤形成のための大規模半構造データからの超高速パターン発見」

文科省GCOEプログラム「知の創出のための次世代情報基盤拠点」（H19-H23）

(14)

データマイニングの二つの柱

列挙

データに含まれるパターンを網羅的に見つける

パターン発見手法

最適化

目的関数を最大化するパターンを求める

機械学習手法

相補的な関係

と

_{湊真一先生}

（北大＆ ERATO ）

「列挙と最適化は車の両輪」 by S. Minato

(15)

16 2011/09/30 列挙学校（再放送）

パート 1 の１：

頻出集合マイニングの枠組み

～ (Agrawal & Srikant,1994)

(16)

Backgrounds



Frequent Itemset Mining

 Finding all "frequent" sets of elements (items) appearing no more than σ times in a given

transaction data base.

 Introduced by Agrawal and Srikant [VLDB'94]

 One of the most popular data mining problem

 Basis for more complicated / sophisticated data

mining problems

(17)

動機：結合ルールマイニング

トランザクション／レコードの意味

「レコード

003

の顧客は，ポテトチップとソーセージを一緒に買った」

 Association Rule Mining [Agrawal 1993/1994]

• Finding combination of “items” frequently appearing in a given database

• トランザクションデータ

• バスケットデータ

• 二値データベース

• レコード / タ

プル

• バスケット

•カラム / 属

性

•アイテム

(18)

ID Chips Mustard SausageSoftdrink Beer

001 1 0 0 0 1

002 1 1 1 1 1

003 1 0 1 0 0

004 0 0 1 0 1

005 0 1 1 1 1

006 1 1 1 0 1

007 1 0 1 1 1

008 1 1 1 0 0

009 1 0 0 1 0

010 0 1 1 0 1

アイテム集合

X

の意味

X =

「マスタードとソーセージ，ビールを一緒に買う人」が全体の

40

％いた

動機：結合ルールマイニング

 Frequent Itemset X = { Mustard, Sausage, Beer }

• with support/frequency 40%

•アイテム集合 X

•X の出現リスト

Occ(X) = {002, 005, 006, 010}

•X の頻度（サポート）

freq(X) = |Occ(X)|

= 4/10 = 40%

(19)

頻出アイテム集合マイニング

Frequent Itemset Mining Problem

 Given: A database DB over a set Σ of items, and a number σ

≧ 0 called “minsup” (minimum support)

 Problem: Find all itemsets X⊆ Σ appearing in no less than σ records of DB

20 20

(20)

21 2011/09/30 列挙学校（再放送）

Definitions: Database

 A set Σ = { 1, ..., n } of items (elements)

 Transaction database

− A set T = { t

₁

, ..., t

_m

} of subsets of Σ

− Each subset t ⊆ Σ is called a tuple (record)

21

id tuples 1 1, 3

2 2, 4

3 1, 2, 3, 4 4 1, 2, 4

21

I = {1,2,3,4}

Alphabet of items

Transaction database

(21)

22 2011/09/30 列挙学校（再放送）

Definitions: Itemset lattice

22

 Item set: any subset X ⊆ Σ = { 1, ..., n }

 (Item) set lattice L = (2

^Σ

, ⊆ )

− The power set 2

^Σ

= { X : X ⊆ Σ }

− The subset relation⊆ over 2

^Σ

Example:

The set lattice

forΣ = {1,2,3,4}

(22)

23 2011/09/30 列挙学校（再放送）

Definitions: Frequent sets

 An itemset X appears in a tuple t: X ⊆ t

 The occurrence of X in a database T:

Occ(X, T) = { t ∈ T : X ⊆ t }

 The frequency of X: Fr(X, T) = | Occ(X, T) |

 Minimum support (minsup): an integer 0 ≦ σ ≦ |T|

 X is σ-frequent (frequent) in T if Fr(X, T) ≧ σ.

23 23

I = {A,B,C,D}

Alphabet of items Transaction database

Occ(24, T) = {2, 3, 4}, Fr(24, T) = 3

Occurrences and frequences of itemsets

Occ(3, T) = {1, 3}

Fr(3, T) = 2

Occ(3, T) = {1, 3}

Fr(3, T) = 2

id tuples 1 1, 3

2 2, 4

3 1, 2, 3, 4

4 1, 2, 4

(23)

24 2011/09/30 列挙学校（再放送）

Definitions: Frequent sets

24

 The occurrence of X in a database T:

Occ(X, T) = { t

∈

T : X

⊆

t }

 X is σ-frequent (frequent) in T if Fr(X, T) = | Occ(X, T) |

≧

σ.

Frequent sets

∅,

1, 2, 3, 4, 12, 13, 14, 23, 24, 124 Frequent sets

∅,

1, 2, 3, 4, 12, 13, 14, 23, 24, 124

Frequent sets

minsupσ= 2

The itemset lattice (2

^Σ

, ) ⊆

(24)

25 2011/09/30 列挙学校（再放送）

Definitions: Problem



Frequent Itemset Mining Problem

 Given: A transaction database T and a non-negative integer 0 ≦ σ ≦|T|

 Task: Enumerate all "frequent" itemsets X in T that have frequency at least σ (Fr(X)≧σ)

 F: the class of all σ-frequent itemsets

 The number |F| of solutions is exponential in the number n of items.

 a typical enumeration problem.

25

(25)

28 2011/09/30 列挙学校（再放送）

Computational Complexity of Data Mining Algorithms

28



Output-polynomial (OUT-POLY)



Total time = poly(N, M)



polynomial-time enumeration, or amortized polynomial-delay

(POLY-ENUM)



Amotized delay is poly(Input), or



Total time = M·poly(N)



polynomial-delay (POLY-DELAY)



Maximum of delay is poly(Input)

Modeling data mining as enumeration

 Idea: Mesure the computation time per solution

+ polynomial-space (POLY-SPACE)

Output size M

Delay D

Input

Input size M

Total Time T Algorithm

Outputs

Time

(26)

29 2011/09/30 列挙

学校（再放送）

Computational Complexity of Data Mining Algorithms

29

+ polynomial-space (POLY-SPACE)

Output size M

Delay D

Input

Input size M

Total Time T Algorithm

Outputs

Time

Ultimate Goal:

To design

polynomial delay and

polynomial space algorithm

for a given data mining problem

(27)

30 2011/09/30 列挙

パート 1 の 2 ：

頻出集合マイニングの

幅優先 (BFS) 型アルゴリズム

～ Apriori アルゴリズム (Agrawal &

Srikant,1994)

～頻出集合マイニングの最初のアルゴリズム

Rakesh Agrawal, Ramakrishnan Srikant: Fast Algorithms for Mining Association Rules

in Large Databases. VLDB 1994: 487-499.

～データマイニングで最も有名な論文の一つ

(Google Scholar での引用数 : 11215 件 , Sep .2011 ）

(28)

頻出集合マイニング

• データベースの与えられた個数以上の組に含まれる集合（アイテム集合）をすべて見つける問題

• 二つの部分からなる

31 2011/09/30 列挙

• 定められたクラスに属するパターンをすべて

• 生成する組み合わせ列挙

• データベース中での各パターンの出現回数や場所を計

• 算パターンマッチン

• グ埋め込み計算

パターン列挙頻度計算

(29)

頻出集合マイニングの素朴なアルゴリズム

2011/09/30

（再放送）列挙学校有村博紀，北大

32

(30)

頻出集合マイニング

素朴なアルゴリズム (Piatetsky-Shapiro?, KDD'88)

• すべての部分集合を生成し，

• データベースからその頻度を計算する

欠点

• 解となる頻出集合の数に関係なく，

指数個の集合を生成してしまう．

• 頻度計算に（すごく）時間がかかる．

• 百数十アイテム数百レコードに対して，数時間～数日を要する！

33 2011/09/30 列挙

 All the candidate itemsets is stored in a dictionary (a hash tree) in main memory.

34

Dr. Agrawal IBM Almaden Lab.

Prof. Mannila Helsinki Inst. Tech.

Univ. Helsinki

? ?

(32)

Basic idea: “Anti-monotonicity” lemma

35

 Lemma (Agrawal, Srikant): For every minsup σ, if Y is σ-frequent then any subset X of Y is also σ-frequent in DB T.

 Corollary: Every non-empty σ-frequent set Y is obtained from some σ-frequent set X by adding some new element.

 The class F

σ

of all frequent sets is monotone subclass of (2

^Σ

, ⊆ )

Anti-monotone property

child Y

parent X

child Y parent

X

minsupσ= 2

Frequent sets

∅,

1, 2, 3, 4, 12, 13, 14, 23, 24, 124 Frequent sets

∅,

1, 2, 3, 4, 12, 13, 14, 23, 24, 124

The itemset lattice (2

^Σ

, ) ⊆

解の境界

(33)

Cadidate Generation by BFS

1 2 3 4 5

t1 ○ ○

t2 ○ ○

t3 ○ ○ ○ ○

t4 ○ ○ ○

t5 ○ ○ ○

database

Apriori algorithm [Agrawal, Srikant, 1994]

(Also called Level-wise

[Mannila et al., 1994]

)



Slice the item set-lattice 2

^Σ

into levels L

0

, L

1

, …., L

n

()



Compute F

k

= { X

∈ Lk

: X is a frequent set of size k } for k = 0,1,2, … in a bottom-up manner



Generation of the k-th candidate set C

k+1 from Fk

:

• For each a₁…a_k and b₁…b_k∈ F_i with a₁=b₁…a_k-1=b_k-1 and a_k< b_k, add a1…akbk into Ci

level 0 level 1 level 2 level 3 level 4

Φ

1 2 3 4

12 13 14 23 24 34

123 124 134 234

1234

(34)

37 2011/09/30 列挙学校（再放送）

Basic Idea : Cadidate Generation by BFS

1 2 3 4 5

t1 ○ ○

t2 ○ ○

t3 ○ ○ ○ ○

t4 ○ ○ ○

t5 ○ ○ ○

database

Apriori algorithm [Agrawal, Srikant, 1994]

(Also called Level-wise

[Mannila et al., 1994]

)



Slice the item set-lattice 2

^Σ

into levels L

0

, L

1

, …., L

n

()



Compute F

k

= { X

∈ Lk

: X is a frequent set of size k } for k = 0,1,2, … in a bottom-up manner



Generation of the k-th candidate set C

k+1 from Fk

:

• For each a₁…a_k and b₁…b_k∈ F_i with a₁=b₁…a_k-1=b_k-1 and a_k< b_k, add a1…akbk into Ci

level 0 level 1 level 2 level 3 level 4

Φ

1 2 3 4

13 14 23 24 34

123 124 134 234

1234 12

Frequent sets F

(35)

Breadth-first search algorithm



アルゴリズム Apriori ^(T:

データベース

, 0 ≤σ≤|T|: minsup):

出力：全頻出集合の集合

F.

サイズ

1

の頻出集合の全体

F

₁を計算する．

i = 2.



while (F

_i-1

/= ) ∅ do //

ステージ

i:

 STEP1

（候補生成

AprioriGen ） : F

_i-1のサイズ

(i-1)

の頻出集合同士を組合わせ，サイズ

i

の候補集合の集まり

C

_iを計算する．

 STEP2

（頻度計算）

:

データベースを一度だけ走査し，各候補

集合

X C ∈

iの頻度

Freq(X)

を一度に計算する．

F

_i

= . ∅

 STEP3: C

_iから頻度

σ

以上のすべての候補集合を取り出し，

F

_i に入れる．

 STEP4: i = i + 1.



return F = F

₁

∪ … F ∪

i を出力する．

40

(36)

A pool of patterns

Sequential Disk Scan



The state-of-the-art algorithm for association rules



Clever use of the present computer technology

mustard: 24

chips: 124 coke: 254 sausage: 62

beer: 128

d = 1

Apriori アルゴリズム

[Agrawal 1994, Mannila 1995]

coke & hamburger & chips: 24

d = 3

d = 2

sausage & beer coke & chips: 48 coke & hamburger: 52

Medium sized Main memory

Huge Hard disk

Fast Processor

+

Fast counting by Hash tree

Levelwise search

+

Large collection

of data

(37)

議論

Apriori アルゴリズム (Agrawal '94) の意義

• 当時（１９９０年代半ば）の計算機の技術トレンドを最大限に活用：　

それ以前に比べて，格段に大きくなったメモリ（

主記憶）と，安価で高速なＣＰＵ，低速だが極めて大容量のハードディスク

2011/09/30 列挙学校（再放

送）有村博紀，北大 42

Mohammed Javeed Zaki, Srinivasan Parthasarathy, Mitsunori Ogihara, Wei Li: New Algorithms for Fast Discovery of Association Rules. KDD 1997: 283-286; ("prefix equivalence class" method) Shohei Hido, Hiroyuki Kawano: AMIOT: Induced Ordered Tree Mining in Tree-Structured

Databases. ICDM 2005: 170-177

(38)

議論

Apriori アルゴリズム (Agrawal '94) の

「候補生成法」（ AprioriGen 手続き）

は，本当に役立っているか？

• 集合の族の場合，列挙自体はもっと簡単なアルゴリズムでできる ( 例：家系木法 /DFS)

．

• ただし，木やグラフの族の場合，頻度制約を考えると，効率良い場合がある．

2011/09/30 列挙学校（再放

Mohammed Javeed Zaki, Srinivasan Parthasarathy, Mitsunori Ogihara, Wei Li: New Algorithms for Fast Discovery of Association Rules. KDD 1997: 283-286; ("prefix equivalence class" method) Shohei Hido, Hiroyuki Kawano: AMIOT: Induced Ordered Tree Mining in Tree-Structured

Databases. ICDM 2005: 170-177

(39)

44 2011/09/30 列挙学校（再放送）

パート 1 の 3 ：

頻出集合マイニングの

深さ優先 (DFS) 型アルゴリズム

～ Eclat アルゴリズム (Zaki, SDM 2000)

～ (Sese & Morishita, PODS 2000)

～ (Bayardo SIGMOD’98) その他

Mohammed Javeed Zaki, Srinivasan Parthasarathy, Mitsunori Ogihara, Wei Li: New Algorithms for Fast Discovery of Association Rules. KDD 1997:

283-286

Shinichi Morishita, Jun Sese: Traversing Itemset Lattice with Statistical Metric Pruning. PODS 2000: 226-236

(40)

ＢＦＳ型アルゴリズムの欠点

Apriori アルゴリズム (Agrawal et al., '94) ：重複解の回避のため，解をすべて保持する

メモリを使いすぎる！

2011/09/30 列挙学校（再放

(41)

46 2011/09/30 列挙学校（再放送）

深さ優先型 (DFS) アルゴリズム



「バックトラック」アルゴリズム

 第 2 世代の頻出集合発見アルゴリズム

− Eclat [Zaki et al., KDD’97]

− [Morishita DS’98], [Bayardo SIGMOD’98]

− LCM [Uno ら , FIMI'03,'04,DS'04]



アルゴリズムのキモ

 主記憶型の単純な再帰アルゴリズム

 集合の家系木の上で「深さ優先探索」 (DFS) をする．

 「垂直データレイアウト」を採用．

 メモリ効率が良くて，実際に高速．

 いわゆる「パターン成長アプローチ」 "Pattern growth"

46

(42)

48 2011/09/30 列挙学校（再放送）

Two approaches to Frequent sets mining

Backtrack algorithm [1997-1998]

• Depth-first search (DFS)

• Vertical data layout

1 2 3 4 5

t1 ○ ○

t2 ○ ○

t3 ○ ○ ○ ○

t4 ○ ○ ○

t5 ○ ○ ○

database

Apriori algorithm [1994]

• Breadth-first search (BFS)

• Horizontal data layout

• 2

^nd

generation

• In-core algorithm

• Space efficient

• 1

^st

generation

• External memory

algorithm

(43)

DFS(Depth-first search) algorithm

49



DFS algorithm

 How to generate all subsets X⊆ Σin depth-first search?

 Use enumeration of subsets! (

岡本先生の講義の「部分集合列挙」

)

The set lattice for Σ = {1,2,3,4}

σ = 2

F = { , 1, 2, 3, 4, ∅ 12, 13, 14, 23, 24, 124 }

The set lattice L = (2

^Σ

, ⊆ )

(44)

The set enumeration tree (the family tree)

50

 A spanning tree for all frequent itemsets

 Assign the unique parent Pa(Y) = Y { max(Y) } =: X to each non-empty frequent set Y.

 Given a parent X, we can generate its children Y = X∪ { i } by attaching every item i satisfying i > max(X).

The set lattice for Σ = {1,2,3,4}

The set enumeration tree for F

σ = 2

F = { , 1, 2, 3, 4, ∅ 12, 13, 14, 23, 24, 124 }

(45)

51 2011/09/30 列挙学校（再放送）

DFS algorithm for Frequent Itemset Mining



Σ = {1, ..., n}: アイテムの全体集合



Algorithm Backtrack // 主プログラム

 BacktrackExpand(∅, Occ(∅), 0, n);



procedure BacktrackExpand(X, Occ(X), k, n)

 Input: X:itemset, Occ(X), k: tail, n: maximum item;

 if Freq(X) = |Occ(X)| < σ then return; //Backtrack!

 Print a frequent set X;

 for i = k+1,...,n do ：

− 新しい出現リスト Occ(X {i}) ∪ を計算する ;

− BacktrackExpand(X {i} ∪ , Occ(X {i}) ∪ , i, n) ;

51

(46)

52 2011/09/30 列挙学校（再放送）

効率良い頻度計算

52

 Lemma: For any X1, X2, Occ(X ∪ Y) = Occ(X) ∩ Occ(Y).

 Lemma: For any set X, item a, Occ(X∪{a}) = { t∈ Occ(X) : a t ∈ }

Downward closure [Zaki et al ‘97]

52

1 2 3 4 5

t1 t2 t1 t2 t4 t3 t3 t3 t3 t5 t4 t4 t5

t5

1

t1 t3 t5

2

t2 t3 t4 t5

2 3

t3 t4

2 4

t2 t3 t5

 t1 t2 t3 t4 t5

3

4 2

1

Occ(2 4) Occ(2)

Occ()

UpdateOcc

Occ(1)

Occ(2 3)

tuples

items

t1 1 2 3 4

t2 2 4 5

t3 1 2 4 5 …

t4 2 3

t5 1 2 4

database D

Vertical layout:

The occurrence list representation

items

DFS over the set enumeration tree

アイテムを足すごとに各集合の出現リストを更新する

(47)

53 2011/09/30 列挙学校（再放送）

Efficient update of Occ(X)



手続き UpdateOcc(X, a, O(X))

Input ： X⊆Σ と , a ∈Σ, O(X).

Output: O(X {a} ∪ ).

 Y = X {a} ∪ ; Occ(Y) = ; ∅

 foreach t O(X) do ∈

− if a t ∈ then Occ(Y) = Occ(Y) {t} ∪ ;

 return Occ(Y);

53



Properties:

 For any X1, X2, Occ(X ∪ Y) = Occ(X) ∩ Occ(Y). (basic)

 For any set X, item a, Occ(X∪{a}) = { t∈ Occ(X) : a t ∈ }

Downward closure [Zaki et al ‘97]

古い出現リストか

ら，一個ずつ位置

を取り出して，ま

だ大丈夫か調べる

古い出現リストか

ら，一個ずつ位置

を取り出して，ま

だ大丈夫か調べる

(48)

DFS: Main result



Theorem

 The algorithm Backtrack can be implemented to enumerates all σ-frequent itemsets X in a given transaction database T

− in O( l ・ |Occ(X)|) time per frequent itemset

− using O( l ・ |Occ(X)|) space

 where l is the maximum length of tuples in T and | Occ(X)| is the number of occurrences of X.

 Space and time efficient (poly-delay & poly-space)

54

 On the other hand, the algorithm Apriori requires O( l ・ |

Occ(X)|) time per frequent itemset and O(max

_i

|| F

_i

^{||) space.}

(49)

55 2011/09/30 列挙学校（再放送）

Summary: BFS vs. DFS

Backtrack algorithm [1997-1998]

• Depth-first search (DFS)

• Vertical data layout

x1 x2 x3 x4 x5

t1 ○ ○

t2 ○ ○

t3 ○ ○ ○ ○

t4 ○ ○ ○

t5 ○ ○ ○

database

Apriori algorithm [1994]

• Breadth-first search (BFS)

• Horizontal data layout

• 2

^nd

generation

• Space efficient

• In-core algorithm

• 1

^st

generation

• External memory

algorithm

(50)

56 2011/09/30 列挙学校（再放送）

Summary: Horizontal & Vertical Layout

56

1 2 3 4 5

t1 ○ ○

t2 ○ ○

t3 ○ ○ ○ ○

t4 ○ ○ ○

t5 ○ ○ ○

tu p le s

データベース

D

t1 1 3 t2 2 4

t3 1 2 3 4 t4 2 3

t5 1 2 4

1 2 3 4 5

t1 t2 t1 t2 t4 t3 t3 t3 t3 t5 t4 t4 t5

t5

垂直

/ Vertical layout (Backtrack

アルゴリズ

ム

)

水平

/ Horizontal layout

(Apriori

アルゴリズム

)

1 t2 t3 t5

12 t3 t5

候補アイテム集合の出現数をまとめて計数する

Scan

set freq

123 1 124 2 134 1 234 1 245 2

候補アイテム集合の出現リストを，

追加ごとに更新する

Scan

124 t3 t5

外部記憶向き主記憶

向き

(51)

* 1 2 5 6 7 9 7 8 9 7 9

2 3 4 5 7 9 7 9

A C B

E F

D

57 2011/09/30 列挙学校（再放送）

FP-growth アルゴリズム

 最もよく知られたアルゴリズムの一つ（Hanら，SIGMOD2000）

 パターンとデータを木構造(FP-Tree ="Frequent Pattern Tree") に格納する．当初は「候補生成なしマイニング」と呼ばれていた．

 現在では，DFS 法の変種と理解（同じパターン列挙と出現計算）

 実際のデータに対して高速といわれている

J. Han, J. Pei, Y. Yin, Mining Frequent Patterns without Candidate Generation, Proc. SIGMOD2000, 1-12 (2000)

頻出パターン木

(FP-Tree)

データベース木

27 あ頻出集合

Φ

{1} {2} {7}

{9} {1,7}

{1,9} {2,7}

{2,9} {7,9}

{1,7,9}

{2,7,9}

頻出集合

Φ

{1} {2} {7}

{9} {1,7}

{1,9} {2,7}

{2,9} {7,9}

{1,7,9}

{2,7,9}

φ

1

7 9 9 2

7

9 9

7 9 179

19

29 79 9

φ 2 7 27 27

データベース A: 1,2,5,6,7,9 B: 2,3,4,5 C: 1,2,7,8,9 D: 1,7,9 E: 2,3,7,9 F: 2,7,9

9

9 9

9

集合 279 の出現リスト

2 7

7 2

7 7

集合 27 の出現リスト

9 279

集合 27

(52)

58

ＬＣＭアルゴリズム　 ^（ Linear-time Closed Itemset Miner ）

 飽和集合／極大集合マイニングアルゴリズム

 理論的性能保証：出力線形時間アルゴリズム

 "The world-fastest closed set mining algorithm"

(according to IEEE ICDM "the top-10 data mining algorithms" article)

 飽和集合／極大集合マイニングアルゴリズム

 理論的性能保証：出力線形時間アルゴリズム

 "The world-fastest closed set mining algorithm"

(according to IEEE ICDM "the top-10 data mining algorithms" article)

くわしくは宇野毅明の講義で．．．

• LCM, 宇野先生HP, program codes ：http://research.nii.ac.jp/~uno/codes.htm

• FIMI Frequent Itemset Mining Implementations Repository: http://fimi.cs.helsinki.fi/

 Feature extraction for SVM & Boosting

 Closed and Maximal Frequent Itemset Mining

 Sequences and Graphs

(54)

60 2011/09/30 列挙学校（再放送）

Bibliography

パート1 ：データマイニングと頻出集合発見

• [AIS 93] R. Agrawal, T. Imielinski, A. N. Swami: Mining Association Rules between Sets of Items in Large Databases, Proc. SIGMOD Conference 1993, pp. 207-216, 1993.

• [AMST 96] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, A. I. Verkamo: Fast Discovery of Association Rules, Advances in Knowledge Discovery and Data Mining, pp. 307-328, 1996.

• [Agrawal & Srikant, VLDB’94] R. Agrawal, R. Srikant: Fast Algorithms for Mining Association Rules in Large Databases. Proc. VLDB 1994, pp. 487-499, 1994. (Apriori)

• [Bayardo, SIGMOD’98] R. J. Bayardo Jr.: Efficiently Mining Long Patterns from Databases, Proc. SIGMOD Conference 1998: pp. 85-93, 1998.

• [Han et al. SIGMOD’00] J. Han, J. Pei, Y. Yin, Mining Frequent Patterns without Candidate Generation, Proc. SIGMOD Conference 2000, pp. 1-12, 2000. (FP-growth)

• [Morishita & Sese PODS’00] S. Morishita, J. Sese: Traversing Itemset Lattice with Statistical Metric Pruning, Proc. PODS 2000, pp. 226-236, 2000.

• [Zaki et al., KDD’97] M. J. Zaki, S. Parthasarathy, M. Ogihara, W. Li, New algorithms for fast discovery of association rules, Proc. KDD 1997, pp. 283-286, 1997. (Eclat)

• [ 宇野，有村] 宇野毅明，有村博紀, 「データインテンシブコンピューティングその２－頻出アイテム集合発見アルゴリズム－」 ,人工知能学会誌，レクチャーシリーズ「知能コンピューティングとその周辺」,

( 編) 西田豊明 , 第2 回 , Vol.22, No.3, 2007 年5 月 . (PDF: http://www- ikn.ist.hokudai.ac.jp/~arim/jtalks.html)

(55)

61 2011/09/30 列挙学校（再放送）

列挙学校：第４コマ

パート２ : 系列とグラフのマイニング

有村博紀

北海道大学大学院情報科学研究科コンピュータサイエンス専攻 email: {arim,kida}@ist.hokudai.ac.jp

http://www-ikn.ist.hokudai.ac.jp/~arim

．

(56)

62 2011/09/30 列挙学校（再放送）

パート２の１：

半構造マイニング

～頻出グラフマイニングの定式化（ Inokuchi, Washio, Motoda, PKDD2000)

その他

(57)

64

半構造データマイニング

知識 Knowledge

H H H XXCC H

H H X XC C

2.

知識を

2.

知識を

知識を有機的につなげる知識を有機的に

つなげる

3.

新たな

知識の構築

3.

新たな

知識の構築

 半構造データ

 多様な構造をもつ膨大なデータ

 従来のデータマイニング手法は，

半構造データには直接適用不可能

 新しい半構造データマイニング技術が必要



高速かつ頑健なマイニング技術が鍵

文科省科研費特別推進研究（平成１７年～

１９年）「知識基盤形成のための大規模半構造データからの超高速パターン発見」

2011/09/30 列挙学校（再放送），有村博紀，北大

(58)

65

Applications of Semi-structured Data Mining

 Link structures in Web

 Gene Networks...

H

H H

X X

C C

(59)

66

半構造データマイニングの歴史

～ 1995 1996 1997 1998 1999 2000 2001 2002

2003

Algorithm for finding subgraphs by MDL principle Subdue [Holder et al. (KDD’94)]

Finding frequent paths [Wang and Liu (KDD’97)]

Finding Semi-structured Schema

[Nestrov, Abiteboul et al. (SIGMOD’98)]

Finding frequent subgraphs

AGM [Inokuchi, Wahio, Motoda (PKDD’00, MLJ. 2003)]

Finding frequent / optimal ordered trees

FREQT [Ours (SDM’02)]

，

Treeminer [Zaki (KDD’02)]

Finding frequent subgraphs

gSpan [Yan and Han (ICDM’02)], VG [Venetik, et al.(ICDM’02)]

Finding frequent un-ordered trees

UNOT [Ours (SDM’03)]

，

NK [Nijssen, Kok (MGTS’03)]

FSG [Kuramochi et al. (ICDM’01)

Frequent Maximal/Closed Trees & Graphs mining [Yan&Han '03;

Termier et al.'04] and many algorithms in 2000s

(60)

68 2011/09/30 列挙学校（再放送）

パート２の２：

頻出順序木のマイニング

～ TreeMiner (Zaki, KDD2002)

～ Freqt (Asai et al., SDM2002)

～ Nakano (IPL, 2002)

Mohammed Javeed Zaki: Efficiently mining frequent trees in a forest. KDD 2002: 71-80.

Tatsuya Asai, Kenji Abe, Shinji Kawasoe, Hiroki Arimura, Hiroshi Sakamoto, Setsuo Arikawa: Efficient Substructure Discovery from Large Semi-

structured Data, SDM 2002.

Shin-Ichi Nakano: Efficient generation of plane trees. Inf. Process. Lett.

84(3): 167-172 (2002)

(61)

FIT2008, 有村博紀，北海道大学

Frequent Pattern Mining for

Ordered Trees

Minining ordered and

unordered trees

(62)

70

Tree Mining

Association Rule, Frequent Itemsets

Simple

Efficient algorithms

Expressive

Computationally Hard General graphs

(AGM, FSG, gSpan)

Trees Development of

Efficient tree mining algorithms

Expressive patterns & Efficient algorithms

(63)

FIT2008, 有村博紀，北海道大学

71 71

研究成果：高速半構造マイニング

 高速な木パターン発見エンジン

 半構造データの特徴的部分構造の発見

 理論と実装： FREQT, StreamT, Unot

 SIAM DM'02, PKDD'02, IEEE ICDM'02), DS'03

 公開＆応用

FREQT: DEWS'02優秀論文賞（H14 年6 月）Unot: DEWS’04優秀論文賞（H16年6 月）

MaxMotif: Awarded DEWS2004 （H17.

6）

StreamT: ’０３AI学会大会優秀賞

（H15.6）

(64)

72

 Labeled Ordered Tree

 根付き (rooted)

 順序木 ( 兄弟の順序がある )

 ラベル付き ( 各ノードはラベルをもつ )

 例

 HTML/XML 文書

 階層レコード

 自然言語の「かがり受け木」

(dependency tree).

email people

@age @id

608

name

#text

#text person

name

@id

609

tel

#text

#text person

25

[email protected]

John

555-4567

Mary

パターンとデータ：ラベル付き順序

.

順序木のマッチング（パターン照合）

パターン木 T がデータ木

D にマッチする ⁽

^出現する

⁾ There is a matching

function  from T into D

r

C

B A

C B

data tree D

pattern tree T

1

2

3 4 5

6

7

8

9 10

11

有村 博紀 北海道大学