
(1)

Highly Scalable Graph Search for the Graph500 Benchmark

Koji Ueno (1,2) and Toyotaro Suzumura (1,2,3)

(2)

Outline

Background

Breadth-First Search and Distributed Algorithm

2D Partitioning based Breadth-First Search

Optimizations

Performance Evaluation

Conclusions

(3)

Outline

Background

Breadth-First Search and Distributed Algorithm

2D Partitioning based Breadth-First Search

Optimizations

Performance Evaluation

Conclusions

(4)

Large-Scale Graph Mining is Everywhere

[Figure: example application domains — Internet map, symbolic networks (human brain), protein interactions, social networks, cybersecurity, medical informatics, data enrichment]

(5)

New Graph-Search-Based Benchmark for Ranking Supercomputers

BFS (Breadth-First Search) from a single vertex on a static, undirected Kronecker graph with average vertex degree 32.

Reference MPI, OpenMP, and shared-memory implementations are provided.

Graph500 (http://www.graph500.org)

(6)

Benchmark Flow

Graph (Edge List) Generation
→ Kernel 1: Graph Construction (conversion to any space-efficient format such as CSR or CSC)
→ Kernel 2: Breadth-First Search (main kernel)
→ Validation of the BFS tree (5 rules)

Kernel 2 and the validation are iterated 64 times, once for each of 64 sampled search keys.

(7)

Evaluation criteria: TEPS (Traversed Edges Per Second) and the problem size that can be solved on the system. In this presentation, 1 GTEPS = 1,000,000,000 TEPS.

Problem Size: SCALE
# of Vertices = 2^SCALE, # of Edges = 2^(SCALE+4)

Graph500 (http://www.graph500.org), Nov 2011 List:

Rank  Machine                                                        Owner                                        Problem Size  GTEPS
1     NNSA/SC Blue Gene/Q Prototype II (4096 nodes / 65,536 cores)   NNSA and IBM Research, T.J. Watson           32            254.349
2     Lomonosov (4096 nodes / 32,768 cores)                          Moscow State University                     37            103.251
3     TSUBAME (2732 processors / 1366 nodes / 16,392 CPU cores)      GSIC Center, Tokyo Institute of Technology   36            100.366
4     Jugene (65,536 nodes)                                          Forschungszentrum Jülich                     37            92.8769
5     Intrepid (32,768 nodes / 131,072 cores)                        ANL                                          35            78.8699

(8)

Outline

Background

Breadth-First Search and Distributed Algorithm

2D Partitioning based Breadth-First Search

Optimizations

Performance Evaluation

Conclusions

(9)

Breadth-First Search

Outputs for each vertex:
Predecessor – the parent vertex in the BFS tree
Depth – the distance from the source vertex

Level Synchronized BFS [David ICPP'06]
The base algorithm of all reference implementations.
"Level synchronized" means all vertices of a particular level are inspected before the next level.

[Figure: an example graph with source vertex S and vertices B, C, D, E, F, G, H, and its BFS tree, with depth shown as Level 0, Level 1, Level 2]

(10)

Level Synchronized BFS
Demonstration (1/5)

Two queues are employed for level-synchronized BFS: the Current Queue (CQ) and the Next Queue (NQ).
We start BFS by putting the source vertex into CQ.

[Figure: CQ = {S}, NQ is empty; the BFS tree contains only S at Level 0]

(11)

Level Synchronized BFS
Demonstration (2/5)

Neighbor vertices of the vertices in CQ are queued into NQ and added to the BFS tree.

[Figure: CQ = {S}, NQ = {B, C, D}; the BFS tree now has S at Level 0 and B, C, D at Level 1]

(12)

Level Synchronized BFS
Demonstration (3/5)

After NQ and CQ are swapped, the next level is processed.

[Figure: CQ = {B, C, D}, NQ = {E, F}; the BFS tree now reaches Level 2. The "Duplicate Request" annotation marks a case where two vertices in CQ reach the same neighbor.]

(13)

Level Synchronized BFS
Demonstration (4/5)

[Figure: CQ = {E, F}, NQ = {G, H}; the BFS tree now has G and H at Level 3]

(14)

Level Synchronized BFS
Demonstration (5/5)

If NQ is empty, we finish BFS.

[Figure: CQ = {G, H}, NQ is empty; the complete BFS tree spans Levels 0 to 3]
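To make the CQ/NQ scheme concrete, here is a minimal serial sketch of level-synchronized BFS in C++. It is illustrative only, not the benchmark code: the CSR structure and names are assumptions, and the distributed, 2D-partitioned version described later replaces these simple queues with partitioned bitmaps.

```cpp
#include <cstdint>
#include <vector>

// Graph in CSR form: row_starts[v]..row_starts[v+1] indexes into adjacency.
struct CsrGraph {
    std::vector<int64_t> row_starts;
    std::vector<int64_t> adjacency;
    int64_t num_vertices() const { return (int64_t)row_starts.size() - 1; }
};

// Level-synchronized BFS: all vertices of the current level (CQ) are
// inspected before any vertex of the next level (NQ) is processed.
// Outputs predecessor (parent in the BFS tree) and depth for every vertex.
void level_synchronized_bfs(const CsrGraph& g, int64_t source,
                            std::vector<int64_t>& pred,
                            std::vector<int64_t>& depth) {
    const int64_t n = g.num_vertices();
    pred.assign(n, -1);
    depth.assign(n, -1);
    std::vector<bool> visited(n, false);   // VISITED

    std::vector<int64_t> cq = {source};    // Current Queue
    std::vector<int64_t> nq;               // Next Queue
    visited[source] = true;
    pred[source] = source;                 // root is its own parent
    depth[source] = 0;

    int64_t level = 0;
    while (!cq.empty()) {                  // finish when NQ was empty
        nq.clear();
        for (int64_t u : cq) {
            for (int64_t e = g.row_starts[u]; e < g.row_starts[u + 1]; ++e) {
                int64_t v = g.adjacency[e];
                if (!visited[v]) {         // skip already-visited vertices
                    visited[v] = true;
                    pred[v] = u;
                    depth[v] = level + 1;
                    nq.push_back(v);
                }
            }
        }
        cq.swap(nq);                       // swap CQ and NQ, advance level
        ++level;
    }
}
```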

(15)

Representation as Sparse Matrix Vector Multiplication (SpMV)

Level-synchronized BFS can be expressed as sparse matrix-vector multiplication (SpMV), which is convenient for distributed computing.
Processing of Level 2 from the previous example is illustrated in the following slides.

[Figure: the example graph with the Level 2 frontier marked, together with the CQ and NQ vectors]

(16)

Representation as SpMV
- Demonstration (1/4)

Example: processing Level 2.

[Figure: the adjacency matrix (sparse matrix) over the vertices S, B, C, D, E, F, G, H is multiplied by the Current Queue vector CQ; the VISITED vector is then excluded to obtain NQ]

(17)

Representation as SpMV
- Demonstration (2/4)

The "edge frontier" is the set of all edges emanating from the vertices in CQ.
The edge frontier is obtained by multiplying the adjacency matrix with CQ.

[Figure: adjacency matrix × CQ yields the edge frontier, before VISITED is excluded]

(18)

Representation as SpMV
- Demonstration (3/4)

The multiplication gives the neighbor vertices. NQ is then computed by excluding visited vertices (VISITED) from these neighbor vertices.

[Figure: the VISITED vector is excluded from the edge frontier to give NQ]

(19)

Representation as SpMV
- Demonstration (4/4)

[Figure: the complete SpMV formulation for Level 2: the adjacency matrix multiplied by CQ, with VISITED excluded, yields NQ]
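The per-level SpMV update shown in these slides (multiply the adjacency matrix by CQ to obtain the edge frontier, then exclude VISITED to obtain NQ) can be sketched over bitmaps as follows. This is an illustrative single-node sketch with assumed helper names (test_bit, set_bit, spmv_level), not the distributed implementation.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Vertices are bits of 64-bit words; the adjacency matrix A is stored as
// CSR adjacency lists. One BFS level computes:
//   frontier = A * CQ              (neighbors of the CQ vertices)
//   NQ       = frontier & ~VISITED
using Bitmap = std::vector<uint64_t>;

inline bool test_bit(const Bitmap& b, int64_t v) {
    return (b[v >> 6] >> (v & 63)) & 1u;
}
inline void set_bit(Bitmap& b, int64_t v) {
    b[v >> 6] |= uint64_t(1) << (v & 63);
}

// One level; returns true if NQ is non-empty (the search continues).
bool spmv_level(const std::vector<int64_t>& row_starts,
                const std::vector<int64_t>& adjacency,
                const Bitmap& cq, const Bitmap& visited, Bitmap& nq) {
    std::fill(nq.begin(), nq.end(), 0);
    const int64_t n = (int64_t)row_starts.size() - 1;
    // "Multiply" the sparse adjacency matrix by the CQ vector: every
    // neighbor of a vertex whose CQ bit is set joins the edge frontier.
    for (int64_t u = 0; u < n; ++u) {
        if (!test_bit(cq, u)) continue;
        for (int64_t e = row_starts[u]; e < row_starts[u + 1]; ++e)
            set_bit(nq, adjacency[e]);
    }
    // Exclude visited vertices: NQ = frontier & ~VISITED.
    bool nonempty = false;
    for (size_t w = 0; w < nq.size(); ++w) {
        nq[w] &= ~visited[w];
        nonempty |= (nq[w] != 0);
    }
    return nonempty;
}
```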

(20)

Partitioning Methods for BFS

1D Partitioning BFS
The Graph500 reference implementations use 1D partitioning.
All-to-all communication is required to compute BFS.
All-to-all communication does not scale on large distributed environments.

2D Partitioning BFS [Andy, SC'05]
A scalable partitioning method for large distributed environments.
We used 2D partitioning.

[Figure: 1D partitioning vs. 2D partitioning of the adjacency matrix]

[Andy, SC'05] Andy Yoo et al., A Scalable Distributed Parallel Breadth-First Search Algorithm on BlueGene/L, SC '05.

(21)

Outline

Background

Breadth-First Search and Distributed Algorithm

2D Partitioning based Breadth-First Search

Optimizations

Performance Evaluation

Conclusions

(22)

2D Partitioning [Andy, SC'05]

2D partitioning for BFS [Andy, SC'05] has been proposed to avoid all-to-all communication.
Assuming we have 4 processors, the method introduces the notion of a "processor mesh" that arranges the processors in two dimensions.
The adjacency matrix is partitioned into 4×2 blocks, and the blocks are assigned to the processors as illustrated below.

[Figure: the processor mesh and the 2D-partitioned adjacency matrix]

[Andy, SC'05] Andy Yoo et al., A Scalable Distributed Parallel Breadth-First Search Algorithm on BlueGene/L, SC '05.

(23)

2D Partitioning
- Assigning each vertex block to each processor

With 2D partitioning, the four data structures (the adjacency matrix, CQ, VISITED, and NQ) are partitioned and assigned to the processors as illustrated below.

[Figure: the adjacency matrix, CQ, VISITED, and NQ over the vertices S to H, partitioned across the processor mesh]

(24)

2D Partitioning
- Obtaining the Edge Frontier

The edge frontier is obtained by multiplying the adjacency (sparse) matrix by the CQ vector.
This requires the processors in each column of the processor mesh to exchange CQ so that all of them hold the same copy of CQ (the "Expand" phase, described later).

[Figure: the partitioned adjacency matrix multiplied by CQ across the processor mesh]

(25)

2D Partitioning
- Excluding Visited Vertices from the Edge Frontier

After identifying the edge frontier, NQ is obtained by excluding visited vertices from the edge frontier.
This phase also requires communication among the processors in each row of the processor mesh.

[Figure: VISITED is excluded from the edge frontier to produce NQ, partitioned across the processor mesh]

(26)

Communication of 2D Partitioning

There are 2 communication phases at each level.
First phase, Expand: each processor gathers CQ from the processors in a column of the processor mesh.
Second phase, Fold: processors exchange the edge frontier within a row of the processor mesh.
The number of nodes involved in each communication is smaller than with 1D partitioning, which requires all-to-all communication.

[Figure: 1D partitioning (all-to-all) vs. 2D partitioning (Expand within columns, Fold within rows)]
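A hedged sketch of how the two phases could be set up with MPI communicators is shown below: Expand gathers CQ within a column of the processor mesh (a later slide notes that Expand is a single MPI_Allgather call), and Fold exchanges edge-frontier data within a row. The mesh layout, buffer sizes, and the use of a collective for Fold are simplifying assumptions; the optimized Fold described later uses asynchronous point-to-point messages instead.

```cpp
#include <mpi.h>
#include <cstdint>
#include <vector>

// Sketch: column and row communicators for Expand and Fold on an R x C mesh.
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Assume rank = row * C + col, with C taken as floor(sqrt(size)).
    int C = 1;
    while ((C + 1) * (C + 1) <= size) ++C;
    int row = rank / C, col = rank % C;

    MPI_Comm col_comm, row_comm;
    MPI_Comm_split(MPI_COMM_WORLD, col, row, &col_comm);  // same mesh column
    MPI_Comm_split(MPI_COMM_WORLD, row, col, &row_comm);  // same mesh row

    // Expand: gather the CQ bitmap pieces so that every processor in the
    // column holds the same copy of CQ.
    int col_size; MPI_Comm_size(col_comm, &col_size);
    std::vector<uint64_t> cq_local(1024, 0), cq_full((size_t)1024 * col_size, 0);
    MPI_Allgather(cq_local.data(), 1024, MPI_UINT64_T,
                  cq_full.data(), 1024, MPI_UINT64_T, col_comm);

    // Fold: exchange edge-frontier data among the processors in the row
    // (shown as a fixed-size all-to-all only for brevity).
    int row_size; MPI_Comm_size(row_comm, &row_size);
    std::vector<int64_t> send((size_t)64 * row_size, 0), recv((size_t)64 * row_size, 0);
    MPI_Alltoall(send.data(), 64, MPI_INT64_T,
                 recv.data(), 64, MPI_INT64_T, row_comm);

    MPI_Comm_free(&col_comm);
    MPI_Comm_free(&row_comm);
    MPI_Finalize();
    return 0;
}
```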

(27)

Outline

Background

Breadth-First Search and Distributed Algorithm

2D Partitioning based Breadth-First Search

Optimizations

Performance Evaluation

Conclusions

(28)

Our Proposed Approach

Our proposed approach is based on 2D partitioning and further optimized with the following three methods to achieve a higher performance score:

1. Compressed Communication
2. Parallelizing Sender and Receiver Processing
3. Vertex Sorting for Improving Cache Hit Rate

(29)

Our Optimized Algorithm

[Figure: overview of the optimized algorithm, showing the Expand phase and the Fold phase]

(30)

Compressed Communication

[Figure: where compressed communication is applied in the Fold phase]

(31)

Different Approach from the Original 2D Partitioning

Original 2D partitioning [Andy, SC'05]:
They use a torus network.
Their method reduces duplicate requests on each processor in the Fold operation and decreases the total message volume.
However, reducing duplicate requests has a high computation cost.

We decided not to use this technique due to its high computation cost. Therefore, we need to reduce the communication data with another method.

[Figure: the torus network and duplicate-request reduction in the original method]

(32)

Compressed Communication

We introduce compressed communication to reduce the volume of communication data.
By scanning CQ in order of vertex ID, the vertices in the edge frontier are produced in ascending order.
We encode the difference from the prior vertex with a Variable-Length Quantity (VLQ), the encoding used by the standard MIDI file format, Google's protocol buffers, etc.
This method can compress a 64-bit value down to 8 bits if the value is small enough.

Example (data sent from a source processor to a destination processor):
Original vertex IDs:   100, 103, 303, 383, 123839, ...
Differences:           3, 300, 80, ...
VLQ (hexadecimal):     03; AC 02; 50; ...
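As an illustration of the encoding, the sketch below VLQ-encodes the differences between ascending vertex IDs and decodes them back. The 7-bits-per-byte, low-order-group-first layout with a 0x80 continuation bit follows the protocol-buffers style and matches the hexadecimal example above (300 encodes as AC 02); the exact byte layout used in the real implementation is an assumption.

```cpp
#include <cstdint>
#include <vector>

// Append the VLQ encoding of one value to the output buffer.
void vlq_encode(uint64_t value, std::vector<uint8_t>& out) {
    while (value >= 0x80) {
        out.push_back(uint8_t(value) | 0x80);  // 7 payload bits + continuation
        value >>= 7;
    }
    out.push_back(uint8_t(value));             // final group, no continuation
}

// Encode an ascending list of vertex IDs as VLQ-coded deltas.
std::vector<uint8_t> encode_deltas(const std::vector<uint64_t>& sorted_ids) {
    std::vector<uint8_t> out;
    uint64_t prev = 0;
    for (uint64_t id : sorted_ids) {
        vlq_encode(id - prev, out);  // small deltas shrink 64-bit IDs to 1 byte
        prev = id;
    }
    return out;
}

// Decode back to vertex IDs (receiver side of the Fold phase).
std::vector<uint64_t> decode_deltas(const std::vector<uint8_t>& buf) {
    std::vector<uint64_t> ids;
    uint64_t prev = 0, value = 0;
    int shift = 0;
    for (uint8_t b : buf) {
        value |= uint64_t(b & 0x7F) << shift;
        if (b & 0x80) { shift += 7; continue; }  // more groups follow
        prev += value;                           // undo the delta coding
        ids.push_back(prev);
        value = 0; shift = 0;
    }
    return ids;
}
```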

(33)

Parallelizing Sender and Receiver Processing

(34)

Parallelizing Sender and Receiver Processing

Approach of prior art [Aydin, SC'11]: after the entire edge frontier is inspected and stored into memory, it is exchanged among the processors.
This is inefficient: it requires a large memory space, and communication and computation cannot be overlapped.

We divide the processing of the Fold phase into senders and receivers and parallelize them like stream processing.
We also use asynchronous communication in the Fold phase for efficient processing.

Sender processing extracts edges emanating from the vertices in CQ.
Receiver processing decodes edges and updates the BFS tree and NQ.

[Figure: sender and receiver pipelines exchanging the edge frontier via asynchronous communication]
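A hedged sketch of the sender side of this pipeline is shown below: the next chunk of the edge frontier is extracted while the previous chunk is still in flight, using non-blocking MPI sends and double buffering. The chunking scheme and the extract_chunk helper are illustrative assumptions, not the actual Fold-phase code.

```cpp
#include <mpi.h>
#include <cstdint>
#include <vector>

// Placeholder for sender-side work: in the real algorithm this would scan
// part of CQ, walk the adjacency lists, and VLQ-encode edge-frontier deltas.
static size_t extract_chunk(std::vector<uint8_t>& buf) {
    return buf.size();  // pretend the buffer was filled
}

// Overlap extraction (compute) with asynchronous sends (communication).
void fold_sender(int peer, int num_chunks, MPI_Comm comm) {
    const size_t kChunkBytes = 1 << 20;
    std::vector<uint8_t> buf[2] = {std::vector<uint8_t>(kChunkBytes),
                                   std::vector<uint8_t>(kChunkBytes)};
    MPI_Request req[2] = {MPI_REQUEST_NULL, MPI_REQUEST_NULL};

    for (int i = 0; i < num_chunks; ++i) {
        int slot = i & 1;
        // Wait for the previous send from this buffer before reusing it.
        MPI_Wait(&req[slot], MPI_STATUS_IGNORE);
        size_t n = extract_chunk(buf[slot]);          // sender-side compute
        MPI_Isend(buf[slot].data(), (int)n, MPI_BYTE,
                  peer, /*tag=*/i, comm, &req[slot]); // asynchronous send
        // While this chunk is in flight, the next iteration extracts the
        // following chunk into the other buffer (double buffering).
    }
    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
}
```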

(35)

Vertex Sorting for Improving Cache Hit Rate

Vertex sorting increases the locality of accesses to VISITED.

(36)

Vertex Sorting for Improving Cache Hit Rate

Receiver processing generates irregular memory accesses to VISITED, which wastes CPU time waiting on memory.
VISITED is represented as a bitmap.
The frequency of access to each part of VISITED is determined by the vertex degree (the number of edges of each vertex).
We sort the vertices in decreasing order of degree to increase the locality of memory accesses.
The degree distribution of a Kronecker graph follows a power law, so after sorting, most accesses concentrate on a small, cache-friendly prefix of the VISITED bitmap.

[Figure: degree distribution of the Kronecker graph, and the VISITED bitmap before and after sorting]
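A minimal sketch of the relabeling step is shown below: given per-vertex degrees, it produces new vertex IDs in decreasing order of degree, so the most frequently accessed vertices map to the lowest positions of the VISITED bitmap. Names and the relabeling interface are illustrative assumptions.

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

// Relabel vertices so that new ID 0 is the highest-degree vertex, ID 1 the
// next, and so on. The returned vector maps old vertex ID -> new vertex ID.
std::vector<int64_t> degree_sorted_relabel(const std::vector<int64_t>& degree) {
    const int64_t n = (int64_t)degree.size();
    std::vector<int64_t> order(n);
    std::iota(order.begin(), order.end(), 0);
    // Sort the original vertex IDs by decreasing degree.
    std::sort(order.begin(), order.end(),
              [&](int64_t a, int64_t b) { return degree[a] > degree[b]; });
    // new_id[old vertex] = position in the sorted order.
    std::vector<int64_t> new_id(n);
    for (int64_t pos = 0; pos < n; ++pos) new_id[order[pos]] = pos;
    return new_id;  // used to renumber vertices in the CSR structure
}
```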

(37)

Implementation Detail

Programming Model: MPI + OpenMP

CQ, NQ, VISITED: implemented as bitmaps.
A vertex is represented by one bit of the bitmap.
If a bit is set, the corresponding vertex is contained in the queue; if not, it is not contained.

Expand: a single call of MPI_Allgather().

Adjacency matrix: stored using CSR (Compressed Sparse Row) as the memory layout.

[Figure: a bitmap such as ...0001010..., one bit per vertex]
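For reference, a small sketch of building the CSR layout from an undirected edge list (the Kernel 1 conversion mentioned in the benchmark flow) is shown below. Storing each undirected edge in both directions and the exact field names are assumptions, not the implementation's actual data layout.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

struct Csr {
    std::vector<int64_t> row_starts;  // size = num_vertices + 1
    std::vector<int64_t> adjacency;   // size = 2 * num_edges
};

Csr build_csr(int64_t num_vertices,
              const std::vector<std::pair<int64_t, int64_t>>& edges) {
    Csr g;
    g.row_starts.assign(num_vertices + 1, 0);
    // First pass: count the degree of each vertex.
    for (auto [u, v] : edges) { ++g.row_starts[u + 1]; ++g.row_starts[v + 1]; }
    // Prefix sum turns per-vertex counts into row start offsets.
    for (int64_t i = 0; i < num_vertices; ++i)
        g.row_starts[i + 1] += g.row_starts[i];
    // Second pass: scatter neighbors into the adjacency array.
    g.adjacency.resize(g.row_starts[num_vertices]);
    std::vector<int64_t> cursor(g.row_starts.begin(), g.row_starts.end() - 1);
    for (auto [u, v] : edges) {
        g.adjacency[cursor[u]++] = v;   // undirected: store both directions
        g.adjacency[cursor[v]++] = u;
    }
    return g;
}
```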

(38)

Outline

Background

Breadth-First Search and Distributed Algorithm

2D Partitioning based Breadth-First Search

Optimizations

Performance Evaluation

Conclusions

(39)

TSUBAME 2.0 Specification

CPU: Intel Westmere EP (Xeon X5670, L2 cache: 256 KB, L3: 12 MB), 2.93 GHz, 6 CPU cores (12 with Hyper-Threading) x 2 sockets per node (24 threads)
RAM: 54 GB
OS: SUSE Linux Enterprise 11 (Linux kernel 2.6.32)
# of Total Nodes: 1442 nodes (we only used up to 512 nodes)
Network Topology: full-bisection fat-tree topology
Network: Voltaire / Mellanox dual-rail QDR InfiniBand (40 Gbps x 2 = 80 Gbps)
GPGPU: three NVIDIA Fermi M2050 GPUs (not used)
Software: GCC 4.3.4 (-O3 option)

(40)

Performance Evaluation
- Comparison with Reference Implementations

We compare our optimized method with the reference implementations (Simple, Rep-CSR, and Rep-CSC).
The reference implementations do not run at large scale due to errors.
Our optimized implementation is fast and scalable.

[Figure: TEPS vs. number of nodes. The optimized implementation reaches about 11.3, 21.3, 37.2, 63.5, and 99.0 GTEPS at successively larger node counts up to 1024 nodes; the Reference Simple and Reference Rep-CSR implementations are shown only up to 128 nodes]

(41)

Performance Evaluation
- Effect of Optimizations

Compressed communication greatly boosts the performance.
The overall communication data volume with compressed communication is 1/3 of that of the base version, which does not use the compression method.

[Figure: left, performance as TEPS per node (MTEPS) for Reference-Simple, Base, Compressed Communication, and Compress + Vertex Sorting; right, aggregated message size per node (GB) for Base, Compressed Communication, and the replicated reference]

(42)

Performance Evaluation
- Breakdown of Execution Time

There is communication wait time when the number of nodes is larger than 16.
Because communication and computation are completely overlapped in our implementation, the performance of our optimized implementation is limited by communication bandwidth.
This is why vertex sorting does not show any effect with more than 16 nodes.

[Figure: left, breakdown of the overall execution time (expand, compute, wait-comm, wait-synch) for 1 to 512 nodes; right, performance as TEPS per node (MTEPS) for Reference-Simple, Base, Compressed Communication, and Compress + Vertex Sorting]

(43)

Related Work

BFS on Distributed Memory Systems
Andy Yoo et al. A Scalable Distributed Parallel Breadth-First Search Algorithm on BlueGene/L. SC '05.
Aydin Buluc and Kamesh Madduri. Parallel Breadth-First Search on Distributed Memory Systems. SC '11.
They compared the performance of 1D and 2D partitioning methods.

BFS on GPUs
Duane Merrill, Michael Garland, and Andrew Grimshaw. Scalable GPU Graph Traversal. PPoPP '12.

BFS on Multi-Socket CPU Systems
Jatin Chhugani et al. Fast and Efficient Graph Traversal Algorithm for CPUs: Maximizing Single-Node Efficiency. IPDPS 2012.

BFS on Shared Memory Systems
David A. Bader and Kamesh Madduri. Designing Multithreaded Algorithms for Breadth-First Search and st-connectivity on the Cray MTA-2. ICPP '06.

(44)

Conclusions

We proposed an optimized implementation of the Graph500 benchmark on a large-scale distributed-memory environment.
Our optimized implementation is based on level-synchronized BFS with 2D partitioning, and we proposed optimization methods such as communication compression and vertex sorting.
Our implementation can solve BFS on a large-scale graph with 68.7 billion vertices and 1.1 trillion edges in 10.58 seconds using 1366 nodes and 16K CPU cores, which corresponds to 103.9 GTEPS.
Because the Graph500 result is the median TEPS among 64 BFS executions, our official record is 100 GTEPS, which is ranked No. 3 on the Nov 2011 list.

Recent Results
We used GPGPUs for Graph500 on TSUBAME 2.0 and achieved 317 GTEPS, which is ranked No. 4 on the Jun 2012 list.
