Highly Scalable Graph Search for the Graph500 Benchmark
Koji Ueno and Toyotaro Suzumura
Outline
Background
Breadth-First Search and Distributed Algorithm
2D Partitioning based Breadth-First Search
Optimizations
Performance Evaluation
Conclusions
Outline
Background
Breadth-First Search and Distributed Algorithm
2D Partitioning based Breadth-First Search
Optimizations
Performance Evaluation
Conclusions
Large-Scale Graph Mining is Everywhere
Examples: Internet maps, symbolic networks (e.g. the human brain), protein interactions, social networks, cybersecurity, medical informatics, and data enrichment.
New Graph Search Based Benchmark for Ranking Supercomputers
BFS (Breadth First Search) from a single vertex on a static, undirected Kronecker graph with average vertex degree 32.
Reference MPI, OpenMP and shared memory implementations are provided.
Graph500 (http://www.graph500.org)
Benchmark Flow
1. Graph (Edge List) Generation
2. Graph Construction (Kernel 1): conversion to any space-efficient format such as CSR or CSC
3. Sampling 64 Search Keys
4. Breadth First Search (Kernel 2, the main kernel) and Validation of the BFS tree (5 rules), iterated 64 times
Evaluation criteria: TEPS (Traversed Edges Per Second) and the problem size that can be solved on the system.
In this presentation, 1 GTEPS = 1,000,000,000 TEPS.
Problem Size: SCALE. # of Vertices = 2^SCALE, # of Edges = 2^(SCALE+4).
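As a quick sanity check of these sizes (a sketch only; the 10.58-second BFS time is taken from the conclusions of this talk):

#include <cstdint>
#include <cstdio>

int main() {
    // Graph500 problem size: 2^SCALE vertices, 2^(SCALE+4) edges (edge factor 16).
    int scale = 36;                            // the SCALE used on TSUBAME in this talk
    uint64_t vertices = 1ULL << scale;         // ~68.7 billion
    uint64_t edges    = 1ULL << (scale + 4);   // ~1.1 trillion
    // TEPS = traversed edges per second; 1.1 trillion edges in 10.58 s is ~104 GTEPS.
    double gteps = (double)edges / 10.58 / 1e9;
    std::printf("vertices=%llu edges=%llu GTEPS=%.1f\n",
                (unsigned long long)vertices, (unsigned long long)edges, gteps);
    return 0;
}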
Graph500 (http://www.graph500.org)
Rank | Machine | Owner | Problem Size | GTEPS
1 | NNSA/SC Blue Gene/Q Prototype II (4096 nodes / 65,536 cores) | NNSA and IBM Research, T.J. Watson | 32 | 254.349
2 | Lomonosov (4096 nodes / 32,768 cores) | Moscow State University | 37 | 103.251
3 | TSUBAME (2732 processors / 1366 nodes / 16,392 CPU cores) | GSIC Center, Tokyo Institute of Technology | 36 | 100.366
4 | Jugene (65,536 nodes) | Forschungszentrum Jülich | 37 | 92.8769
5 | Intrepid (32,768 nodes / 131,072 cores) | ANL | 35 | 78.8699
(Nov 2011 List)
Outline
Background
Breadth-First Search and Distributed Algorithm
2D Partitioning based Breadth-First Search
Optimizations
Performance Evaluation
Conclusions
Breadth-First Search
Outputs for each vertex:
Predecessor – a parent vertex in BFS tree
Depth – distance from source vertex
Level Synchronized BFS [David ICPP‘06]
Base algorithm of all reference implementations.
“Level Synchronized” means all vertices of a particular level are inspected before the next level.
[Figure: an example graph and its BFS tree from source vertex S, with depth levels 0, 1 and 2]
Two queues are employed for level synchronized BFS: the Current Queue (CQ) and the Next Queue (NQ).
We start BFS by putting the source vertex into CQ.
Level Synchronized BFS
Demonstration (1/5)
[Figure: the source vertex S is placed in CQ; NQ is empty; the BFS tree contains only S at level 0]
Neighbor vertices of the vertices in CQ are queued into NQ and added to the BFS tree.
Level Synchronized BFS
Demonstration (2/5)
[Figure: CQ holds S; its neighbors B, C and D are queued into NQ and added to the BFS tree at level 1]
Level Synchronized BFS
Demonstration (3/5)
After NQ and CQ are swapped, the next level is processed.
[Figure: CQ now holds B, C and D; their unvisited neighbors E and F are queued into NQ and added to the BFS tree at level 2; a duplicate request occurs when two vertices in CQ share the same neighbor]
Level Synchronized BFS
Demonstration (4/5)
[Figure: CQ now holds E and F; their unvisited neighbors G and H are queued into NQ and added to the BFS tree at level 3]
If NQ is empty, we finish BFS.
Level Synchronized BFS
Demonstration (5/5)
[Figure: CQ holds G and H, which have no unvisited neighbors; NQ stays empty, so the BFS terminates with the complete BFS tree]
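The demonstration above corresponds to the following minimal, single-node sketch of level-synchronized BFS (illustrative names such as adj; this is not the reference code or the distributed implementation):

#include <cstdint>
#include <vector>

// Level-synchronized BFS with a Current Queue (CQ) and a Next Queue (NQ).
// Outputs per vertex: pred (parent in the BFS tree) and depth (distance from source).
void bfs(const std::vector<std::vector<int64_t>>& adj, int64_t source,
         std::vector<int64_t>& pred, std::vector<int64_t>& depth) {
    const int64_t n = (int64_t)adj.size();
    pred.assign(n, -1);
    depth.assign(n, -1);
    std::vector<int64_t> cq, nq;
    cq.push_back(source);                 // start by putting the source vertex into CQ
    pred[source] = source;
    depth[source] = 0;
    int64_t level = 0;
    while (!cq.empty()) {                 // one iteration per level
        for (int64_t u : cq) {
            for (int64_t v : adj[u]) {
                if (pred[v] == -1) {      // unvisited neighbor: add to BFS tree and NQ
                    pred[v] = u;
                    depth[v] = level + 1;
                    nq.push_back(v);
                }
            }
        }
        cq.swap(nq);                      // swap CQ and NQ, then process the next level
        nq.clear();
        ++level;                          // BFS finishes when NQ (now CQ) is empty
    }
}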
Representation as Sparse Matrix Vector Multiplication (SpMV)
Level Synchronized BFS can be solved by representing it as Sparse Matrix Vector multiplication (SpMV) for distributed computing.
The processing of level 2 in the previous example is illustrated as follows.
[Figure: illustration of level 2 of the BFS search, showing the graph together with CQ and NQ]
Representation as SpMV
- Demonstration (1/4)
Example: Processing level 2
[Figure: the adjacency matrix (sparse matrix) is multiplied by the CQ vector, and the VISITED vector is subtracted to give NQ]
Representation as SpMV
- Demonstration (2/4)
“Edge Frontier” is all edges emanating from the vertices of CQ.
The edge frontier is obtained by multiplying the adjacency matrix with CQ.
[Figure: multiplying the adjacency matrix with the CQ vector (B, C and D set) produces the edge frontier]
Representation as SpMV
- Demonstration (3/4)
Then we get the neighbor vertices.
NQ is computed by excluding visited vertices (VISITED) from the neighbor vertices.
[Figure: excluding the VISITED vertices from the edge frontier leaves NQ = {E, F}]
Representation as SpMV
- Demonstration (4/4)
[Figure: the complete picture for level 2: adjacency matrix times CQ, minus VISITED, gives NQ]
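One level of this SpMV formulation can be sketched on a single node as follows, assuming CQ, NQ and VISITED are bitmaps with one bit per vertex (an illustration of the formulation only, not the distributed kernel):

#include <algorithm>
#include <cstdint>
#include <vector>

// One BFS level as "sparse matrix times bit vector":
//   edge frontier = A * CQ,   NQ = edge frontier AND NOT VISITED.
// CQ, VISITED and NQ are bitmaps of equal length (one bit per vertex).
void spmv_level(const std::vector<std::vector<int64_t>>& adj,
                const std::vector<uint64_t>& cq,
                const std::vector<uint64_t>& visited,
                std::vector<uint64_t>& nq) {
    auto test = [](const std::vector<uint64_t>& bm, int64_t v) {
        return (bm[v >> 6] >> (v & 63)) & 1ULL;
    };
    std::fill(nq.begin(), nq.end(), 0ULL);
    for (int64_t u = 0; u < (int64_t)adj.size(); ++u) {
        if (!test(cq, u)) continue;          // only rows selected by CQ contribute
        for (int64_t v : adj[u]) {           // edges emanating from CQ: the edge frontier
            if (!test(visited, v))           // exclude already visited vertices
                nq[v >> 6] |= 1ULL << (v & 63);
        }
    }
}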
Partitioning Methods for BFS
1D Partitioning BFS
The Graph500 reference implementations use 1D partitioning.
All-to-all communication is required to compute BFS.
All-to-all communication is not scalable on large distributed environments.
2D Partitioning BFS [Andy, SC’05]
A scalable partitioning method for large distributed environments.
We used 2D partitioning.
[Andy, SC’05] Andy Yoo et al. A Scalable Distributed Parallel Breadth-First Search Algorithm on BlueGene/L. SC ’05.
[Figure: 1D partitioning splits the adjacency matrix by rows; 2D partitioning splits it into a 2D grid of blocks]
Outline
Background
Breadth-First Search and Distributed Algorithm
2D Partitioning based Breadth-First Search
Optimizations
Performance Evaluation
Conclusions
2D Partitioning [Andy, SC’05]
2D Partitioning for BFS [Andy, SC’05] has been proposed to avoid all-to-all communication.
Assuming that we have 4 processors, the method introduces the notion of a “Processor Mesh” that arranges the processors in 2D.
The adjacency matrix is partitioned into 4×2 blocks and assigned to the processors in the following manner.
[Figure: the processor mesh and the 2D partitioned adjacency matrix, with each block assigned to one processor]
[Andy, SC’05] Andy Yoo et al. A Scalable Distributed Parallel Breadth-First Search Algorithm on BlueGene/L. SC ’05.
2D Partitioning
- Assigning each vertex block to each processor
With 2D partitioning, the four data structures (adjacency matrix, CQ, VISITED and NQ) are partitioned and assigned to the processors in the following manner.
[Figure: the adjacency matrix blocks and the corresponding sections of CQ, VISITED and NQ, each assigned to a processor of the mesh]
2D Partitioning
- Obtaining Edge Frontier
The edge frontier is obtained by multiplying the adjacency (sparse) matrix with the CQ vector.
This requires each processor in a processor column of the processor mesh to exchange CQ so that all of them have the same copy of CQ (called “Expand”, described later).
[Figure: each processor multiplies its block of the adjacency matrix by its copy of CQ]
2D Partitioning
- Excluding Visited vertices from edge frontier
After identifying the edge frontier, NQ is obtained by excluding visited vertices from the edge frontier.
This phase also requires communication among the processors in a processor row of the processor mesh.
[Figure: each processor subtracts its section of VISITED from the edge frontier to obtain its section of NQ]
Communication of 2D Partitioning
There are 2 communication phases at each level.
First Phase: Expand
Each processor gathers CQ from the processors in a column of the processor mesh.
Second Phase: Fold
Processors exchange the edge frontier within a row of the processor mesh.
The number of nodes involved in each communication becomes smaller than with 1D partitioning, which requires all-to-all communication.
[Figure: 1D partitioning uses all-to-all communication, while 2D partitioning uses Expand within columns and Fold within rows]
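A hedged MPI sketch of how these two phases can be set up: ranks are split into row and column communicators of the processor mesh, Expand gathers CQ within a column, and Fold exchanges frontier data within a row. The mesh width C, the buffer sizes, and the plain MPI_Alltoall used for Fold are illustrative assumptions, not the exact code (the optimized Fold is asynchronous and compressed, as described later).

#include <mpi.h>
#include <cstdint>
#include <vector>

// Illustrative setup of the two communication phases of 2D-partitioned BFS.
// Expand: allgather CQ within a column of the processor mesh.
// Fold:   exchange edge-frontier data within a row of the mesh.
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int C = 2;                  // assumed mesh width (number of columns)
    int row = rank / C, col = rank % C;
    MPI_Comm row_comm, col_comm;
    MPI_Comm_split(MPI_COMM_WORLD, row, col, &row_comm);  // processors in the same row
    MPI_Comm_split(MPI_COMM_WORLD, col, row, &col_comm);  // processors in the same column

    // Expand: every processor in a column obtains the same copy of CQ.
    int col_size;
    MPI_Comm_size(col_comm, &col_size);
    std::vector<uint64_t> cq_local(1024, 0), cq_column(1024 * col_size);
    MPI_Allgather(cq_local.data(), 1024, MPI_UINT64_T,
                  cq_column.data(), 1024, MPI_UINT64_T, col_comm);

    // Fold: frontier data is exchanged only within a row
    // (shown here as a synchronous all-to-all for simplicity).
    int row_size;
    MPI_Comm_size(row_comm, &row_size);
    std::vector<uint64_t> send(1024 * row_size, 0), recv(1024 * row_size);
    MPI_Alltoall(send.data(), 1024, MPI_UINT64_T,
                 recv.data(), 1024, MPI_UINT64_T, row_comm);

    MPI_Comm_free(&row_comm);
    MPI_Comm_free(&col_comm);
    MPI_Finalize();
    return 0;
}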
Outline
Background
Breadth-First Search and Distributed Algorithm
2D Partitioning based Breadth-First Search
Optimizations
Performance Evaluation
Conclusions
Our Proposed Approach
Our proposed approach is based upon 2D partitioning and further optimized with the following 3 methods to achieve a higher performance score:
1. Compressed Communication
2. Parallelizing Sender and Receiver Processing
3. Vertex Sorting for Improving Cache Hit Rate
Our Optimized Algorithm
[Figure: the optimized algorithm, alternating Expand and Fold phases at each level]
Compressed Communication
Compressed Communication in the Fold Phase
A different approach from the original 2D partitioning.
Original 2D partitioning [Andy, SC’05]:
They use a torus network.
Their method reduces duplicate requests on each processor in the Fold operation and decreases the total message volume.
However, reducing duplicate requests has a high computation cost.
We decided not to use this technique due to its high computation cost. Therefore, we need to reduce the communication data with another method.
[Figure: the torus network and the reduction of duplicate requests at each processor]
Compressed Communication
We introduce compressed communication to reduce the communication data.
By checking CQ in order of vertex ID, the vertices in the edge frontier will be sorted in ascending order.
We encode the difference from the prior vertex with Variable-Length Quantity (VLQ), which is used by the standard MIDI file format, Google’s protocol buffers, etc.
This method can compress a 64-bit value to 8 bits if the value is small enough.
[Table: original vertex IDs 100, 103, 303, 383, 123839, …; differences 3, 300, 80, …; VLQ encoding (hexadecimal) 03, AC 02, 50]
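A small sketch of the encoder (an assumed helper, not the authors' code); with differences 3, 300 and 80 it produces exactly the bytes 03, AC 02 and 50 shown in the example:

#include <cstdint>
#include <vector>

// Delta + VLQ encoding of an ascending list of vertex IDs:
// write only the difference from the previous ID, 7 bits per byte,
// with the high bit of each byte used as a continuation flag.
void encode_frontier(const std::vector<uint64_t>& sorted_ids,
                     std::vector<uint8_t>& out) {
    uint64_t prev = 0;
    for (uint64_t id : sorted_ids) {
        uint64_t delta = id - prev;          // small because IDs are sorted
        prev = id;
        while (delta >= 0x80) {              // emit 7 bits at a time
            out.push_back((uint8_t)(delta | 0x80));
            delta >>= 7;
        }
        out.push_back((uint8_t)delta);       // last byte: continuation bit clear
    }
}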
Parallelizing Sender and Receiver Processing
Prior approach [Aydin, SC’11]: after the entire edge frontier is inspected and stored into memory, it is exchanged among the processors.
Inefficient: this requires a large memory space, and communication and computation cannot be overlapped.
We divide the processing of the Fold phase into senders and receivers and parallelize them like stream processing.
We also use asynchronous communication in the Fold phase for efficient processing.
Sender processing extracts edges emanating from the vertices in CQ.
Receiver processing decodes edges and updates the BFS tree and NQ.
[Figure: the edge frontier flows from sender processing through asynchronous communication to receiver processing]
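One possible shape of this sender/receiver overlap, sketched with asynchronous MPI on a row communicator; the single chunk per peer, the tag, and the decode_and_update step are simplifications for illustration (the real implementation streams many chunks and also uses OpenMP threads):

#include <mpi.h>
#include <cstdint>
#include <vector>

// Overlapping sender and receiver processing in the Fold phase.
void fold_step(MPI_Comm row_comm,
               const std::vector<std::vector<uint8_t>>& compressed_out) {
    int nprocs;
    MPI_Comm_size(row_comm, &nprocs);
    std::vector<MPI_Request> reqs;

    // Sender side: post asynchronous sends of the encoded edge frontier.
    for (int p = 0; p < nprocs; ++p) {
        reqs.emplace_back();
        MPI_Isend(compressed_out[p].data(), (int)compressed_out[p].size(),
                  MPI_BYTE, p, /*tag=*/0, row_comm, &reqs.back());
    }

    // Receiver side: while the sends are in flight, decode whatever arrives
    // and update NQ and the BFS tree (decode_and_update is illustrative).
    for (int received = 0; received < nprocs; ++received) {
        MPI_Status st;
        MPI_Probe(MPI_ANY_SOURCE, 0, row_comm, &st);
        int bytes;
        MPI_Get_count(&st, MPI_BYTE, &bytes);
        std::vector<uint8_t> buf(bytes);
        MPI_Recv(buf.data(), bytes, MPI_BYTE, st.MPI_SOURCE, 0, row_comm,
                 MPI_STATUS_IGNORE);
        // decode_and_update(buf);  // decode VLQ deltas, set NQ bits, set predecessors
    }
    MPI_Waitall((int)reqs.size(), reqs.data(), MPI_STATUSES_IGNORE);
}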
Vertex Sorting for Improving Cache Hit Rate
Vertex sorting increases the locality of accesses to VISITED.
Receiver processing generates irregular memory accesses to VISITED, which wastes CPU time waiting on memory.
VISITED is represented as a bitmap.
The frequency of access to VISITED is determined by the vertex degree (the number of edges of each vertex).
We sort the vertices in decreasing order of degree to increase the locality of memory access.
The degree distribution of the Kronecker graph follows a power law.
[Figure: the power-law degree distribution of the Kronecker graph, and the VISITED bitmap before and after sorting vertices by degree]
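A sketch of the relabeling idea (the degree array and helper name are illustrative): vertices are sorted by decreasing degree, so the frequently accessed VISITED bits of high-degree vertices cluster at the front of the bitmap and tend to stay in cache.

#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

// Relabel vertices so that high-degree vertices get the smallest new IDs.
// new_id[v] gives the relabeled ID of original vertex v.
std::vector<int64_t> degree_order(const std::vector<int64_t>& degree) {
    std::vector<int64_t> order(degree.size());
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),
              [&](int64_t a, int64_t b) { return degree[a] > degree[b]; });
    std::vector<int64_t> new_id(degree.size());
    for (int64_t i = 0; i < (int64_t)order.size(); ++i) new_id[order[i]] = i;
    return new_id;
}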
Implementation Detail
Programming Model: MPI + OpenMP
CQ, NQ, VISITED: implemented using bitmaps.
A vertex is represented by one bit of the bitmap: if the bit is set, the corresponding vertex is contained in the queue; if not, it is not contained.
Expand: a single call of MPI_Allgather().
Adjacency matrix: CSR (Compressed Sparse Row) is used as the memory layout.
[Figure: a bitmap such as …0001010… with one bit per vertex]
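A minimal sketch of such a bitmap (illustrative; not the actual data structure of the implementation):

#include <algorithm>
#include <cstdint>
#include <vector>

// Bitmap queue: one bit per local vertex; a set bit means the vertex is in the queue.
struct Bitmap {
    std::vector<uint64_t> words;
    explicit Bitmap(int64_t nvertices) : words((nvertices + 63) / 64, 0) {}
    void set(int64_t v)        { words[v >> 6] |= 1ULL << (v & 63); }
    bool test(int64_t v) const { return (words[v >> 6] >> (v & 63)) & 1ULL; }
    void clear()               { std::fill(words.begin(), words.end(), 0ULL); }
};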
Outline
Background
Breadth-First Search and Distributed Algorithm
2D Partitioning based Breadth-First Search
Optimizations
Performance Evaluation
Conclusions
TSUBAME 2.0
Specification
CPU: Intel Westmere EP (Xeon X5670, L2 cache: 256 KB, L3: 12 MB), 2.93 GHz, 6 CPU cores (12 with Hyper-Threading) x 2 sockets per node (24 threads)
RAM: 54 GB
OS: SUSE Linux Enterprise 11 (Linux kernel 2.6.32)
# of Total Nodes: 1442 nodes (we only used up to 512 nodes)
Network Topology: full-bisection fat-tree topology
Network: Voltaire / Mellanox dual-rail QDR InfiniBand (40 Gbps x 2 = 80 Gbps)
GPGPU: three NVIDIA Fermi M2050 GPUs (not used)
Software
GCC: GCC 4.3.4 (-O3 option)
Performance Evaluation
- Comparison with Reference Implementations
We compare our optimized method with the reference implementations (Simple, Rep-CSR, and Rep-CSC).
The reference implementations do not run at large scale due to errors.
Our optimized implementation is fast and scalable.
[Figure: TEPS vs. # of nodes. The optimized implementation scales from about 11.3 GTEPS to 99 GTEPS as the node count grows, while the reference Simple and Rep-CSR implementations only run at small node counts and achieve far lower TEPS]
Performance Evaluation
- Effect of Optimizations
Compressed communication greatly boosts the performance.
The overall communication data volume with compressed communication is 1/3 of the base version that does not adopt the compression method.
[Figure: (left) performance as TEPS per node (MTEPS) for Reference-Simple, Base, Compressed Communication, and Compress + Vertex Sorting; (right) aggregated message size per node (GB) for Base, Compressed Communication, and the replicated reference]
Performance Evaluation
- Breakdown of Execution Time
There is communication wait time when the number of nodes is larger than 16.
Because communication and computation are completely overlapped in our implementation, the performance of our optimized implementation is limited by the communication bandwidth.
This is why vertex sorting does not show any effect with more than 16 nodes.
[Figure: (left) breakdown of the overall execution time (wait-synch, wait-comm, compute, expand) in seconds for 1 to 512 nodes; (right) performance as TEPS per node (MTEPS) for Reference-Simple, Base, Compressed Communication, and Compress + Vertex Sorting]
Related Work
BFS on Distributed Memory Systems
Andy Yoo et al. A Scalable Distributed Parallel Breadth-First Search Algorithm on BlueGene/L. SC ’05.
Aydin Buluc and Kamesh Madduri. Parallel Breadth-First Search on Distributed Memory Systems. SC ’11.
They compared the performance of 1D and 2D partitioning methods.
BFS on GPUs
Duane Merrill, Michael Garland, and Andrew Grimshaw. Scalable GPU Graph Traversal. PPoPP ’12.
BFS on Multi-Socket CPU Systems
Jatin Chhugani et al. Fast and Efficient Graph Traversal Algorithm for CPUs: Maximizing Single-Node Efficiency. IPDPS 2012.
BFS on Shared Memory Systems
David A. Bader and Kamesh Madduri. Designing Multithreaded Algorithms for Breadth-First Search and st-connectivity on the Cray MTA-2. ICPP ’06.
Conclusions
We proposed an optimized implementation of the Graph500 benchmark on a large-scale distributed memory environment.
Our optimized implementation is based on level-synchronized BFS with 2D partitioning, and we proposed several optimization methods such as communication compression and vertex sorting.
Our implementation can solve BFS on a large-scale graph with 68.7 billion vertices and 1.1 trillion edges in 10.58 seconds on 1366 nodes and 16K CPU cores, which corresponds to 103.9 GTEPS.
Because the Graph500 result is the median TEPS among 64 BFS executions, our record is 100 GTEPS, which is ranked No. 3 on the Nov 2011 list.
Recent Results
We used GPGPUs for Graph500 on TSUBAME 2.0 and obtained a performance of 317 GTEPS, which is ranked No. 4 on the Jun 2012 list.