Graph500 への挑戦鈴村研究室大規模データ処理・ストリームコンピューティング

(1)

Performance Characterization of Graph500

on Large Scale Distributed Environment

Toyotaro Suzumura

^1,2,4

, Koji Ueno

^1,4

,

Hitoshi Sato

^1,4

, Katsuki Fujisawa

^1,3

and Satoshi Matsuoka

^1,4

1 Tokyo Institute of Technology

2 IBM Research – Tokyo, ³Chuo University, ⁴JST CREST

2011 IEEE International Symposium on Workload Characterization

(2)

Outline

 Introduction to Graph500

 Parallel BFS Algorithm

 Graph500 Reference Implementations

 Performance Characterization of Graph500 on

TSUBAME 2.0

 Concluding Remarks

(3)

Large-Scale Graph Mining is Everywhere

Internet Map

[lumeta.com] ^{Food Web}_[Martinez

’_91]

Protein Interactions [genomebiology.com] Friendship Network

[Moody ^’01]

(4)

 New Graph Search Based Benchmark for

Ranking Supercomputers

 BFS (Breadth First Search) from a single

vertex on a static, undirected Kronecker

graph with average vertex degree 16.

 Evaluation criteria: TEPS (Traversed Edges

Per Second), and problem size that can be

solved on a system, minimum execution

time.

 Problem Size: ^SCALE

– # of Vertices 2

^SCALE

, # of Edges 2

^SCALE+4

 Reference MPI, shared memory

implementations provided.

Graph500 (http://www.graph500.org)

(5)

Graph500 Ranking in June 2011

SCALE

Top-ranked score with the

latest ranking rule

(6)

Motivation of Our Work

 To conduct detailed performance analysis of Graph500

reference implementations on the currently 5

^th

-ranked (June,

2011) TSUBAME 2.0 supercomputer.

 To provide detailed analysis to other researchers that targets

the high performance score on their supercomputers

(7)

Benchmark Flow

Graph (Edge List) Generation

Graph Construction

(Conversion to any space-efficient format such as CSR or CSC)

Breadth First Search

(Reference or Customized Impl.)

Validation for BFS tree (5 rules)

64 times

iterations

Kernel 1

Kernel 2

1

2

3

1 2

3

Sampling 64 Search Keys

(8)

Kronecker Graph [Leskovic, PKDD2005]

Graph500 adopts a graph model called

“Kronecker Graph”

[Leskovic PKDD2005]

that simulates dynamic time-evolving

real network model that has the following

properties

•  Scale-free and power law

•  Small diameter, etc

Kronecker graph generation model

A: 0.57, B: 0.19 C: 0.19, D: 0.05

(9)

Outline

 Introduction to Graph500

 Parallel BFS Algorithm

 Graph500 Reference Implementations

 Performance Characterization of Graph500 on

TSUBAME 2.0

 Concluding Remarks

9 2011 IEEE International Symposium on Workload Characterization

(10)

Level-Synchronized Breadth-First Search

1 | for all vertex v in parallel do 2 | pred[v] ← ‐1;

3 | pred[r] ← 0 4 | Enqueue(CQ, r)

5 | While CQ != Empty do 6 | NQ ← empty

7 | for all u in CQ in parallel do 8 | | u ← Dequeue(CQ)

•  At any time, CQ (Current Queue)

is the set of vertices that must be

visited at the current level.

•  At level 1, CQ will contain the

neighbors of r, so at level 2, it will

contain their neighbors (the

neighboring vertices that have not

been visited at levels 0 or 1).

•  The algorithm maintains ^{NQ (Next}

Queue), containing the vertices

that should be visited at the next

level. After visiting all of the nodes

at each level, the queues CQ and

NQ are swapped.

Level-Synchronized BFS

(11)

Level 0

r

D E

F G H I J K L M N O P Q R S T U

NQ (Next Queue)

CQ (Current Queue)

NQ (Next Queue)

(12)

Level 0

r

D E

CQ

NQ

r

Adding vertex r to CQ

(13)

Level 1

r

D E

CQ

NQ

^D ^E

Retrieve vertex r from CQ,

and insert its adjacent

vertices to NQ

Level 0

Level 1

Level 2

(14)

Level 1 – Swap CQ and NQ

r

D E

CQ

NQ

D E

Swap CQ and NQ

Level 0

Level 1

Level 2

(15)

Level 2 – Find all the unfound adjacent vertices

with multi threads

r

D E

CQ

NQ

Multiple threads simultaneously retrieve

vertices B, C, D, E and inserts adjacent

vertices into NQ

^F ^G ^H ^I

J K L M

N O P Q

R S T U

Level 0

Level 1

Level 2

(16)

Outline

 Introduction to Graph500

 Parallel BFS Algorithm

 Graph500 Reference Implementations

 Performance Characterization of Graph500 on

TSUBAME 2.0

 Concluding Remarks

(17)

Graph500 Reference Implementations

 All the MPI implementations are based upon “Level-synchronized BFS”

 The benchmark includes reference MPI implementations that can be

categorized into two methods on whether CQ (Current Queue) is

replicated across all the nodes or not

Category Code Name Partitioning Adjacency matrix

format

Parallelism

Non-

replicated method

simple Horizontal CSR Single thread one_ sided Horizontal CSR Single thread

Replicated method

replicated- csr

Vertical CSR Multi-thread

replicated- csc

Vertical CSC Multi-thread

(18)

Partitioning Large-Scale Graph on Multiple Nodes

 Vertices and adjacency matrix are partitioned.

– Each processor has a block of vertices and outer edges

from own vertices.

Adjacency matrix Vertices

Processor

(19)

Partitioning Methods

Adjacency Matrix

Horizontal Partitioning (e.g. i p )

Vertical Partitioning (e.g. p i a )

Processor

(20)

Non-replication based method : simple

Adjacency matrix visited

Each node has a partitioned block of CQ that contains

only own vertices.

If visited[u] = 0 v

CQ

Vertex u is adjacent to v

NQ Dequeue

Inter processor communication

Enqueue

u

NQ becomes CQ at the next level.

(21)

Non-replication based method (contd.)

Edges

 

Each processor exchanges edges among each other with

asynchronous communication functions, MPI̲I and

MPI̲I v.

 

This involves all‒to‒all communication for exchanging all

the edge information

Processor

(22)

Replication-based method

Each processor has a replica of whole CQ.

CQ

The algorithm of this phase

is different between csr

and csc.

NQ CQ (next level) Each processor

makes a block of partitioned NQ.

Each processor sends NQ to all other processors.

 

In order to quickly ﬁnding a set of vertices to be handled at each

level, the current queue (CQ) – p p ‒

is replicated among all the processors.

(23)

Replication-based method

Processor

Vertices

 

Each processor has a set of its own vertices.

 

All processors send a copy of their own vertices to all

other processors with MPI̲A collective

operations when synchronization occurs at each level

(24)

Replication-based method: replicated-csr

Each MPI process looks at only its local vertex and investigate whether its

adjacent vertex exists in CQ, and if so, it is set to the local NQ which is

eventually all gathered among all the MPI processes

row pointer

CQ

NQ

For each local vertex v Adjacent vertex u

If CQ[u] = 1

NQ[v] ^← 1 CQ (and NQ) is a bitmap array.

v u

(25)

Replication-based method: replicated-csc

 Each MPI process has the replicated CQ that contains all of the vertex information as to whether a vertex exists in CQ. In contrast NQ only has information on the local vertices that each MPI process is handling.

 All of the vertices in CQ are processed in parallel by multiple threads spawned with OpenMP, If the vertex exists in CQ , then it finds a set of local vertices adjacent to the global vertex in CQ. The local vertex is added to NQ.

CQ

NQ

For all global vertex u

Adjacent vertex v

NQ[v] ^← 1

column pointer

It needs to look at all the global vertices, but this is contiguous memory

access when compared to random memory access in replicated-csr

(26)

Outline

 Introduction to Graph500

 Parallel BFS Algorithm

 Graph500 Reference Implementations

 Performance Characterization of Graph500 on

TSUBAME 2.0

 Concluding Remarks

(27)

TSUBAME 2.0 Super Computer

.

(28)

TSUBAME 2.0 System Configuration

(29)

TSUBAME 2.0 Specification

Specification

CPU Intel Westmere EP (Xeon X5670, L2 Cache: 256 KB, L3: 12MB) 2.93 GHz processors, 12 CPU Cores (24 cores with Hyper Threading) x 2 sockets per 1 node (24 CPU Cores)

RAM 54 GB

OS SUSE Linux Enterprise 11 (Linux kernel: 2.6.32)

# of Total Nodes 1466 nodes (We only tested up to 128 nodes) Network Topology Full-Bisection Fat-Tree Topology

Network Voltaire / Mellanox Dual-rail QDR Infiniband (40Gbps x2 = 80 Gbps)

GPGPU Three NVIDIA Fermi M2050 GPUs (*Not used for this work) GCC and OpenMP GCC 4.3.4 (-O3 option) , OpenMP 3.0

OpenMPI OpenMPI 1.5.3, MVAPICH 1.6.1

(30)

Strong-Scaling Performance Comparison

1.00E+07 1.00E+08 1.00E+09 1.00E+10

1 2 4 8 16 32 64 128 TEPS

# of nodes simple

replicated‐csr replicated‐csc

1 10 100

1 2 4 8 16 32 64 128

Speedup against 1 node

# of nodes simple

Although replication-based methods outperform non-replication based method, none of them shows scalability with larger number of nodes and saturated around 32

nodes.

Scale 26, OpenMPI

(31)

Weak-Scaling Performance Comparison

 This experiment fixes the problem size for each node, which allows us to see how the linear scalability is achieved and how much of the performance

degradation is due to the communication and level synchronization

 All of the methods show the performance degradation with larger number of nodes in a weak-scaling setting

0.00E+00 5.00E+07 1.00E+08 1.50E+08 2.00E+08 2.50E+08 3.00E+08 3.50E+08 4.00E+08 4.50E+08

1 2 4 8 16 32 64

TEPS

# of nodes

simple

replicated‐csr

0.00E+00 5.00E+07 1.00E+08 1.50E+08 2.00E+08 2.50E+08 3.00E+08 3.50E+08 4.00E+08 4.50E+08

1 2 4 8 16 32 64

TEPS

# of nodes

simple

Scale 24 per node Scale 26 per node^*1

*¹ Scale 26 per node is the largest problem size that one node can handle (24 cores, 52 GB RAM) by consuming

17.71 GB in the CSC format

Scale : 30

Scale : 24 Scale : 26

Scale : 32

(32)

Profiling Communication Message Size

 With non-replication based method, the communication message size becomes large around the half level (total level : 8)

 With replication based method, aggregated message size become linearly larger with the number of nodes

10000000 100000000 1E+09 1E+10 1E+11 1E+12

2 4 8 16 32 64 128

# of MPI processes

Aggregated Message Size for replicated methods

0 2E+11 4E+11 6E+11 8E+11 1E+12 1.2E+12

1 2 3 4 5 6 7 8

Level (Depth)

sim‐29‐8 sim‐30‐16 sim‐31‐32

simple Replication-based method

Data size (byte) Data size (byte)

Scale 26, OpenMPI

Scale # of Nodes

1GB

1TB 1TB

(33)

Profiling Execution Time at Each Level

•  Replicated-csc and simple shows similar because the replicated-csc

implementation needs to find all of the vertices in CQ that contain all the global vertex information by checking the corresponding bit in CQ. Thus when CQ contains more vertices at higher level, its processing time increases.

•  In contrast, the replicated-csr implementation only checks whether adjacent

“local” vertices are unvisited in CQ, and thus as the number of unvisited vertices decreases , the processing time also decreases

0 2 4 6 8

1 2 3 4 5 6 7 8

ExecuIon Time (seconds)

Level

rep‐26‐1 rep‐31‐32

0 2 4 6 8

1 2 3 4 5 6 7 8

Level

csc‐26‐1 csc‐31‐32

replicated-csc replicated-csr

0 2 4 6 8

1 2 3 4 5 6 7 8

Level

sim‐26‐1 sim‐31‐32

simple

(34)

Profiling Computation, Comm., and Stall Times

0 2 4 6 8 10 12

1 2 4 8 16 32

Elapsed Time (seconds)

# of nodes

computaOon communicaOon stall

0 2 4 6 8 10 12

1 2 4 8 16 32 64

Elapsed Time (seconds)

# of nodes

computaOon communicaOon stall

replicated-csr replicated-csc

•  Communication and stall (synchronization) time grows with larger

number of nodes.

Weak-Scaling : Scale 26 per node

(35)

Outline

 Introduction to Graph500

 Parallel BFS Algorithm

 Graph500 Reference Implementations

 Performance Characterization of Graph500 on

TSUBAME 2.0

 Concluding Remarks

(36)

Related Work

 A Scalable Distributed Parallel Breadth-First Search Algorithm

on BlueGene/L [Andy, SC2005]

– Proposes 2D Partitioning Technique and optimization on BlueGene/L

 Scalable Graph Exploration on Multicore Processors

[Agarwal, SC2010]

– An efficient and scalable BFS algorithm for commodity multicore processors such as the 8-core Intel Nehalem EX processor

 Desigining Multhreaded Algorithms for Breadth-First Search

and st-connectivity on the Cray MTA-2 [Bader, ICPP 2006]

 Accelerating large graph algorithms on the GPU using CUDA

[Harish, HiPC 2007]

(37)

Concluding Remarks and Ongoing Work

 Concluding Remarks

– To demonstrate the performance characteristics of reference

implementations provided by Graph500 on commodity super

computers such as TSUBAME 2.0

– To provide a thorough guide for high performance graph search

algorithm on large-scale distributed environments

 Ongoing Work

– We designed and implemented our scalable and optimized BFS

method on TSUBAME 2.0 based upon the thorough study published in

IISWC 2011.

– Looking forward to the next Graph500 ranking list announced in

SC2011 next week 

(38)

11.3 21.3

37.2

63.5

99.0

0 20 40 60 80 100 120

0 256 512 768 1024

TEPS (GE/s)

# of nodes

Our Highly Scalable BFS Method

 We designed and implemented an optimized method based on 2D based partitioning and other various optimization methods such as communication compression and vertex sorting.

 Our optimized implementation can solve BFS (Breadth First Search) of large-scale graph with 2³⁶_（68.7 billion_）vertices and 2⁴⁰_（1.1 trillion_）edges for 10.58 seconds with 1366 nodes and 16392 CPU cores on TSUBAME 2.0

 This record corresponds to 103.9 GE/s (TEPS)

Performance of Our OpOmized ImplementaOon with Scale 26 per 1 node

0 5 10 15 20 25

0 32 64 96 128

TEPS (GE/s)

# of nodes optimized

simple

replicated-csr replicated-csc

Performance Comparison with Reference ImplementaOons (simple, replicated‐csr and replicated‐csc) and Scale 24 per 1 node

Graph500 への挑戦 鈴村研究室 大規模データ処理・ストリームコンピューティング

Performance Characterization of Graph500

on Large Scale Distributed Environment

Toyotaro Suzumura

, Koji Ueno

,

Hitoshi Sato

, Katsuki Fujisawa

and Satoshi Matsuoka

Outline

 Introduction to Graph500

 Parallel BFS Algorithm

 Graph500 Reference Implementations

 Performance Characterization of Graph500 on

TSUBAME 2.0

 Concluding Remarks

Large-Scale Graph Mining is Everywhere

 New Graph Search Based Benchmark for

Ranking Supercomputers

 BFS (Breadth First Search) from a single

vertex on a static, undirected Kronecker

graph with average vertex degree 16.

 Evaluation criteria: TEPS (Traversed Edges

Per Second), and problem size that can be

solved on a system, minimum execution

time.

 Problem Size: SCALE

– # of Vertices 2

, # of Edges 2

 Reference MPI, shared memory

implementations provided.

Graph500 (http://www.graph500.org)

Graph500 Ranking in June 2011

SCALE

Top-ranked score with the

latest ranking rule

Motivation of Our Work

 To conduct detailed performance analysis of Graph500

reference implementations on the currently 5

-ranked (June,

2011) TSUBAME 2.0 supercomputer.

 To provide detailed analysis to other researchers that targets

the high performance score on their supercomputers

Benchmark Flow

Graph (Edge List) Generation

Graph Construction

Breadth First Search

(Reference or Customized Impl.)

Validation for BFS tree (5 rules)

64 times

iterations

Kernel 1

Kernel 2

Kronecker Graph [Leskovic, PKDD2005]

Graph500 adopts a graph model called

“Kronecker Graph”

that simulates dynamic time-evolving

real network model that has the following

properties

• Scale-free and power law

• Small diameter, etc

Outline

 Introduction to Graph500

 Parallel BFS Algorithm

 Graph500 Reference Implementations

 Performance Characterization of Graph500 on

TSUBAME 2.0

 Concluding Remarks

Level-Synchronized Breadth-First Search

• At any time, CQ (Current Queue)

is the set of vertices that must be

visited at the current level.

• At level 1, CQ will contain the

neighbors of r, so at level 2, it will

contain their neighbors (the

neighboring vertices that have not

been visited at levels 0 or 1).

• The algorithm maintains NQ (Next

Queue), containing the vertices

that should be visited at the next

Graph500 への挑戦鈴村研究室大規模データ処理・ストリームコンピューティング

 Problem Size: ^SCALE

•  Scale-free and power law

•  Small diameter, etc

•  At any time, CQ (Current Queue)

•  At level 1, CQ will contain the

•  The algorithm maintains ^{NQ (Next}