• 検索結果がありません。

Graph500 への挑戦 鈴村研究室 大規模データ処理・ストリームコンピューティング

N/A
N/A
Protected

Academic year: 2018

シェア "Graph500 への挑戦 鈴村研究室 大規模データ処理・ストリームコンピューティング"

Copied!
39
0
0

読み込み中.... (全文を見る)

全文

(1)

Performance Characterization of Graph500

on Large Scale Distributed Environment

Toyotaro Suzumura

1,2,4

, Koji Ueno

1,4

,

Hitoshi Sato

1,4

, Katsuki Fujisawa

1,3

and Satoshi Matsuoka

1,4

1 Tokyo Institute of Technology

2 IBM Research – Tokyo, 3 Chuo University, 4 JST CREST

2011 IEEE International Symposium on Workload Characterization

(2)

Outline

Introduction to Graph500

Parallel BFS Algorithm

Graph500 Reference Implementations

Performance Characterization of Graph500 on

TSUBAME 2.0

Concluding Remarks

2011 IEEE International Symposium on Workload Characterization

(3)

Large-Scale Graph Mining is Everywhere

Internet Map

[lumeta.com] Food Web [Martinez

91]

Protein Interactions [genomebiology.com] Friendship Network

[Moody 01]

(4)

New Graph Search Based Benchmark for

Ranking Supercomputers

BFS (Breadth First Search) from a single

vertex on a static, undirected Kronecker

graph with average vertex degree 16.

Evaluation criteria: TEPS (Traversed Edges

Per Second), and problem size that can be

solved on a system, minimum execution

time.

Problem Size: SCALE

# of Vertices 2

SCALE

, # of Edges 2

SCALE+4

Reference MPI, shared memory

implementations provided.

Graph500 (http://www.graph500.org)

(5)

Graph500 Ranking in June 2011

2011 IEEE International Symposium on Workload Characterization

SCALE

Top-ranked score with the

latest ranking rule

(6)

Motivation of Our Work

To conduct detailed performance analysis of Graph500

reference implementations on the currently 5

th

-ranked (June,

2011) TSUBAME 2.0 supercomputer.

To provide detailed analysis to other researchers that targets

the high performance score on their supercomputers

(7)

Benchmark Flow

Graph (Edge List) Generation

Graph Construction

(Conversion to any space-efficient format such as CSR or CSC)

Breadth First Search

(Reference or Customized Impl.)

Validation for BFS tree (5 rules)

64 times

iterations

Kernel 1

Kernel 2

1

2

2

3

1 2

3

Sampling 64 Search Keys

(8)

Kronecker Graph [Leskovic, PKDD2005]

Graph500 adopts a graph model called

“Kronecker Graph”

[Leskovic PKDD2005]

that simulates dynamic time-evolving

real network model that has the following

properties

•  Scale-free and power law

•  Small diameter, etc

Kronecker graph generation model

A: 0.57, B: 0.19 C: 0.19, D: 0.05

(9)

Outline

Introduction to Graph500

Parallel BFS Algorithm

Graph500 Reference Implementations

Performance Characterization of Graph500 on

TSUBAME 2.0

Concluding Remarks

9 2011 IEEE International Symposium on Workload Characterization

(10)

Level-Synchronized Breadth-First Search

1  |  for all vertex v in parallel do  2  |       pred[v] ← ‐1; 

3  |  pred[r] ← 0  4  | Enqueue(CQ, r) 

5  | While CQ != Empty do   6  |      NQ ← empty 

7  |       for all u in CQ in parallel do  8  |         |    u ← Dequeue(CQ) 

9  |         | for each v adjacent to u in parallel do   10|         |       |  if pred[v] = ‐1 then  

11|         |        |    |   pred[v] ← u;  12|         |        |    |   Enqueue(NQ, v)  13|       swap(CQ, NQ); 

•  At any time, CQ (Current Queue)

is the set of vertices that must be

visited at the current level.

•  At level 1, CQ will contain the

neighbors of r, so at level 2, it will

contain their neighbors (the

neighboring vertices that have not

been visited at levels 0 or 1).

•  The algorithm maintains NQ (Next

Queue), containing the vertices

that should be visited at the next

level. After visiting all of the nodes

at each level, the queues CQ and

NQ are swapped.

Level-Synchronized BFS

(11)

Level 0

r

D E

F G H I J K L M N O P Q R S T U

NQ (Next Queue)

CQ (Current Queue)

NQ (Next Queue)

(12)

Level 0

r

D E

F G H I J K L M N O P Q R S T U

CQ

NQ

r

Adding vertex r to CQ

(13)

Level 1

r

D E

F G H I J K L M N O P Q R S T U

CQ

NQ

D E

Retrieve vertex r from CQ,

and insert its adjacent

vertices to NQ

Level 0

Level 1

Level 2

(14)

Level 1 – Swap CQ and NQ

r

D E

F G H I J K L M N O P Q R S T U

CQ

NQ

D E

Swap CQ and NQ

Level 0

Level 1

Level 2

(15)

Level 2 – Find all the unfound adjacent vertices

with multi threads

r

D E

F G H I J K L M N O P Q R S T U

CQ

NQ

Multiple threads simultaneously retrieve

vertices B, C, D, E and inserts adjacent

vertices into NQ

F G H I

J K L M

N O P Q

R S T U

Level 0

Level 1

Level 2

(16)

Outline

Introduction to Graph500

Parallel BFS Algorithm

Graph500 Reference Implementations

Performance Characterization of Graph500 on

TSUBAME 2.0

Concluding Remarks

2011 IEEE International Symposium on Workload Characterization

(17)

Graph500 Reference Implementations

All the MPI implementations are based upon “Level-synchronized BFS”

The benchmark includes reference MPI implementations that can be

categorized into two methods on whether CQ (Current Queue) is

replicated across all the nodes or not

2011 IEEE International Symposium on Workload Characterization

Category Code Name Partitioning Adjacency matrix

format

Parallelism

Non-

replicated method

simple Horizontal CSR Single thread one_ sided Horizontal CSR Single thread

Replicated method

replicated- csr

Vertical CSR Multi-thread

replicated- csc

Vertical CSC Multi-thread

(18)

Partitioning Large-Scale Graph on Multiple Nodes

Vertices and adjacency matrix are partitioned.

Each processor has a block of vertices and outer edges

from own vertices.

Adjacency matrix Vertices

Processor

(19)

Partitioning Methods

Adjacency Matrix

Horizontal Partitioning (e.g.  i p ) 

Vertical Partitioning (e.g.  p i a )

Processor

(20)

Non-replication based method : simple

Adjacency matrix visited

Each node has a partitioned block of CQ that contains

only own vertices.

If visited[u] = 0 v

CQ

Vertex u is adjacent to v

NQ Dequeue

Inter processor communication

Enqueue

u

NQ becomes CQ at the next level.

(21)

Non-replication based method (contd.)

Edges

 

Each processor exchanges edges among each other with 

asynchronous communication functions, MPI̲I  and 

MPI̲I v. 

 

This involves all‒to‒all communication for exchanging all 

the edge information 

Processor

(22)

Replication-based method

Each processor has a replica of whole CQ.

CQ

The algorithm of this phase

is different between csr

and csc.

NQ CQ (next level) Each processor

makes a block of partitioned NQ.

Each processor sends NQ to all other processors.

 

In order to quickly finding a set of vertices to be handled at each 

level, the current queue (CQ) –      p       p ‒ 

is replicated among all the processors. 

(23)

Replication-based method

Processor

Vertices

 

Each processor has a set of its own vertices.

 

All processors send a copy of their own vertices to all 

other processors with MPI̲A  collective 

operations when synchronization occurs at each level 

(24)

Replication-based method: replicated-csr

Each MPI process looks at only its local vertex and investigate whether its

adjacent vertex exists in CQ, and if so, it is set to the local NQ which is

eventually all gathered among all the MPI processes

row pointer

CQ

NQ

For each local vertex v Adjacent vertex u

If CQ[u] = 1

NQ[v] 1 CQ (and NQ) is a bitmap array.

v u

(25)

Replication-based method: replicated-csc

Each MPI process has the replicated CQ that contains all of the vertex information as to whether a vertex exists in CQ. In contrast NQ only has information on the local vertices that each MPI process is handling.

All of the vertices in CQ are processed in parallel by multiple threads spawned with OpenMP, If the vertex exists in CQ , then it finds a set of local vertices adjacent to the global vertex in CQ. The local vertex is added to NQ.

CQ

NQ

For all global vertex u

Adjacent vertex v

NQ[v] 1

column pointer

It needs to look at all the global vertices, but this is contiguous memory

access when compared to random memory access in replicated-csr

(26)

Outline

Introduction to Graph500

Parallel BFS Algorithm

Graph500 Reference Implementations

Performance Characterization of Graph500 on

TSUBAME 2.0

Concluding Remarks

2011 IEEE International Symposium on Workload Characterization

(27)

TSUBAME 2.0 Super Computer

.

(28)

TSUBAME 2.0 System Configuration

(29)

TSUBAME 2.0 Specification

Specification

CPU Intel Westmere EP (Xeon X5670, L2 Cache: 256 KB, L3: 12MB) 2.93 GHz processors, 12 CPU Cores (24 cores with Hyper Threading) x 2 sockets per 1 node (24 CPU Cores)

RAM 54 GB

OS SUSE Linux Enterprise 11 (Linux kernel: 2.6.32)

# of Total Nodes 1466 nodes (We only tested up to 128 nodes) Network Topology Full-Bisection Fat-Tree Topology

Network Voltaire / Mellanox Dual-rail QDR Infiniband (40Gbps x2 = 80 Gbps)

GPGPU Three NVIDIA Fermi M2050 GPUs (*Not used for this work) GCC and OpenMP GCC 4.3.4 (-O3 option) , OpenMP 3.0

OpenMPI OpenMPI 1.5.3, MVAPICH 1.6.1

(30)

Strong-Scaling Performance Comparison

2011 IEEE International Symposium on Workload Characterization

1.00E+07  1.00E+08  1.00E+09  1.00E+10 

16  32  64  128  TEPS

# of nodes simple 

replicated‐csr  replicated‐csc 

10  100 

16  32  64  128 

Speedup against 1 node

# of nodes simple 

replicated‐csr  replicated‐csc 

Although replication-based methods outperform non-replication based method, none of them shows scalability with larger number of nodes and saturated around 32

nodes.

Scale 26, OpenMPI

(31)

Weak-Scaling Performance Comparison

This experiment fixes the problem size for each node, which allows us to see how the linear scalability is achieved and how much of the performance

degradation is due to the communication and level synchronization

All of the methods show the performance degradation with larger number of nodes in a weak-scaling setting

2011 IEEE International Symposium on Workload Characterization

0.00E+00  5.00E+07  1.00E+08  1.50E+08  2.00E+08  2.50E+08  3.00E+08  3.50E+08  4.00E+08  4.50E+08 

16  32  64 

TEPS

# of nodes

simple 

replicated‐csr 

0.00E+00  5.00E+07  1.00E+08  1.50E+08  2.00E+08  2.50E+08  3.00E+08  3.50E+08  4.00E+08  4.50E+08 

16  32  64 

TEPS

# of nodes

simple 

replicated‐csr  replicated‐csc 

Scale 24 per node Scale 26 per node*1

*1 Scale 26 per node is the largest problem size that one node can handle (24 cores, 52 GB RAM) by consuming

17.71 GB in the CSC format

Scale : 30

Scale : 24 Scale : 26

Scale : 32

(32)

Profiling Communication Message Size

With non-replication based method, the communication message size becomes large around the half level (total level : 8)

With replication based method, aggregated message size become linearly larger with the number of nodes

2011 IEEE International Symposium on Workload Characterization

10000000  100000000  1E+09  1E+10  1E+11  1E+12 

16  32  64  128 

# of MPI processes 

Aggregated Message Size for  replicated methods 

2E+11  4E+11  6E+11  8E+11  1E+12  1.2E+12 

Level (Depth)

sim‐29‐8  sim‐30‐16  sim‐31‐32 

simple Replication-based method

Data size (byte) Data size (byte)

Scale 26, OpenMPI

Scale # of Nodes

1GB

1TB 1TB

(33)

Profiling Execution Time at Each Level

2011 IEEE International Symposium on Workload Characterization

•  Replicated-csc and simple shows similar because the replicated-csc

implementation needs to find all of the vertices in CQ that contain all the global vertex information by checking the corresponding bit in CQ. Thus when CQ contains more vertices at higher level, its processing time increases.

•  In contrast, the replicated-csr implementation only checks whether adjacent

“local” vertices are unvisited in CQ, and thus as the number of unvisited vertices decreases , the processing time also decreases

ExecuIon Time (seconds)

Level

rep‐26‐1  rep‐31‐32 

1  2  3  4  5  6  7  8 

ExecuIon Time (seconds)

Level

csc‐26‐1  csc‐31‐32 

replicated-csc replicated-csr

1  2  3  4  5  6  7  8 

ExecuIon Time (seconds)

Level

sim‐26‐1  sim‐31‐32 

simple

(34)

Profiling Computation, Comm., and Stall Times

2011 IEEE International Symposium on Workload Characterization

10  12 

1  2  4  8  16  32 

Elapsed Time (seconds)

# of nodes

computaOon  communicaOon  stall 

0  2  4  6  8  10  12 

1  2  4  8  16  32  64 

Elapsed Time (seconds)

# of nodes

computaOon  communicaOon  stall 

replicated-csr replicated-csc

•  Communication and stall (synchronization) time grows with larger

number of nodes.

Weak-Scaling : Scale 26 per node

(35)

Outline

Introduction to Graph500

Parallel BFS Algorithm

Graph500 Reference Implementations

Performance Characterization of Graph500 on

TSUBAME 2.0

Concluding Remarks

2011 IEEE International Symposium on Workload Characterization

(36)

Related Work

A Scalable Distributed Parallel Breadth-First Search Algorithm

on BlueGene/L [Andy, SC2005]

Proposes 2D Partitioning Technique and optimization on BlueGene/L

Scalable Graph Exploration on Multicore Processors

[Agarwal, SC2010]

An efficient and scalable BFS algorithm for commodity multicore processors such as the 8-core Intel Nehalem EX processor

Desigining Multhreaded Algorithms for Breadth-First Search

and st-connectivity on the Cray MTA-2 [Bader, ICPP 2006]

Accelerating large graph algorithms on the GPU using CUDA

[Harish, HiPC 2007]

(37)

Concluding Remarks and Ongoing Work

Concluding Remarks

To demonstrate the performance characteristics of reference

implementations provided by Graph500 on commodity super

computers such as TSUBAME 2.0

To provide a thorough guide for high performance graph search

algorithm on large-scale distributed environments

Ongoing Work

We designed and implemented our scalable and optimized BFS

method on TSUBAME 2.0 based upon the thorough study published in

IISWC 2011.

Looking forward to the next Graph500 ranking list announced in

SC2011 next week 

2011 IEEE International Symposium on Workload Characterization

(38)

11.3 21.3

37.2

63.5

99.0

0 20 40 60 80 100 120

0 256 512 768 1024

TEPS (GE/s)

# of nodes

Our Highly Scalable BFS Method

We designed and implemented an optimized method based on 2D based partitioning and other various optimization methods such as communication compression and vertex sorting.

Our optimized implementation can solve BFS (Breadth First Search) of large-scale graph with 23668.7 billionvertices and 2401.1 trillionedges for 10.58 seconds with 1366 nodes and 16392 CPU cores on TSUBAME 2.0

This record corresponds to 103.9 GE/s (TEPS)

Performance of Our OpOmized ImplementaOon   with Scale 26 per 1 node 

0 5 10 15 20 25

0 32 64 96 128

TEPS (GE/s)

# of nodes optimized

simple

replicated-csr replicated-csc

Performance Comparison with Reference ImplementaOons  (simple, replicated‐csr and replicated‐csc) and Scale 24 per 1 node 

(39)

9

Questions

? ?

Thank You

2011 IEEE International Symposium on Workload Characterization

参照

関連したドキュメント

Analogs of this theorem were proved by Roitberg for nonregular elliptic boundary- value problems and for general elliptic systems of differential equations, the mod- ified scale of

Then it follows immediately from a suitable version of “Hensel’s Lemma” [cf., e.g., the argument of [4], Lemma 2.1] that S may be obtained, as the notation suggests, as the m A

Definition An embeddable tiled surface is a tiled surface which is actually achieved as the graph of singular leaves of some embedded orientable surface with closed braid

This paper gives a decomposition of the characteristic polynomial of the adjacency matrix of the tree T (d, k, r) , obtained by attaching copies of B(d, k) to the vertices of

Correspondingly, the limiting sequence of metric spaces has a surpris- ingly simple description as a collection of random real trees (given below) in which certain pairs of

If a non-saturated subset in the set of weights of the kth fundamental representation of SL(n) is found, then the analogous non-saturated subset exists in the set of weights of the

The pair ( Q , P ) is then identified with one of the diagrams in this set. To carry it out, start by forming the diagram with P in the top a rows and Q below it. If all violations

For each path of an extended formation connecting vertices in the inner area to vertices in the outer area, consider a vertex, called turning vertex, which is placed in cs b and