Performance Characterization of Graph500
on Large Scale Distributed Environment
Toyotaro Suzumura
1,2,4, Koji Ueno
1,4,
Hitoshi Sato
1,4, Katsuki Fujisawa
1,3and Satoshi Matsuoka
1,41 Tokyo Institute of Technology
2 IBM Research – Tokyo, 3 Chuo University, 4 JST CREST
2011 IEEE International Symposium on Workload Characterization
Outline
Introduction to Graph500
Parallel BFS Algorithm
Graph500 Reference Implementations
Performance Characterization of Graph500 on
TSUBAME 2.0
Concluding Remarks
2011 IEEE International Symposium on Workload Characterization
Large-Scale Graph Mining is Everywhere
Internet Map
[lumeta.com] Food Web [Martinez
’91]
Protein Interactions [genomebiology.com] Friendship Network
[Moody ’01]
New Graph Search Based Benchmark for
Ranking Supercomputers
BFS (Breadth First Search) from a single
vertex on a static, undirected Kronecker
graph with average vertex degree 16.
Evaluation criteria: TEPS (Traversed Edges
Per Second), and problem size that can be
solved on a system, minimum execution
time.
Problem Size: SCALE
– # of Vertices 2
SCALE, # of Edges 2
SCALE+4 Reference MPI, shared memory
implementations provided.
Graph500 (http://www.graph500.org)
Graph500 Ranking in June 2011
2011 IEEE International Symposium on Workload Characterization
SCALE
Top-ranked score with the
latest ranking rule
Motivation of Our Work
To conduct detailed performance analysis of Graph500
reference implementations on the currently 5
th-ranked (June,
2011) TSUBAME 2.0 supercomputer.
To provide detailed analysis to other researchers that targets
the high performance score on their supercomputers
Benchmark Flow
Graph (Edge List) Generation
Graph Construction
(Conversion to any space-efficient format such as CSR or CSC)
Breadth First Search
(Reference or Customized Impl.)
Validation for BFS tree (5 rules)
64 times
iterations
Kernel 1
Kernel 2
1
2
2
3
1 2
3
Sampling 64 Search Keys
Kronecker Graph [Leskovic, PKDD2005]
Graph500 adopts a graph model called
“Kronecker Graph”
[Leskovic PKDD2005]that simulates dynamic time-evolving
real network model that has the following
properties
• Scale-free and power law
• Small diameter, etc
Kronecker graph generation model
A: 0.57, B: 0.19 C: 0.19, D: 0.05
Outline
Introduction to Graph500
Parallel BFS Algorithm
Graph500 Reference Implementations
Performance Characterization of Graph500 on
TSUBAME 2.0
Concluding Remarks
9 2011 IEEE International Symposium on Workload Characterization
Level-Synchronized Breadth-First Search
1 | for all vertex v in parallel do 2 | pred[v] ← ‐1;
3 | pred[r] ← 0 4 | Enqueue(CQ, r)
5 | While CQ != Empty do 6 | NQ ← empty
7 | for all u in CQ in parallel do 8 | | u ← Dequeue(CQ)
9 | | for each v adjacent to u in parallel do 10| | | if pred[v] = ‐1 then
11| | | | pred[v] ← u; 12| | | | Enqueue(NQ, v) 13| swap(CQ, NQ);
• At any time, CQ (Current Queue)
is the set of vertices that must be
visited at the current level.
• At level 1, CQ will contain the
neighbors of r, so at level 2, it will
contain their neighbors (the
neighboring vertices that have not
been visited at levels 0 or 1).
• The algorithm maintains NQ (Next
Queue), containing the vertices
that should be visited at the next
level. After visiting all of the nodes
at each level, the queues CQ and
NQ are swapped.
Level-Synchronized BFS
Level 0
r
D E
F G H I J K L M N O P Q R S T U
NQ (Next Queue)
CQ (Current Queue)
NQ (Next Queue)
Level 0
r
D E
F G H I J K L M N O P Q R S T U
CQ
NQ
r
Adding vertex r to CQ
Level 1
r
D E
F G H I J K L M N O P Q R S T U
CQ
NQ
D ERetrieve vertex r from CQ,
and insert its adjacent
vertices to NQ
Level 0
Level 1
Level 2
Level 1 – Swap CQ and NQ
r
D E
F G H I J K L M N O P Q R S T U
CQ
NQ
D E
Swap CQ and NQ
Level 0
Level 1
Level 2
Level 2 – Find all the unfound adjacent vertices
with multi threads
r
D E
F G H I J K L M N O P Q R S T U
CQ
NQ
Multiple threads simultaneously retrieve
vertices B, C, D, E and inserts adjacent
vertices into NQ
F G H IJ K L M
N O P Q
R S T U
Level 0
Level 1
Level 2
Outline
Introduction to Graph500
Parallel BFS Algorithm
Graph500 Reference Implementations
Performance Characterization of Graph500 on
TSUBAME 2.0
Concluding Remarks
2011 IEEE International Symposium on Workload Characterization
Graph500 Reference Implementations
All the MPI implementations are based upon “Level-synchronized BFS”
The benchmark includes reference MPI implementations that can be
categorized into two methods on whether CQ (Current Queue) is
replicated across all the nodes or not
2011 IEEE International Symposium on Workload Characterization
Category Code Name Partitioning Adjacency matrix
format
Parallelism
Non-
replicated method
simple Horizontal CSR Single thread one_ sided Horizontal CSR Single thread
Replicated method
replicated- csr
Vertical CSR Multi-thread
replicated- csc
Vertical CSC Multi-thread
Partitioning Large-Scale Graph on Multiple Nodes
Vertices and adjacency matrix are partitioned.
– Each processor has a block of vertices and outer edges
from own vertices.
Adjacency matrix Vertices
Processor
Partitioning Methods
Adjacency Matrix
Horizontal Partitioning (e.g. i p )
Vertical Partitioning (e.g. p i a )
Processor
Non-replication based method : simple
Adjacency matrix visited
Each node has a partitioned block of CQ that contains
only own vertices.
If visited[u] = 0 v
CQ
Vertex u is adjacent to v
NQ Dequeue
Inter processor communication
Enqueue
u
NQ becomes CQ at the next level.
Non-replication based method (contd.)
Edges
Each processor exchanges edges among each other with
asynchronous communication functions, MPI̲I and
MPI̲I v.
This involves all‒to‒all communication for exchanging all
the edge information
Processor
Replication-based method
Each processor has a replica of whole CQ.
CQ
The algorithm of this phase
is different between csr
and csc.
NQ CQ (next level) Each processor
makes a block of partitioned NQ.
Each processor sends NQ to all other processors.
In order to quickly finding a set of vertices to be handled at each
level, the current queue (CQ) – p p ‒
is replicated among all the processors.
Replication-based method
Processor
Vertices
Each processor has a set of its own vertices.
All processors send a copy of their own vertices to all
other processors with MPI̲A collective
operations when synchronization occurs at each level
Replication-based method: replicated-csr
Each MPI process looks at only its local vertex and investigate whether its
adjacent vertex exists in CQ, and if so, it is set to the local NQ which is
eventually all gathered among all the MPI processes
row pointer
CQ
NQ
For each local vertex v Adjacent vertex u
If CQ[u] = 1
NQ[v] ← 1 CQ (and NQ) is a bitmap array.
v u
Replication-based method: replicated-csc
Each MPI process has the replicated CQ that contains all of the vertex information as to whether a vertex exists in CQ. In contrast NQ only has information on the local vertices that each MPI process is handling.
All of the vertices in CQ are processed in parallel by multiple threads spawned with OpenMP, If the vertex exists in CQ , then it finds a set of local vertices adjacent to the global vertex in CQ. The local vertex is added to NQ.
CQ
NQ
For all global vertex u
Adjacent vertex v
NQ[v] ← 1
column pointer
It needs to look at all the global vertices, but this is contiguous memory
access when compared to random memory access in replicated-csr
Outline
Introduction to Graph500
Parallel BFS Algorithm
Graph500 Reference Implementations
Performance Characterization of Graph500 on
TSUBAME 2.0
Concluding Remarks
2011 IEEE International Symposium on Workload Characterization
TSUBAME 2.0 Super Computer
.
TSUBAME 2.0 System Configuration
TSUBAME 2.0 Specification
Specification
CPU Intel Westmere EP (Xeon X5670, L2 Cache: 256 KB, L3: 12MB) 2.93 GHz processors, 12 CPU Cores (24 cores with Hyper Threading) x 2 sockets per 1 node (24 CPU Cores)
RAM 54 GB
OS SUSE Linux Enterprise 11 (Linux kernel: 2.6.32)
# of Total Nodes 1466 nodes (We only tested up to 128 nodes) Network Topology Full-Bisection Fat-Tree Topology
Network Voltaire / Mellanox Dual-rail QDR Infiniband (40Gbps x2 = 80 Gbps)
GPGPU Three NVIDIA Fermi M2050 GPUs (*Not used for this work) GCC and OpenMP GCC 4.3.4 (-O3 option) , OpenMP 3.0
OpenMPI OpenMPI 1.5.3, MVAPICH 1.6.1
Strong-Scaling Performance Comparison
2011 IEEE International Symposium on Workload Characterization
1.00E+07 1.00E+08 1.00E+09 1.00E+10
1 2 4 8 16 32 64 128 TEPS
# of nodes simple
replicated‐csr replicated‐csc
1 10 100
1 2 4 8 16 32 64 128
Speedup against 1 node
# of nodes simple
replicated‐csr replicated‐csc
Although replication-based methods outperform non-replication based method, none of them shows scalability with larger number of nodes and saturated around 32
nodes.
Scale 26, OpenMPI
Weak-Scaling Performance Comparison
This experiment fixes the problem size for each node, which allows us to see how the linear scalability is achieved and how much of the performance
degradation is due to the communication and level synchronization
All of the methods show the performance degradation with larger number of nodes in a weak-scaling setting
2011 IEEE International Symposium on Workload Characterization
0.00E+00 5.00E+07 1.00E+08 1.50E+08 2.00E+08 2.50E+08 3.00E+08 3.50E+08 4.00E+08 4.50E+08
1 2 4 8 16 32 64
TEPS
# of nodes
simple
replicated‐csr
0.00E+00 5.00E+07 1.00E+08 1.50E+08 2.00E+08 2.50E+08 3.00E+08 3.50E+08 4.00E+08 4.50E+08
1 2 4 8 16 32 64
TEPS
# of nodes
simple
replicated‐csr replicated‐csc
Scale 24 per node Scale 26 per node*1
*1 Scale 26 per node is the largest problem size that one node can handle (24 cores, 52 GB RAM) by consuming
17.71 GB in the CSC format
Scale : 30
Scale : 24 Scale : 26
Scale : 32
Profiling Communication Message Size
With non-replication based method, the communication message size becomes large around the half level (total level : 8)
With replication based method, aggregated message size become linearly larger with the number of nodes
2011 IEEE International Symposium on Workload Characterization
10000000 100000000 1E+09 1E+10 1E+11 1E+12
2 4 8 16 32 64 128
# of MPI processes
Aggregated Message Size for replicated methods
0 2E+11 4E+11 6E+11 8E+11 1E+12 1.2E+12
1 2 3 4 5 6 7 8
Level (Depth)
sim‐29‐8 sim‐30‐16 sim‐31‐32
simple Replication-based method
Data size (byte) Data size (byte)
Scale 26, OpenMPI
Scale # of Nodes
1GB
1TB 1TB
Profiling Execution Time at Each Level
2011 IEEE International Symposium on Workload Characterization
• Replicated-csc and simple shows similar because the replicated-csc
implementation needs to find all of the vertices in CQ that contain all the global vertex information by checking the corresponding bit in CQ. Thus when CQ contains more vertices at higher level, its processing time increases.
• In contrast, the replicated-csr implementation only checks whether adjacent
“local” vertices are unvisited in CQ, and thus as the number of unvisited vertices decreases , the processing time also decreases
0 2 4 6 8
1 2 3 4 5 6 7 8
ExecuIon Time (seconds)
Level
rep‐26‐1 rep‐31‐32
0 2 4 6 8
1 2 3 4 5 6 7 8
ExecuIon Time (seconds)
Level
csc‐26‐1 csc‐31‐32
replicated-csc replicated-csr
0 2 4 6 8
1 2 3 4 5 6 7 8
ExecuIon Time (seconds)
Level
sim‐26‐1 sim‐31‐32
simple
Profiling Computation, Comm., and Stall Times
2011 IEEE International Symposium on Workload Characterization
0 2 4 6 8 10 12
1 2 4 8 16 32
Elapsed Time (seconds)
# of nodes
computaOon communicaOon stall
0 2 4 6 8 10 12
1 2 4 8 16 32 64
Elapsed Time (seconds)
# of nodes
computaOon communicaOon stall
replicated-csr replicated-csc
• Communication and stall (synchronization) time grows with larger
number of nodes.
Weak-Scaling : Scale 26 per node
Outline
Introduction to Graph500
Parallel BFS Algorithm
Graph500 Reference Implementations
Performance Characterization of Graph500 on
TSUBAME 2.0
Concluding Remarks
2011 IEEE International Symposium on Workload Characterization
Related Work
A Scalable Distributed Parallel Breadth-First Search Algorithm
on BlueGene/L [Andy, SC2005]
– Proposes 2D Partitioning Technique and optimization on BlueGene/L
Scalable Graph Exploration on Multicore Processors
[Agarwal, SC2010]
– An efficient and scalable BFS algorithm for commodity multicore processors such as the 8-core Intel Nehalem EX processor
Desigining Multhreaded Algorithms for Breadth-First Search
and st-connectivity on the Cray MTA-2 [Bader, ICPP 2006]
Accelerating large graph algorithms on the GPU using CUDA
[Harish, HiPC 2007]
Concluding Remarks and Ongoing Work
Concluding Remarks
– To demonstrate the performance characteristics of reference
implementations provided by Graph500 on commodity super
computers such as TSUBAME 2.0
– To provide a thorough guide for high performance graph search
algorithm on large-scale distributed environments
Ongoing Work
– We designed and implemented our scalable and optimized BFS
method on TSUBAME 2.0 based upon the thorough study published in
IISWC 2011.
– Looking forward to the next Graph500 ranking list announced in
SC2011 next week
2011 IEEE International Symposium on Workload Characterization
11.3 21.3
37.2
63.5
99.0
0 20 40 60 80 100 120
0 256 512 768 1024
TEPS (GE/s)
# of nodes
Our Highly Scalable BFS Method
We designed and implemented an optimized method based on 2D based partitioning and other various optimization methods such as communication compression and vertex sorting.
Our optimized implementation can solve BFS (Breadth First Search) of large-scale graph with 236(68.7 billion)vertices and 240(1.1 trillion)edges for 10.58 seconds with 1366 nodes and 16392 CPU cores on TSUBAME 2.0
This record corresponds to 103.9 GE/s (TEPS)
Performance of Our OpOmized ImplementaOon with Scale 26 per 1 node
0 5 10 15 20 25
0 32 64 96 128
TEPS (GE/s)
# of nodes optimized
simple
replicated-csr replicated-csc
Performance Comparison with Reference ImplementaOons (simple, replicated‐csr and replicated‐csc) and Scale 24 per 1 node
9
Questions
? ?
Thank You
2011 IEEE International Symposium on Workload Characterization