Highly Scalable Graph Search for the Graph500 Benchmark
Koji Ueno and Toyotaro Suzumura
Outline
Background
Breadth-First Search and Distributed Algorithm
2D Partitioning based Breadth-First Search
Optimizations
Performance Evaluation
Conclusions
Outline
Background
Breadth-First Search and Distributed Algorithm
2D Partitioning based Breadth-First Search
Optimizations
Performance Evaluation
Conclusions
Large-Scale Graph Mining is Everywhere
Examples: Internet maps, symbolic networks (e.g. the human brain), protein interactions, social networks, cybersecurity, medical informatics, and data enrichment.
New Graph Search Based Benchmark for Ranking Supercomputers
BFS (Breadth First Search) from a single vertex on a static, undirected Kronecker graph with average vertex degree 32.
Reference MPI, OpenMP and shared memory implementations are provided.
Graph500 (http://www.graph500.org)
Benchmark Flow
1. Graph (Edge List) Generation
2. Graph Construction (Kernel 1): conversion to any space-efficient format such as CSR or CSC
3. Sampling 64 Search Keys
4. Breadth First Search (Kernel 2, the main kernel) and Validation of the BFS tree (5 rules), iterated 64 times
Evaluation criteria: TEPS (Traversed Edges Per Second) and the problem size that can be solved on the system.
In this presentation, 1 GTEPS = 1,000,000,000 TEPS.
Problem Size: SCALE. # of Vertices = 2^SCALE, # of Edges = 2^(SCALE+4).
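As a quick sanity check of these sizes (a sketch only; the 10.58-second BFS time is taken from the conclusions of this talk):

#include <cstdint>
#include <cstdio>

int main() {
    // Graph500 problem size: 2^SCALE vertices, 2^(SCALE+4) edges (edge factor 16).
    int scale = 36;                            // the SCALE used on TSUBAME in this talk
    uint64_t vertices = 1ULL << scale;         // ~68.7 billion
    uint64_t edges    = 1ULL << (scale + 4);   // ~1.1 trillion
    // TEPS = traversed edges per second; 1.1 trillion edges in 10.58 s is ~104 GTEPS.
    double gteps = (double)edges / 10.58 / 1e9;
    std::printf("vertices=%llu edges=%llu GTEPS=%.1f\n",
                (unsigned long long)vertices, (unsigned long long)edges, gteps);
    return 0;
}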
Graph500 (http://www.graph500.org)
Rank | Machine | Owner | Problem Size | GTEPS
1 | NNSA/SC Blue Gene/Q Prototype II (4096 nodes / 65,536 cores) | NNSA and IBM Research, T.J. Watson | 32 | 254.349
2 | Lomonosov (4096 nodes / 32,768 cores) | Moscow State University | 37 | 103.251
3 | TSUBAME (2732 processors / 1366 nodes / 16,392 CPU cores) | GSIC Center, Tokyo Institute of Technology | 36 | 100.366
4 | Jugene (65,536 nodes) | Forschungszentrum Jülich | 37 | 92.8769
5 | Intrepid (32,768 nodes / 131,072 cores) | ANL | 35 | 78.8699
(Nov 2011 List)
Outline
Background
Breadth-First Search and Distributed Algorithm
2D Partitioning based Breadth-First Search
Optimizations
Performance Evaluation
Conclusions
Breadth-First Search
Outputs for each vertex:
Predecessor – a parent vertex in BFS tree
Depth – distance from source vertex
Level Synchronized BFS [David ICPP‘06]
Base algorithm of all reference implementations.
“Level Synchronized” means all vertices of a particular level are inspected before the next level.
[Figure: an example graph and its BFS tree from source vertex S, with depth levels 0, 1 and 2]
Two queues are employed for level synchronized BFS: the Current Queue (CQ) and the Next Queue (NQ).
We start BFS by putting the source vertex into CQ.
Level Synchronized BFS
Demonstration (1/5)
[Figure: the source vertex S is placed in CQ; NQ is empty; the BFS tree contains only S at level 0]
Neighbor vertices of the vertices in CQ are queued into NQ and added to the BFS tree.
Level Synchronized BFS
Demonstration (2/5)
[Figure: CQ holds S; its neighbors B, C and D are queued into NQ and added to the BFS tree at level 1]
Level Synchronized BFS
Demonstration (3/5)
After NQ and CQ are swapped, the next level is processed.
[Figure: CQ now holds B, C and D; their unvisited neighbors E and F are queued into NQ and added to the BFS tree at level 2; a duplicate request occurs when two vertices in CQ share the same neighbor]
Level Synchronized BFS
Demonstration (4/5)
[Figure: CQ now holds E and F; their unvisited neighbors G and H are queued into NQ and added to the BFS tree at level 3]
If NQ is empty, we finish BFS.
Level Synchronized BFS
Demonstration (5/5)
[Figure: CQ holds G and H, which have no unvisited neighbors; NQ stays empty, so the BFS terminates with the complete BFS tree]
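The demonstration above corresponds to the following minimal, single-node sketch of level-synchronized BFS (illustrative names such as adj; this is not the reference code or the distributed implementation):

#include <cstdint>
#include <vector>

// Level-synchronized BFS with a Current Queue (CQ) and a Next Queue (NQ).
// Outputs per vertex: pred (parent in the BFS tree) and depth (distance from source).
void bfs(const std::vector<std::vector<int64_t>>& adj, int64_t source,
         std::vector<int64_t>& pred, std::vector<int64_t>& depth) {
    const int64_t n = (int64_t)adj.size();
    pred.assign(n, -1);
    depth.assign(n, -1);
    std::vector<int64_t> cq, nq;
    cq.push_back(source);                 // start by putting the source vertex into CQ
    pred[source] = source;
    depth[source] = 0;
    int64_t level = 0;
    while (!cq.empty()) {                 // one iteration per level
        for (int64_t u : cq) {
            for (int64_t v : adj[u]) {
                if (pred[v] == -1) {      // unvisited neighbor: add to BFS tree and NQ
                    pred[v] = u;
                    depth[v] = level + 1;
                    nq.push_back(v);
                }
            }
        }
        cq.swap(nq);                      // swap CQ and NQ, then process the next level
        nq.clear();
        ++level;                          // BFS finishes when NQ (now CQ) is empty
    }
}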
Representation as Sparse Matrix Vector Multiplication (SpMV)
Level Synchronized BFS can be solved by representing it as Sparse Matrix Vector multiplication (SpMV) for distributed computing.
The processing of level 2 in the previous example is illustrated as follows.
[Figure: illustration of level 2 of the BFS search, showing the graph together with CQ and NQ]
Representation as SpMV
- Demonstration (1/4)
Example: Processing level 2
[Figure: the adjacency matrix (sparse matrix) is multiplied by the CQ vector, and the VISITED vector is subtracted to give NQ]
Representation as SpMV
- Demonstration (2/4)
“Edge Frontier” is all edges emanating from the vertices of CQ.
The edge frontier is obtained by multiplying the adjacency matrix with CQ.
[Figure: multiplying the adjacency matrix with the CQ vector (B, C and D set) produces the edge frontier]
Representation as SpMV
- Demonstration (3/4)
Then we get the neighbor vertices.
NQ is computed by excluding visited vertices (VISITED) from the neighbor vertices.
[Figure: excluding the VISITED vertices from the edge frontier leaves NQ = {E, F}]
Representation as SpMV
- Demonstration (4/4)
[Figure: the complete picture for level 2: adjacency matrix times CQ, minus VISITED, gives NQ]
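One level of this SpMV formulation can be sketched on a single node as follows, assuming CQ, NQ and VISITED are bitmaps with one bit per vertex (an illustration of the formulation only, not the distributed kernel):

#include <algorithm>
#include <cstdint>
#include <vector>

// One BFS level as "sparse matrix times bit vector":
//   edge frontier = A * CQ,   NQ = edge frontier AND NOT VISITED.
// CQ, VISITED and NQ are bitmaps of equal length (one bit per vertex).
void spmv_level(const std::vector<std::vector<int64_t>>& adj,
                const std::vector<uint64_t>& cq,
                const std::vector<uint64_t>& visited,
                std::vector<uint64_t>& nq) {
    auto test = [](const std::vector<uint64_t>& bm, int64_t v) {
        return (bm[v >> 6] >> (v & 63)) & 1ULL;
    };
    std::fill(nq.begin(), nq.end(), 0ULL);
    for (int64_t u = 0; u < (int64_t)adj.size(); ++u) {
        if (!test(cq, u)) continue;          // only rows selected by CQ contribute
        for (int64_t v : adj[u]) {           // edges emanating from CQ: the edge frontier
            if (!test(visited, v))           // exclude already visited vertices
                nq[v >> 6] |= 1ULL << (v & 63);
        }
    }
}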
Partitioning Methods for BFS
1D Partitioning BFS
The Graph500 reference implementations use 1D partitioning.
All-to-all communication is required to compute BFS.
All-to-all communication is not scalable on large distributed environments.
2D Partitioning BFS [Andy, SC’05]
A scalable partitioning method for large distributed environments.
We used 2D partitioning.
[Andy, SC’05] Andy Yoo et al. A Scalable Distributed Parallel Breadth-First Search Algorithm on BlueGene/L. SC ’05.
[Figure: 1D partitioning splits the adjacency matrix by rows; 2D partitioning splits it into a 2D grid of blocks]
Outline
Background
Breadth-First Search and Distributed Algorithm
2D Partitioning based Breadth-First Search
Optimizations
Performance Evaluation
Conclusions
2D Partitioning [Andy, SC’05]
2D Partitioning for BFS [Andy, SC’05] has been proposed to avoid all-to-all communication.
Assuming that we have 4 processors, the method introduces the notion of a “Processor Mesh” that arranges the processors in 2D.
The adjacency matrix is partitioned into 4×2 blocks and assigned to the processors in the following manner.
[Figure: the processor mesh and the 2D partitioned adjacency matrix, with each block assigned to one processor]
[Andy, SC’05] Andy Yoo et al. A Scalable Distributed Parallel Breadth-First Search Algorithm on BlueGene/L. SC ’05.
2D Partitioning
- Assigning each vertex block to each processor
With 2D partitioning, the four data structures (adjacency matrix, CQ, VISITED and NQ) are partitioned and assigned to the processors in the following manner.
[Figure: the adjacency matrix blocks and the corresponding sections of CQ, VISITED and NQ, each assigned to a processor of the mesh]
2D Partitioning
- Obtaining Edge Frontier
The edge frontier is obtained by multiplying the adjacency (sparse) matrix with the CQ vector.
This requires each processor in a processor column of the processor mesh to exchange CQ so that all of them have the same copy of CQ (called “Expand”, described later).
[Figure: each processor multiplies its block of the adjacency matrix by its copy of CQ]
2D Partitioning
- Excluding Visited vertices from edge frontier
After identifying the edge frontier, NQ is obtained by excluding visited vertices from the edge frontier.
This phase also requires communication among the processors in a processor row of the processor mesh.
[Figure: each processor subtracts its section of VISITED from the edge frontier to obtain its section of NQ]
Communication of 2D Partitioning
There are 2 communication phases at each level.
First Phase: Expand
Each processor gathers CQ from the processors in a column of the processor mesh.
Second Phase: Fold
Processors exchange the edge frontier within a row of the processor mesh.
The number of nodes involved in each communication becomes smaller than with 1D partitioning, which requires all-to-all communication.
[Figure: 1D partitioning uses all-to-all communication, while 2D partitioning uses Expand within columns and Fold within rows]
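A hedged MPI sketch of how these two phases can be set up: ranks are split into row and column communicators of the processor mesh, Expand gathers CQ within a column, and Fold exchanges frontier data within a row. The mesh width C, the buffer sizes, and the plain MPI_Alltoall used for Fold are illustrative assumptions, not the exact code (the optimized Fold is asynchronous and compressed, as described later).

#include <mpi.h>
#include <cstdint>
#include <vector>

// Illustrative setup of the two communication phases of 2D-partitioned BFS.
// Expand: allgather CQ within a column of the processor mesh.
// Fold:   exchange edge-frontier data within a row of the mesh.
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int C = 2;                  // assumed mesh width (number of columns)
    int row = rank / C, col = rank % C;
    MPI_Comm row_comm, col_comm;
    MPI_Comm_split(MPI_COMM_WORLD, row, col, &row_comm);  // processors in the same row
    MPI_Comm_split(MPI_COMM_WORLD, col, row, &col_comm);  // processors in the same column

    // Expand: every processor in a column obtains the same copy of CQ.
    int col_size;
    MPI_Comm_size(col_comm, &col_size);
    std::vector<uint64_t> cq_local(1024, 0), cq_column(1024 * col_size);
    MPI_Allgather(cq_local.data(), 1024, MPI_UINT64_T,
                  cq_column.data(), 1024, MPI_UINT64_T, col_comm);

    // Fold: frontier data is exchanged only within a row
    // (shown here as a synchronous all-to-all for simplicity).
    int row_size;
    MPI_Comm_size(row_comm, &row_size);
    std::vector<uint64_t> send(1024 * row_size, 0), recv(1024 * row_size);
    MPI_Alltoall(send.data(), 1024, MPI_UINT64_T,
                 recv.data(), 1024, MPI_UINT64_T, row_comm);

    MPI_Comm_free(&row_comm);
    MPI_Comm_free(&col_comm);
    MPI_Finalize();
    return 0;
}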
Outline
Background
Breadth-First Search and Distributed Algorithm
2D Partitioning based Breadth-First Search
Optimizations
Performance Evaluation
Conclusions
Our Proposed Approach
Our proposed approach is based upon 2D partitioning and further optimized with the following 3 methods to achieve a higher performance score:
1. Compressed Communication
2. Parallelizing Sender and Receiver Processing
3. Vertex Sorting for Improving Cache Hit Rate
Our Optimized Algorithm
[Figure: the optimized algorithm, alternating Expand and Fold phases at each level]
Compressed Communication
Compressed Communication in the Fold Phase
A different approach from the original 2D partitioning.
Original 2D partitioning [Andy, SC’05]:
They use a torus network.
Their method reduces duplicate requests on each processor in the Fold operation and decreases the total message volume.
However, reducing duplicate requests has a high computation cost.
We decided not to use this technique due to its high computation cost. Therefore, we need to reduce the communication data with another method.
[Figure: the torus network and the reduction of duplicate requests at each processor]
Compressed Communication
We introduce compressed communication to reduce the communication data.
By checking CQ in order of vertex ID, the vertices in the edge frontier will be sorted in ascending order.
We encode the difference from the prior vertex with Variable-Length Quantity (VLQ), which is used by the standard MIDI file format, Google’s protocol buffers, etc.
This method can compress a 64-bit value to 8 bits if the value is small enough.
[Table: original vertex IDs 100, 103, 303, 383, 123839, …; differences 3, 300, 80, …; VLQ encoding (hexadecimal) 03, AC 02, 50]
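A small sketch of the encoder (an assumed helper, not the authors' code); with differences 3, 300 and 80 it produces exactly the bytes 03, AC 02 and 50 shown in the example:

#include <cstdint>
#include <vector>

// Delta + VLQ encoding of an ascending list of vertex IDs:
// write only the difference from the previous ID, 7 bits per byte,
// with the high bit of each byte used as a continuation flag.
void encode_frontier(const std::vector<uint64_t>& sorted_ids,
                     std::vector<uint8_t>& out) {
    uint64_t prev = 0;
    for (uint64_t id : sorted_ids) {
        uint64_t delta = id - prev;          // small because IDs are sorted
        prev = id;
        while (delta >= 0x80) {              // emit 7 bits at a time
            out.push_back((uint8_t)(delta | 0x80));
            delta >>= 7;
        }
        out.push_back((uint8_t)delta);       // last byte: continuation bit clear
    }
}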
Parallelizing Sender and Receiver Processing
Prior approach [Aydin, SC’11]: after the entire edge frontier is inspected and stored into memory, it is exchanged among the processors.
Inefficient: this requires a large memory space, and communication and computation cannot be overlapped.
We divide the processing of the Fold phase into senders and receivers and parallelize them like stream processing.
We also use asynchronous communication in the Fold phase for efficient processing.
Sender processing extracts edges emanating from the vertices in CQ.
Receiver processing decodes edges and updates the BFS tree and NQ.
[Figure: the edge frontier flows from sender processing through asynchronous communication to receiver processing]
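One possible shape of this sender/receiver overlap, sketched with asynchronous MPI on a row communicator; the single chunk per peer, the tag, and the decode_and_update step are simplifications for illustration (the real implementation streams many chunks and also uses OpenMP threads):

#include <mpi.h>
#include <cstdint>
#include <vector>

// Overlapping sender and receiver processing in the Fold phase.
void fold_step(MPI_Comm row_comm,
               const std::vector<std::vector<uint8_t>>& compressed_out) {
    int nprocs;
    MPI_Comm_size(row_comm, &nprocs);
    std::vector<MPI_Request> reqs;

    // Sender side: post asynchronous sends of the encoded edge frontier.
    for (int p = 0; p < nprocs; ++p) {
        reqs.emplace_back();
        MPI_Isend(compressed_out[p].data(), (int)compressed_out[p].size(),
                  MPI_BYTE, p, /*tag=*/0, row_comm, &reqs.back());
    }

    // Receiver side: while the sends are in flight, decode whatever arrives
    // and update NQ and the BFS tree (decode_and_update is illustrative).
    for (int received = 0; received < nprocs; ++received) {
        MPI_Status st;
        MPI_Probe(MPI_ANY_SOURCE, 0, row_comm, &st);
        int bytes;
        MPI_Get_count(&st, MPI_BYTE, &bytes);
        std::vector<uint8_t> buf(bytes);
        MPI_Recv(buf.data(), bytes, MPI_BYTE, st.MPI_SOURCE, 0, row_comm,
                 MPI_STATUS_IGNORE);
        // decode_and_update(buf);  // decode VLQ deltas, set NQ bits, set predecessors
    }
    MPI_Waitall((int)reqs.size(), reqs.data(), MPI_STATUSES_IGNORE);
}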
Vertex Sorting for Improving Cache Hit Rate
Vertex sorting increases the locality of accesses to VISITED.
Receiver processing generates irregular memory accesses to VISITED, which wastes CPU time waiting on memory.
VISITED is represented as a bitmap.
The frequency of access to VISITED is determined by the vertex degree (the number of edges of each vertex).
We sort the vertices in decreasing order of degree to increase the locality of memory access.
The degree distribution of the Kronecker graph follows a power law.
[Figure: the power-law degree distribution of the Kronecker graph, and the VISITED bitmap before and after sorting vertices by degree]
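A sketch of the relabeling idea (the degree array and helper name are illustrative): vertices are sorted by decreasing degree, so the frequently accessed VISITED bits of high-degree vertices cluster at the front of the bitmap and tend to stay in cache.

#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

// Relabel vertices so that high-degree vertices get the smallest new IDs.
// new_id[v] gives the relabeled ID of original vertex v.
std::vector<int64_t> degree_order(const std::vector<int64_t>& degree) {
    std::vector<int64_t> order(degree.size());
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),
              [&](int64_t a, int64_t b) { return degree[a] > degree[b]; });
    std::vector<int64_t> new_id(degree.size());
    for (int64_t i = 0; i < (int64_t)order.size(); ++i) new_id[order[i]] = i;
    return new_id;
}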
Implementation Detail
Programming Model: MPI + OpenMP
CQ, NQ, VISITED: implemented using bitmaps.
A vertex is represented by one bit of the bitmap: if the bit is set, the corresponding vertex is contained in the queue; if not, it is not contained.
Expand: a single call of MPI_Allgather().
Adjacency matrix: CSR (Compressed Sparse Row) is used as the memory layout.
[Figure: a bitmap such as …0001010… with one bit per vertex]
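A minimal sketch of such a bitmap (illustrative; not the actual data structure of the implementation):

#include <algorithm>
#include <cstdint>
#include <vector>

// Bitmap queue: one bit per local vertex; a set bit means the vertex is in the queue.
struct Bitmap {
    std::vector<uint64_t> words;
    explicit Bitmap(int64_t nvertices) : words((nvertices + 63) / 64, 0) {}
    void set(int64_t v)        { words[v >> 6] |= 1ULL << (v & 63); }
    bool test(int64_t v) const { return (words[v >> 6] >> (v & 63)) & 1ULL; }
    void clear()               { std::fill(words.begin(), words.end(), 0ULL); }
};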
Outline
Background
Breadth-First Search and Distributed Algorithm
2D Partitioning based Breadth-First Search
Optimizations
Performance Evaluation
Conclusions
TSUBAME 2.0
Specification
CPU: Intel Westmere EP (Xeon X5670, L2 cache: 256 KB, L3: 12 MB), 2.93 GHz, 6 CPU cores (12 with Hyper-Threading) x 2 sockets per node (24 threads)
RAM: 54 GB
OS: SUSE Linux Enterprise 11 (Linux kernel 2.6.32)
# of Total Nodes: 1442 nodes (we only used up to 512 nodes)
Network Topology: full-bisection fat-tree topology
Network: Voltaire / Mellanox dual-rail QDR InfiniBand (40 Gbps x 2 = 80 Gbps)
GPGPU: three NVIDIA Fermi M2050 GPUs (not used)
Software
GCC: GCC 4.3.4 (-O3 option)
Performance Evaluation
- Comparison with Reference Implementations
We compare our optimized method with the reference implementations (Simple, Rep-CSR, and Rep-CSC).
The reference implementations do not run at large scale due to errors.
Our optimized implementation is fast and scalable.
[Figure: TEPS vs. # of nodes. The optimized implementation scales from about 11.3 GTEPS to 99 GTEPS as the node count grows, while the reference Simple and Rep-CSR implementations only run at small node counts and achieve far lower TEPS]
Performance Evaluation
- Effect of Optimizations
Compressed communication greatly boosts the performance.
The overall communication data volume with compressed communication is 1/3 of the base version that does not adopt the compression method.
[Figure: (left) performance as TEPS per node (MTEPS) for Reference-Simple, Base, Compressed Communication, and Compress + Vertex Sorting; (right) aggregated message size per node (GB) for Base, Compressed Communication, and the replicated reference]
Performance Evaluation
- Breakdown of Execution Time
There is communication wait time when the number of nodes is larger than 16.
Because communication and computation are completely overlapped in our implementation, the performance of our optimized implementation is limited by the communication bandwidth.
This is why vertex sorting does not show any effect with more than 16 nodes.
[Figure: (left) breakdown of the overall execution time (wait-synch, wait-comm, compute, expand) in seconds for 1 to 512 nodes; (right) performance as TEPS per node (MTEPS) for Reference-Simple, Base, Compressed Communication, and Compress + Vertex Sorting]
Related Work
BFS on Distributed Memory Systems
Andy Yoo et al. A Scalable Distributed Parallel Breadth-First Search Algorithm on BlueGene/L. SC ’05.
Aydin Buluc and Kamesh Madduri. Parallel Breadth-First Search on Distributed Memory Systems. SC ’11.
They compared the performance of 1D and 2D partitioning methods.
BFS on GPUs
Duane Merrill, Michael Garland, and Andrew Grimshaw. Scalable GPU Graph Traversal. PPoPP ’12.
BFS on Multi-Socket CPU Systems
Jatin Chhugani et al. Fast and Efficient Graph Traversal Algorithm for CPUs: Maximizing Single-Node Efficiency. IPDPS 2012.
BFS on Shared Memory Systems
David A. Bader and Kamesh Madduri. Designing Multithreaded Algorithms for Breadth-First Search and st-connectivity on the Cray MTA-2. ICPP ’06.
Conclusions
We proposed an optimized implementation of the Graph500 benchmark on a large-scale distributed memory environment.
Our optimized implementation is based on level-synchronized BFS with 2D partitioning, and we proposed several optimization methods such as communication compression and vertex sorting.
Our implementation can solve BFS on a large-scale graph with 68.7 billion vertices and 1.1 trillion edges in 10.58 seconds on 1366 nodes and 16K CPU cores, which corresponds to 103.9 GTEPS.
Because the Graph500 result is the median TEPS among 64 BFS executions, our record is 100 GTEPS, which is ranked No. 3 on the Nov 2011 list.
Recent Results
We used GPGPUs for Graph500 on TSUBAME 2.0 and obtained a performance of 317 GTEPS, which is ranked No. 4 on the Jun 2012 list.