Towards a Distributed Large-Scale Dynamic Graph Data Store (Unrefereed Workshop Manuscript)

全文

(1)Vol.2016-HPC-153 No.8 2016/3/1. IPSJ SIG Technical Report. Towards a Distributed Large-Scale Dynamic Graph Data Store (Unrefereed Workshop Manuscript) Keita Iwabuchi1,2,a). Scott Sallinen†1 Roger Pearce2 Brian Van Essen2 Satoshi Matsuoka1. Maya Gokhale2. Abstract: In many graph applications, the structure of the graph changes dynamically over time and may require real. time analysis. However, constructing a large graph is expensive, and most studies for large graphs have not focused on dynamic graph data structure. To address the issue above, we have proposed DegAwareRHH, a high performance dynamic graph data store designed for scaling out to store large scale-free graphs by leveraging compact hash tables with high data locality. We extend DegAwareRHH for multiple processes and distributed-memory, and perform dynamic graph construction on large scale-free graphs using emerging ‘Big Data HPC’ systems such as the Catalyst cluster at LLNL. We demonstrate that DegAwareRHH processes a request stream 206.5x faster than a state-of-the-art sharedmemory dynamic graph processing framework, when both implementations use 24 threads/processes to construct a graph with 1 billion edge insertion requests and 54 million edge deletion requests. DegAwareRHH also achieves processing over 2 billion edge insertion requests per second using 128 compute nodes to construct a large-scale web graph, containing 128 billion edges, the largest opensource real graph dataset to our knowledge.. 1.. Introduction. In many graph applications, the structure of a graph changes dynamically over time and may require real time analysis. Large scale graph processing often presents challenging data intensive workloads, due to the unstructured memory access patterns common to many applications. Significant research into the analysis of challenging scale-free graphs that are static (graph topology remains fixed) has been leveraged to remarkable improvements in the processing capabilities using HPC systems [4], [8], [27]; much of this work has been enabled by the bi-annual Graph500 competition [1]. Research into the algorithms and management of data structures for large scale dynamic temporal graphs, that is, graphs where the topology changes as a function of time, is at the moment immature when compared to the static scenario. While capabilities for processing large scale dynamic graphs are lacking, the need for such capabilities is strong. In this work, we present our progress towards developing the data structures and infrastructure management necessary to support dynamic graph analysis at large scale, on distributed High Performance Computing (HPC) platforms. Even with the large memory capacities of HPC systems, many graph applications require additional out-of-core memory. Such large memory requirements can be due to large scale topology, as is the case with the 128 billion edge web graph we study, or 1 2 †1 a). Tokyo Institute of Technology Lawrence Livermore National Laboratory Presently with University of British Columbia [email protected]. c 2016 Information Processing Society of Japan. rich metadata that decorates the topology. For example, Facebook manages 1.39 billion active users as of 2014, with more than 400 billion edges [9]. Due to the growth of the Internet of Things (IoT), the volume and velocity of data such as Facebook’s is expected to increase dramatically. To accommodate such large graphs we use node-local NVRAM to extend main memory, and persistently store the graph data. Node-local NVRAM (e.g., NAND Flash) has found its way into the HPC platform, driven in large part by application checkpointing needs. In this work, we present a dynamic graph data store (DegAwareRHH),which adopts open addressing compact hash tables that exhibit high data locality properties, in order to minimize the number of accesses to NVRAM. DegAwareRHH is degree aware, and uses separated compact data structures for low-degree vertices to reduce their storage and search overheads. We implemented DegAwareRHH for distributed-memory using an asynchronous visitor queue abstraction [22] [23], and perform scaling studies of large scale graph construction on up to 128 compute nodes. Summary of our contributions: • We demonstrate that DegAwareRHH can construct a graph with 1 billion edge insertion requests and 54 million edge deletion requests 206.5 times faster than a state-of-the-art shared-memory graph processing framework, when both implementations use 24 threads/processes. • We present scaling studies of constructing large-scale, synthetic graphs and a large, real-world graph (up to 275 billion edges insertion requests), scaling up to 128 compute nodes.. 1.

(2) Vol.2016-HPC-153 No.8 2016/3/1. IPSJ SIG Technical Report. 2.. Background. 2.1 Scale-free Graph Many real-world graphs can be classified as scale-free, where the distribution of vertex degrees follows a scale-free power-law distribution [2]. The degree of a vertex is the count of the number of edges connecting to the vertex. A power-law vertex degree distribution means that the majority of vertices will have a lowdegree, while a select few will have a very large degree, with degree distribution following a power-law pattern. Recent work has focused on optimizations related to the challenge of highdegree vertices which cause load imbalance for parallel computations [22] [13] [23]. In this work, we identify low-degree vertices which are particularly challenging for graph data stores in NVRAM. 2.2 NVRAM In the next generation of supercomputers, providing sufficient main memory capacity is one of the biggest challenges due to the high power consumption of DRAM. NVRAM will greatly expand the possibility of processing extremely large-scale graphs that exceed the DRAM capacity of the nodes. However, graph construction cost using NVRAM is extremely high compared to DRAM: a naive data structure implementation would cause significant performance degradation due to unstructured memory accesses to disk. 2.3 Dynamic Graph Processing In many real-world graph applications, the structure of the graph changes dynamically over time and may require real time analysis. For example, genomic sequence assembly leveraging de Bruijn graphs [10] [28] incrementally build graphs during sequence assembly. These applications construct a graph based on relatively short fragments of DNA from a sequencer. When a new sequence is generated, graph analysis and graph construction/updates are conducted simultaneously. Therefore, not only is high performance dynamic graph construction required, but also efficient data retrieval during graph analysis. 2.4 Graph Data Structure Graph processing is a highly data-intensive problem, abd thus improving the data locality of the graph data structure storage is an essential optimization [20]. Over the years, many data structure models have been studied. In this section, we discuss classic graph data structure models for static graph and dynamic graphs and their advantages and disadvantages. In the following context, we consider a graph G which has n vertices and m edges. 2.4.1 Static Graph Data Structure Adjacency-matrix – The simplest data structure model is an adjacency-matrix. The size of adjacency-matrix is n × n matrix. If there is a edge from vi to v j , an entry ai j holds a positive value or the property data of the edge. An adjacency matrix consumes O(n2 ) memory, and this is a huge disadvantage when processing. c 2016 Information Processing Society of Japan. Neighbor Vertices 1. 0. 2. Source Vertices. • We show DegAwareRHH processes over 2 billion edge insertion requests per second, using 128 compute nodes on the large-scale real graph with 128 billion edges.. 3. 0. 1. 2. 1. 0. 3. 2. 0. 3. 0. 3. 1. CSR data structure Source Vertex ID Index array. 0. 1. 2. 3. 0. 3. 5. 6. 7. Edge array 1. 2. 3. 0. 3. 0. 0. 1. Index. 1. 2. 3. 4. 5. 6. 7. 0. Fig. 1 CSR Graph Data structure. sparse graph. Compressed Sparse Row (CSR) – CSR data structure is the de-facto graph data structure widely used in various graph processing implementations. The CSR data structure consists of two array structures called an index array and an edge array. The index array holds indices of the edge array, and the edge array holds neighbor vertices’ ID. More specifically, each index in the index array represents a source vertex’s ID, and the corresponding element in the index array refers to an index in the edge array. The range of a vertex’s edges in the edge array is from index[vi ] to index[vi + 1]. The memory size of the CSR data structure is O(n + 1) for the index array and O(m) for the edge array. An overview of the CSR data structure is illustrated in Figure 1. The CSR data structure can provide high data localities and efficient memory usage due to its packed array structure. However, because of packing, the CSR data structure has huge drawbacks on storing dynamic graphs. In general, even when adding or deleting a single edge or vertex, the CSR data structure requires updates to the entire array, causing large data movements. 2.4.2 Dynamic Graph Data Structure Key-Value – The Key-Value Store (KVS) model, in other words a map or dictionary data structure, manages an element as pair of a key and value. The simple method using a KVS model is to store an edge using a pair of source and target vertex as the key and store its property data as the value. To store vertex data, in addition to the edge table, it holds a vertex table consiting of pair of a vertex and vertex property data. However, while such a simple Key-Value Store model can insert and delete a vertex and edge efficiently, there is no consideration of graph topologies. Therefore, it is difficult to obtain locality benefits among edges adjacent to a vertex, which is a highly important factor to determine the performance of many graph analyses. Adjacency-list – The basic components of an adjacency-list are a vertex-table and an edge-list. A vertex-table describes a set of all vertices of a graph. Each element of a vertex-table consists of a pointer pointing to an edge-list and also vertex property data if needed. An edge-list holds the list of neighbors of its vertex, with edge property data. The main advantage of adjacency-list model is it can provide high data locality among out-going edges (or in-coming, depends on user configurations) of each vertex. There are many variations on this model, each with its advan2.

(3) Vol.2016-HPC-153 No.8 2016/3/1. IPSJ SIG Technical Report. tages and disadvantages. For the vertex-table, typically a tree or hash (key-value) data structure is used, and for the edge-list, a single vector or linked-list data structure is used. A tree data structure is widely used in database systems; however, a tree data will structure potentially cause random memory accesses for each operation (such as, insert, find and delete). Even though the time complexity of a tree data structure is O(log(n)), this is too costly for large-scale graphs stored in NVRAM; therefore, we use nearconstant time hashtables.. 3.. 3.1 Insert The core features and characteristics of open addressing hash table is how to deal with hash collisions. The main idea of alternatives in case of a collision of Robin Hood Hashing is performing linear probing to the cyclized probe sequence H(k), H(k) + 1, ..., L − 1, 0, 1, ..., H(k) − 1, where H(k) is hash value of k and L is the length of the table. The length of the probe sequence is called the probe distance. Specifically, RHH tries to move an existing element if a new element hashes into its position, and the probe distance of the existing element is equal or smaller than that of the new element. This strategy attempts to keep the average probe distance of all elements in the table small. An illustration of edge insertion using Robin Hood Hashing is shown in Figure 2. Let {i, j} be an edge between a vertex i and j. In this example, a hash function is h = vid mod C where vid is a ID of a vertex and C is the capacity of a hash table. First, compute a hash value (initial position) of the new edge {1,5} using the hash function. Second, edge {1,0} is moved to the next position since the probe distance of {1,0} is equal or smaller than the one of edge {1,5}. Next, edge {1,2} is moved to the next position and edge {1,0} is inserted in the position (index 2) because probe distance of edge {1,2} is smaller than the one of edge {1,0}. Ideally, the target element is located in a same memory page, although the element is moved to another position because of hash conflict. 3.2 Delete When deleting an element, instead of simply erasing all data. c 2016 Information Processing Society of Japan. 1. 2. 3. 4. {0,-1} {1,-0} {1,-2}. {1,-5}. 00. 10. 11. 0. 1. 2. 5. 6. 7. {5,-6} {6,-5} 50. 60. 5. 6. 1--0 Hash value. Probe distance. 3. {0,-1} {1,-5} {1,-0} {1,-2} 00. 10. 11. page-0. Robin Hood Hashing (RHH). In this section we describe a hash table algorithm which is a key structure of our DegAwareRHH. In order to minimize the number of accesses to NVRAM, resulting in page misses, we choose Robin Hood Hashing [5], because of its locality properties and compactness. Robin Hood Hashing is an open address hashing (without collision chains) and designed to maintain a small average probe distance – the distance between initial (hashed) position and current position of a key. Average probe distance is an important factor to determine the construction cost of the table. We expect that the combination of low probe distance and sequential memory accesses patterns of Robin Hood Hashing will be beneficial for out-of-core processing using NVRAM. Previous work has improved RHH’s performance on various workloads, and also shown that the probe distance can remain small under various load conditions [19] [11]. Now, we describe how Robin Hood Hashing behaves during insert, delete and search operations.. 0. Key. 12. 4. 7. {5,-6} {6,-5} 50. 60. page-1. Fig. 2 Edge insert into Robin Hood Hashing table. (probe distance, key and value) of the element and moving succeeding elements forward, we make the element as deleted by setting a tombstone flag, keeping its existing probe distance. By keeping probe distances of deleted elements, it can be expected to reduce the cost of moving elements when inserting or deleting an element. The main drawback of this approach occurs after many delete operations have been done. After many elements are deleted, an average prove distance will be large, although the number of actual elements is small. In a worst-case, the probe distance may exceed the capacity of a table. To prevent this situation, we periodically re-hash a whole element of a table; erasing tombstones of deleted elements and reducing probe distances of active elements. 3.3 Search A search operation is straight forward: compute a hash value of a target element and probe the target element linearly from the hash value position. If an empty space (not including tombstones), or an element which has larger probe distance than the target is found, fault the search since the target element does not exist in the table.. 4.. DegAwareRHH. In this section, first, we describe the design of DegAwareRHH. Second, we describe its insertion algorithm. Last, we give some optimization techniques that improve its performance. 4.1 Overview DegAwareRHH is designed to target graphs that have following characteristics: 1) extremely-large-scale, scale-free graphs that have at least hundreds of billions ∼ trillions of edges; 2) each vertex and edge has property data or metadata; 3) there are multiple edges between two vertices, called a multigraph. For this type of graph, the key challenges of designing a dynamic graph data structure are: 1) supporting out-of-core memory in order to process large-scale graphs; 2) high performance insertion and deletion of vertices and edges; 3) quickly locating a specific edge matching topological and metadata constraints; 4) providing high locality among out-going edges of a vertex. 4.2 Graph Data Structure Using Robin Hood Hashing DegAwareRHH is based on the adjacency-list data structure model using Robin Hood Hashing. Additionally, it is designed to manage scale-free graphs (with many low degree and a few 3.

(4) Vol.2016-HPC-153 No.8 2016/3/1. IPSJ SIG Technical Report. Insert Edge (u,v). Low-degree table Mid-high-degree table {v1, v2}. v2 p2 w6. w1 v1 p1. w7 w8. w2. p1. w5. w4. {v1, v3}. w1. {v2}. {v4}. p2. p4. d1⟵ L_TBL.degree(u). w2. v4 p4. v3 p3. True. {v4} {v4}. {v1}. w5. w4. w6. Fig. 3. d2 = 0 L_TBL.insert(u,v) True False MH_TBL.insert(L_TBL.pop(u)) L_TBL.insert(u,v) MH_TBL.insert(u,v) MH_TBL.insert(u,v). w8. Overview of NVRAM Specialized Degree Aware Dynamic Graph Data Structure. very high degree vertices), by using two different data structures, depending on the vertex type. 4.2.1 Middle-high-degree Table Locating a specific edge from high degree vertices may be highly costly. Therefore, to quickly locate a specific edge matching topological constraints, we use a hash table for the edge-list instead of using a simple 1D array. This data structure efficiently handles insertion while maintaining locality of access. 4.2.2 Low-degree Table Even though we use the compact hash table, there are some relatively high overhead operations for low-degree vertices, such as allocating a new table and pointer accesses to edge-lists. Accordingly, we use single Robin Hood Hashing table to store lowdegree vertices in order to reduce the costs that are relatively high for low-degree vertices. 4.3 Data Layout An illustration of DegAwareRHH is shown in Figure 3. Depending on the degree of a source vertex, edges are stored in two types of tables: a low-degree table and a middle-high-degree table. The low-degree table stores edges in a single compact table, i.e., directly using Robin Hood Hashing. The middle-high-degree table consists of a vertex table and edge-chunks. The vertex table stores source vertices’ information, i.e., source vertex’s ID and vertex property data if needed. Adjacency edges are stored into edge-chunks and the vertex table holds pointers to edge-chunks. In order to locate a specific edge with a constant time, we also use Robin Hood Hashing for edge-chunks. In our current implementation, only 1 byte of extra space is allocated for each element to construct a Robin Hood Hashing table (7 bits for a probe distance, and 1 bit for a tombstone flag). 4.4 Programming API Like existing popular graph processing frameworks, DegAwareRHH provides basic APIs including: graph construction and update, such as add edge, delete edge and update property data; graph reading APIs, such as get property data and get adjacencylist. The get adjacency-list function return a forward iterator pointing to the adjacency-list. Using C++ templates, our implementation can hold any type of object for key, value and property data. 4.5 Insertion Algorithm An illustration of an edge insertion algorithm of DegAwareRHH is shown in Figure 4. Let (u, v) be an edge to insert where. c 2016 Information Processing Society of Japan. False d2 ⟵ MH_TBL.dgree(u). d1 < low_degree_threshold True False. {v3} {v3} w7. d1 > 0. u: source vertex ID v: target vertex ID L_TBL: low-‐degree table MH_TBL: middle-‐high-‐degree table. max_probedistance < long_probedistance True False Allocate a chain table Finish. Fig. 4 Degree Aware Edge Insertion Algorithm. u is a source vertex ID and v is a target vertex ID. Step 1. query current degree (the number of edges) of vertex u to the low-degree table. Step 2-A. If the degree is more than 0 and less than the lowdegree threshold of the middle-high-degree table, insert the edge into the low-degree table. If the low-degree table already has edges as many as the low-degree threshold, move all edges to the middle-high-degree table. Step 2-B. Otherwise, query current degree of vertex u to the middle-high-degree table. If the degree is 0, that is, vertex u is new vertex, then insert into the low-degree table. Otherwise insert the edge into the middle-high-degree table. Step 3. While inserting an element, if a probe distance of an element exceeds a threshold due to too many hash conflits, the table is grown to double in size. It is well known that a probe distance usually keeps small constant number, and it will rarely happen that a probe distance becomes long. Thus, we believe that the total memory usages won’t be increased dynamically despite growing the table in long probe distance situations. 4.6 Optimization Load Factor: It has been reported that when the table is close to full that it degrades its performance rapidly [5]. To prevent this situation, we double table capacity when the number of elements, not including tombstones, exceeds 90% of its capacity. Rehashing Table: As described in section 3, after a lot of elements are deleted and inserted, probe distance would become quite large. Thus, we check the probe distance each time after insertion is completed, and rehash all elements if the value is larger than the table size. Memory Pool Allocator: Allocating a small size of memory space is an expensive operation. We use the memory pool allocation technique to reduce the cost of individual, small sized memory allocations.. 5.. Extending DegAwareRHH for DistributedMemory. Distributed memory DegAwareRHH has been implemented with MPI to store a distributed graph. We adopt the distributed asynchronous visitor queue[22][23] to be the driver of DegAwar4.

(5) IPSJ SIG Technical Report. eRHH. The visitor queue framework provides the parallelism, and creates a data-driven flow of computation over MPI with asynchronous communication. All visitors are asynchronously transmitted, scheduled, and executed. An illustration of a distributed dynamic graph construction algorithm is shown in Figure 5. Note that, in Figure 5, we only show the communication from Process A to Process B to make the figure concise. Before constructing the graph, we first need to determine an allocation strategy for vertices – i.e., which process is responsible for a given vertex. To determine the owner process of a new edge request eid , consisting of an operation and a source, destination pair, we use Consistent Hashing. A vertices’ owner process is computed as follows: hash(eid .source) mod P, where P is the number of processes, and all processes use the same hash function. Due to this strategy, any process can determine in constant time the owner of a given vertex, and thus the process responsible for the target location of any edge. Each process independently reads graph construction requests (insertion/deletion) and passes them to the visitor queue sequentially. The visitor queue applies the request immediately into the local graph store if it is the owner of the source vertex; however, if the owner of a vertex is a remote process, it pushs the request into its local message queue (wrapped in a visitor object), and send the visitors using asynchronous communication when the queue becomes full. Each process checks its receive queue periodically, and executes received visitors as a visit to the local store, i.e., it applies the graph construction requests into the local process.. 6.. Experiments. In this section, we experimentally evaluate the graph construction performance of our dynamic graph data store (DegAwareRHH). 6.1 Experimental Setup 6.1.1 Implementation We implemented our DegAwareRHH in C++, and used the Boost.Interprocess library to allocate in a memory mapped region. Specifically, we use the Boost.Interprocess memory pool allocator to allocate the hash tables to reduce the cost of many small sized memory allocations. We use memory mapped files as an interface to the NVRAM. For in-core experiments, we create files under /dev/shm to only use DRAM. Based on a preliminary experiment, we set middle-high-degree threshold at 2, that is, vertices with 2 or more edges are stored in the middle-high-degree table. For experimental comparison, we show the performance of the following implementations: • Baseline model – The model consists of a vertex table and edge tables: Vertex table holds source vertices’ ID, property data, and a pointer pointing to an edge table using Boost unordered map container; Edge table holds target vertices’ ID and edges’ property data, using Boost vector container. Boost unordered map consists of multiple buckets where each one is chunk of any number of elements. When deleting a element in a edge table, instead of using Boost vector’s. c 2016 Information Processing Society of Japan. Vol.2016-HPC-153 No.8 2016/3/1. erase function naively, we delete the element by swapping with the last element in the edge table to avoid moving succeeding elements forward after each deletion. • STINGER[12] – STINGER is a shared memory (in-core) parallel dynamic graph processing framework, and its core data structure was developed at the Georgia Institute of Technology. STINGER can update a graph with several times ∼ three orders of magnitude better performance in comparison with state-of-the-art 12 open source graph databases and libraries [18], including SQLite, Neo4j, Giraph, DEX and Boost Graph Library. We used version 06.15. To perform a fair comparison, we compiled the implementations using GCC-4.9 with -O3 optimization option. For Baseline model and DegAwareRHH, we used Boost 1.59.0 and MVAPICH2. 6.1.2 Datasets We show experiments using both synthetic graph models and a real-world graph. • RMAT – Generates scale-free graphs [6], and we follow the Graph500 V1.2 specifications for generator parameters [1]. After graph generation, all vertex labels are uniformly permuted to destroy any locality artifacts from the generators. The graph generator generates a graph as an undirected graph in random order. For a edge (ei , e j ), we also generated the opposite direction of edge (e j , ei ). • WebGraph2012 (hyperlink graph) – We also use a large web graph[3], the largest open source real graph dataset to our knowledge, that has 128 billion edges (as a directed graph). Each vertex corresponds to a web page and each edge is a hyperlink. Each vertex is represented as a 64-bit integer value, thus a edge is a pair of 64-bit integers. In this experiment, we set vertex and edge property data as dummy value (unsigned char, 0), the datasets above do not have property data. To evaluate the performance of the delete operation, we add 5% ∼ 50% of edge deletion requests into edge insertion requests with random order, but with a carefully chosen position such that the insertion request of a edge comes before its deletion request. The datasets are stored in files as a pair of source and vertex IDs. We use source vertex’s ID as the key of the lowdegree table and middle-high-degree table, and target vertex’s ID as the key of the edge-chunk. 6.1.3 Machine Configurations We use the Catalyst cluster at Lawrence Livermore National Laboratory. Catalyst has 12-core Intel(R) Xeon(R) E5-2695v2 processors (2 sockets) and 128 GB of DRAM and is equipped with a node-local PCI-based 800 GB of NAND flash NVRAM. The nodes are connected over dual rail QDR-80 Intel TrueScale fabrics. 6.1.4 Experiment Method In this paper, we evaluate our DegAwareRHH in the context of graph construction performance. We repeated the following steps: • Step 1) each process buffers a subset of edges (1 million) from the file system into DRAM to avoid measuring the cost of reading edges from files. • Step 2) insert or delete edges from the edge buffer into the 5.

(6) Vol.2016-HPC-153 No.8 2016/3/1. IPSJ SIG Technical Report. Process A. Process B Local graph store (DegAwareRHH). Local graph store (DegAwareRHH). Apply to local. Apply to local. Streaming graph construction requests. Graph partitioning function. Message queue. Async comm when queue become full. Message queue Visitor Queue. Push into the local queue. 6.2 Single Node Experiments First, we evaluate graph construction performance of our DegAwareRHH against STINGER and Baseline model on a single node. We show graph construction performance using a RMAT graph on Baseline model, STINGER and our DegAwareRHH in Figure 6. We measure performance for 6, 12, and 24 threads (STINGER) or processes (Baseline & DegAwareRHH). The dataset has 1 billion edge insertion requests and 54 million edge deletion requests. The x axis denotes the number of processed graph construction (edge insertion/deletion) requests in millions. The y axis denotes the cumulated execution time in seconds. All implementations scale with increasing the number of threads/processes from 6 to 24, by 1.9 times on STINGER and Baseline model and by 3.7 times on DegAwareRHH. As the size of a constructed graph increases, the execution time of each chunk step on STINGER and Baseline model increases progressively. On the other hand, our DegAwareRHH results in near-liner scaling and can insert over 1 billion edges in 50 seconds with 24 processes. Our DegAwareRHH outperforms the other implementations, specifically, 206.5 times faster than STINGER and 23.5 times faster than Baseline model with 24 threads/processes. The differences of the results come from differences of the data structures. STINGER stores a graph using adjacency-list model, like our DegAwareRHH; however, it stores edges into a linkedblock-list (each linked-block has multiple edges) data structure, and it require a sequential search to find a target edge or an empty space for a new edge. Thus, the time to insert an edge (i, j) increases as the out-degree of i increases. The Baseline model stores edges into a vector container, and it also requires sequential access to find a target edge; however, different from STINGER, a new edge is always inserted into the last position of the array. Due to this, it performs better than STINGER. On the other hand, as our DegAwareRHH uses hash table for edge tables, the execution time can be constant even though the number of inserted edges increases. Second, we evaluate graph construction performance while varying the number of edge deletion requests on the three im-. c 2016 Information Processing Society of Japan. STINGER (th=6) STINGER (th=12) STINGER (th=24). 105. Baseline (th=6) Baseline (th=12) Baseline (th=24). DegAwareRHH (th=6) DegAwareRHH (th=12) DegAwareRHH (th=24). 104 103 102 101 100. 102. The Number of Applied Edges Requests (million). 103. Fig. 6 Unique Edge Insertion and 5% of Edge Deletion (RMAT 25, in-core, single-node); lower is better.. The Number of Processed Requests per Second (edge insertion and deletion). graph data store sequentially. All edges are uniquely inserted, that is, all insertion operations are performed following a find operation. We only measured the execution time of step 2.. Cumulated Execution Time (sec.). Fig. 5 Distributed Dynamic Graph Construction over the Visitor Program. 108. STINGER. Baseline. DegAwareRHH. 107. 106. 105. Fig. 7. 5. 10. 20 30 Added Deletion Rate. 40. 50. Unique Edge Insertion and Deletion (RMAT 25, in-core, singlenode). Higher is better; note log y-axis scale.. plementations with 24 threads/processes. The results are shown in Figure 7. We changed the rate of added edge deletion requests as 5, 10, 20, 30, 40 and 50. The x axis denotes the rate, and the y axis denotes the number of processed graph construction requests per second in log scale. As the number of deletion requests increases, the performance of STINGER and Baseline model increases since the cost of finding a edge decreases due to the size of edge table becoming smaller. In contrast, DegAwareRHH slow downs slightly as the number of deletion requests increase (by 5% when comparing 5% and 50% rate). Nevertheless, DegAwareRHH still outperforms the other implementations, e.g., 121x faster than STINGER and 16x faster than Baseline model on the 50% of edge deletion requests case. 6.

(7) IPSJ SIG Technical Report. Fig. 8. In-core unique edge insertion (24 processes per node, 4.3 billion edges per node).. 6.3 Multiple Node Experiments We evaluate graph construction performance of our DegAwareRHH against Baseline model on multiple nodes. Note that we don’t use STINGER in this experiment, since it only supports a shared memory environment. 6.3.1 Weak Scaling To explore how our DegAwareRHH scales when multiple processes run on multiple compute nodes, we first perform a weak scaling experiment which increases the number of compute nodes while fixing the size of the sub-graph per node. Each compute node constructs a subgraph which has 134 million vertices and over 2 billion undirected edges, thus, the actual number of inserted edges is over 4 billion. We increase the number of compute nodes up to 64 with 24 processes per node, therefore, at 64 nodes, the graph has 86 billion vertices and 275 billion edges that are inserted. In this experiment, we also evaluate performance of nonunique edge insertions since Baseline model has a huge overhead to find an edge to uniquely insert it. We perform four experiment scenarios: unique edge insertion (Figure 8); non-unique edge insertion (Figure 9); unique edge insertion and deletion (Figure 10); non-unique edge insertion and deletion (Figure 11). We added 5% of deletion requests in the deletion scenarios. In the 4th scenario, only a single edge is deleted per a deletion requet. Note that except the non-unique insertion workload, we halted the experiments on Baseline at more than 8 compute nodes before finishing due to excessive run times. First of all, DegAwareRHH scales as the number of compute nodes increases and outperforms Baseline on the 3 scenarios; it achieves a processing of 800 million requests per second at 64 nodes the unique insertion workload for instance. On the non-unique insertion workload (Figure 9), Baseline model overperforms DegAwareRHH by 29.8% at 64 nodes. This result can be attributed to the edge insertion algorithm of Baseline model that simply inserts a edge at the last position of an edge vector without checking for existence of the edge. On the non-unique insertion and deletion workload (Figure 11), even though edges are inserted without checking for existence, Baseline model slows down significantly when only. c 2016 Information Processing Society of Japan . Vol.2016-HPC-153 No.8 2016/3/1. Fig. 9. In-core non-unique edge insertion (24 processes per node, 4.3 billion edges per node).. Fig. 10. In-core unique edge insertion and 5% of deletion (24 processes per node, 4.3 billion edges per node).. Fig. 11. In-core non-unique edge insertion and 5% of deletion (24 processes per node, 4.3 billion edges per node).. adding 5% of deletion requests. 6.3.2 Strong Scaling To evaluate the performance of strong scaling on a realistic workload, we perform graph construction experiments on a large7.

(8) Vol.2016-HPC-153 No.8 2016/3/1. IPSJ SIG Technical Report. Baseline DegAwareRHH The Number of Processed Requests per Second. 350 300 Execution Time (sec.). 108. 250 200. 100. DegAwareRHH. 107 106 105 104. 150. Baseline. In-core sorted. Out-of-core sorted. In-core random. Out-of-core random. Fig. 13 Out-of-core unique edge insertion (higher is better, log y-axis).. 50 0. 0. 20. 40 60 80 100 120 140 The number of nodes. Fig. 12 In-core unique edge insertion (strong scaling).. scale real-world graph (WebGraph2012). Both implementations can hold the graph on 32 nodes at a minimum; we increased the number of nodes as 32, 64 and 128. Both implementations scale with increasing the number of nodes, up to 4.1 times on Baseline model and 4.3 times on DegAwareRHH. As the degree of vertices in the graph is smaller than that of the RMAT graphs, the performance gaps between Baseline model and DegAwareRHH shrink. DegAwareRHH outperforms Baseline model by 44.29% at 128 nodes. In addition, we found that the vertex tables of both implementations become small with increasing the number of processes, improving performance due to greater data locality. 6.3.3 Out-of-core Experiments Finally, we perform an experimental evaluation of out-of-core graph construction performance. We set up the experimental evaluation by reordering the edge stream into a sorted order, and we regard this as the best performance scenario because the edges from a same vertex are inserted continuously into a same edgelist and it contributes to data localities. In contrast, we regard the random ordered datasets as a worse performance scenario. We construct RMAT graphs with 4.3 billion edge insertions for the in-core scenario, and 17 billion edge insertions for the out-ofcore scenario. Because out-of-core experiments with the Baseline model require excessive run time, we show estimated performances based on a performance after it exceeds DRAM size but before finishing. Note that the Baseline model slows down with increasing the size of construction graph; therefore, the estimated performances denote better performance than its actual performance. We show the results in Figure 13. In sorted-order scenarios, Baseline model slow down by 58.8% when compare the in-core scenario against to the out-of-core. On the other hand, out-of-core DegAwareRHH can process a 4x larger graph than the in-core scenario with only a 39% performance degradation and keep high graph construction performances over 27 million processing per second. In randomized-order scenarios, although DegAwareRHH slows down considerably, it still achieves processing 0.2 million edge insertion requests.. c 2016 Information Processing Society of Japan. 7.. Related Work. Many dynamic graph data structures have been proposed based on a tree data structure [26]. However, a tree data structure takes O(log(n)) operations and tends to incur more random accesses than Robin Hood Hashing, a critical matter in large-scale and/or out-of-core graph processing. Furthermore, several studies have been conducted on out-of-core large-scale static graph processing [21] [17] [24] [16]. However, little is known about large-scale and/or out-of-core dynamic graph data storage. Several studies have reported that a graph database is slow and has huge memory footprint [18] [14]. A key-value store is a popular database model and is designed to easily scale to a very large size [7]. However, since key-value store doesn’t consider the topology of a graph, the locality properties on graph analysis workloads are low. Another approach is to accelerate I/O performance itself: maximizing random I/O performance by using multithreading [15], to effectively utilize storage bandwidth by conducting sequence accesses called Parallel Sliding Windows (PSW) method [17], or by using an edge-centric rather than a vertex-centric computation model [24]; combining multiple NVRAM devices to improve bandwidth and IOPS performance without using expensive NVRAM devices [29] [25].. 8.. Conclusion. We extended DegAwareRHH, a high performance dynamic graph data store, for distributed-memory on the visitor queue framework. We demonstrated DegAwareRHH processes 206.5 times faster than STINGER, a state-of-the-art shared-memory graph processing framework with 24 threads/processes on a single node to construct a graph where there are 1 billion edge insertion requests and 54 million edge deletion requests. We confirmed that DegAwareRHH preserves high graph construction performance on all of our graph construction workloads, unique or non-unique insertions or deletions, even though the size of graphs increase owing to using a hash table for its edgelist in contrast to STINGER and Baseline models, which are using simple 1D type array for their edge-list. DegAwareRHH also achieves a processing rate of over 2 billion edge insertion requests per second at 128 nodes, on a large-scale real graph with 128 billion edges. 8.

(9) Vol.2016-HPC-153 No.8 2016/3/1. IPSJ SIG Technical Report. 9.. Acknowledgments. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344 (LLNL-CONF-679551). Funding was partially provided by LDRD 13-ERD-025. Experiments were performed at the Livermore Computing facility.. [22]. [23]. [24]. References [1] [2] [3] [4] [5] [6]. [7]. [8]. [9] [10] [11] [12] [13]. [14]. [15] [16]. [17]. [18]. [19] [20]. [21]. : Graph500, http://www.graph500.org/. Barabási, A.-L. and Albert, R.: Emergence of scaling in random networks, science, Vol. 286, No. 5439, pp. 509–512 (1999). Broder, A., Kumar, R., Maghoul, F., Raghavan, P., Rajagopalan, S., Stata, R., Tomkins, A. and Wiener, J.: Graph Structure in the Web, Comput. Netw., Vol. 33, No. 1-6, pp. 309–320 (2000). Buluç, A. and Madduri, K.: Parallel breadth-first search on distributed memory systems, Supercomputing (2011). Celis, P.: Robin Hood Hashing, PhD Thesis, Waterloo, Ont., Canada, Canada (1986). Chakrabarti, D., Zhan, Y. and Faloutsos, C.: R-MAT: A Recursive Model for Graph Mining., in Proceedings of the Fourth SIAM International Conference on Data Mining. Society for Industrial Mathematics, pp. 442–446 (2004). Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M., Chandra, T., Fikes, A. and Gruber, R. E.: Bigtable: A distributed storage system for structured data, ACM Transactions on Computer Systems (TOCS), Vol. 26, No. 2, p. 4 (2008). Checconi, F., Petrini, F., Willcock, J., Lumsdaine, A., Choudhury, A. R. and Sabharwal, Y.: Breaking the speed and scalability barriers for graph exploration on distributed-memory machines, Supercomputing (2012). Ching, A., Edunov, S., Kabiljo, M., Logothetis, D. and Muthukrishnan, S.: One Trillion Edges: Graph Processing at Facebook-scale, Proc. VLDB Endow., Vol. 8, No. 12, pp. 1804–1815 (2015). Compeau, P. E., Pevzner, P. A. and Tesler, G.: How to apply de Bruijn graphs to genome assembly, Nature biotechnology, Vol. 29, No. 11, pp. 987–991 (2011). Devroye, L., Morin, P. and Viola, A.: On worst-case robin hood hashing, SIAM Journal on Computing, Vol. 33, No. 4, pp. 923–936 (2004). Ediger, D., McColl, R., Riedy, J. and Bader, D.: STINGER: High performance data structure for streaming graphs, High Performance Extreme Computing (HPEC), 2012 IEEE Conference on, pp. 1–5 (2012). Gonzalez, J. E., Low, Y., Gu, H., Bickson, D. and Guestrin, C.: PowerGraph: Distributed Graph-parallel Computation on Natural Graphs, Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation, OSDI’12, Berkeley, CA, USA, USENIX Association, pp. 17–30 (2012). Goonetilleke, O., Sathe, S., Sellis, T. and Zhang, X.: Microblogging Queries on Graph Databases: An Introspection, Proceedings of the GRADES’15, GRADES’15, New York, NY, USA, ACM, pp. 5:1–5:6 (2015). Hendrickson, B. and Berry, J.: Graph Analysis with HighPerformance Computing, Computing in Science Engineering, Vol. 10, No. 2, pp. 14–19 (2008). Iwabuchi, K., Sato, H., Yasui, Y., Fujisawa, K. and Matsuoka, S.: NVM-based Hybrid BFS with memory efficient data structure, Big Data (Big Data), 2014 IEEE International Conference on, pp. 529– 538 (2014). Kyrola, A., Blelloch, G. and Guestrin, C.: GraphChi: Largescale Graph Computation on Just a PC, Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation, OSDI’12, Berkeley, CA, USA, USENIX Association, pp. 31–46 (2012). McColl, R. C., Ediger, D., Poovey, J., Campbell, D. and Bader, D. A.: A Performance Evaluation of Open Source Graph Databases, Proceedings of the First Workshop on Parallel Programming for Analytics Applications, PPAA ’14, New York, NY, USA, ACM, pp. 11–18 (2014). Mitzenmacher, M.: A New Approach to Analyzing Robin Hood Hashing, CoRR, Vol. abs/1401.7616 (2014). Nai, L., Xia, Y., Tanase, I. G., Kim, H. and Lin, C.-Y.: GraphBIG: Understanding Graph Computing in the Context of Industrial Solutions, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’15, New York, NY, USA, ACM, pp. 69:1–69:12 (2015). Pearce, R., Gokhale, M. and Amato, N.: Multithreaded Asynchronous Graph Traversal for In-Memory and Semi-External Memory, High Performance Computing, Networking, Storage and Analysis (SC),. c 2016 Information Processing Society of Japan. [25]. [26]. [27] [28] [29]. 2010 International Conference for, pp. 1–11 (2010). Pearce, R., Gokhale, M. and Amato, N.: Scaling Techniques for Massive Scale-Free Graphs in Distributed (External) Memory, Parallel Distributed Processing (IPDPS), 2013 IEEE 27th International Symposium on, pp. 825–836 (2013). Pearce, R., Gokhale, M. and Amato, N. M.: Faster Parallel Traversal of Scale Free Graphs at Extreme Scale with Vertex Delegates, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’14, Piscataway, NJ, USA, IEEE Press, pp. 549–559 (2014). Roy, A., Mihailovic, I. and Zwaenepoel, W.: X-Stream: Edge-centric Graph Processing Using Streaming Partitions, Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, SOSP ’13, New York, NY, USA, ACM, pp. 472–488 (2013). Sato, K., Mohror, K., Moody, A., Gamblin, T., de Supinski, B., Maruyama, N. and Matsuoka, S.: A User-Level InfiniBand-Based File System and Checkpoint Strategy for Burst Buffers, Cluster, Cloud and Grid Computing (CCGrid), 2014 14th IEEE/ACM International Symposium on, pp. 21–30 (2014). Sleator, D. D. and Tarjan, R. E.: A Data Structure for Dynamic Trees, Proceedings of the Thirteenth Annual ACM Symposium on Theory of Computing, STOC ’81, New York, NY, USA, ACM, pp. 114–122 (1981). Yoo, A., Baker, A., Pearce, R. and Henson, V.: A scalable eigensolver for large scale-free graphs using 2D graph partitioning, Supercomputing, pp. 1–11 (2011). Zerbino, D. R. and Birney, E.: Velvet: algorithms for de novo short read assembly using de Bruijn graphs, Genome research, Vol. 18, No. 5, pp. 821–829 (2008). Zheng, D., Burns, R. and Szalay, A. S.: Toward Millions of File System IOPS on Low-cost, Commodity Hardware, Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC ’13, New York, NY, USA, ACM, pp. 69:1–69:12 (2013).. 9.

(10)