Chapter 4 Offloading with a naive approach: DS-CUDA case
This technique is well known among GPGPU experts in CUDA programming, because copying data from the CPU to the GPU is an expensive, time-consuming operation.
Next, we report the numbers from the CPU implementation. In this case, OpenMP is used to compute the force between the particles. The outcome of this experiment is plotted in Figure 4.10 A). The notebook reaches 0.41 frames/sec for a large number of particles.
The Jetson K1 and SHIELD tablet achieve only 0.026 and 0.011 frames/sec, respectively, for 1,728 particles. For a smaller number of particles, using the CPU as a force accelerator is the best visualization option because it avoids the communication bottleneck between the CPU and GPU. However, for a larger number of particles, computing the force between atoms becomes the bottleneck. In this case, the GPU becomes the optimal solution.
The effects of communication can be observed in Figures 4.9 and 4.10. Note that the communication frequency is reduced to 1/10 when Step = 100 is used, compared with Step = 10. The DS-CUDA performance is low for a small number of particles because of network overhead. Nevertheless, it was possible to hide this latency by increasing the number of steps in the MD simulation to keep the GPU busy on the server side.
These results show that the number of steps per system update directly affects the frame rate. For Step = 10, the DS-CUDA system reached more than 19 frames/sec. However, increasing the number of steps to 100 directly affects the rendering time. Importantly, the Jetson K1 could not handle more than 3 frames/sec for the larger number of particles. This was owing to a combination of fewer flops and poorer rendering performance compared with the tablet-notebook combination.
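The trade-off between communication frequency and frame rate can be sketched with a simple cost model. The following is a minimal illustration, not the measurement code used in this work: it assumes one client-server exchange per rendered frame, and the function names and all timing values are hypothetical.

```python
def round_trips(total_steps, steps_per_frame):
    """Number of client-server exchanges needed to run `total_steps` MD steps."""
    return -(-total_steps // steps_per_frame)  # ceiling division


def frame_time(steps_per_frame, t_step, t_comm, t_render):
    """Seconds per frame: one exchange, `steps_per_frame` MD steps, one draw."""
    return t_comm + steps_per_frame * t_step + t_render


def comm_fraction(steps_per_frame, t_step, t_comm, t_render):
    """Share of each frame that is spent on communication."""
    return t_comm / frame_time(steps_per_frame, t_step, t_comm, t_render)


if __name__ == "__main__":
    # Hypothetical timings in seconds, chosen only for illustration.
    T_STEP, T_COMM, T_RENDER = 0.0002, 0.005, 0.010
    for steps in (10, 100):
        t = frame_time(steps, T_STEP, T_COMM, T_RENDER)
        print(
            "Step =", steps,
            "| exchanges per 1000 steps:", round_trips(1000, steps),
            "| frames/sec:", round(1.0 / t, 1),
            "| communication share:",
            round(comm_fraction(steps, T_STEP, T_COMM, T_RENDER), 3),
        )
```

Under these assumed values, raising the number of steps per frame from 10 to 100 cuts the number of exchanges to one tenth and lowers the share of each frame spent on communication, but each frame takes longer to produce, so the frame rate drops.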
Section 4.4 Conclusion
We presented an MD simulation and visualization of a system with several hundred particles. To increase the size of the simulation, an approach similar to that of previous DS-CUDA implementations could be followed. It has been shown that DS-CUDA can be used in a multiple-GPU environment for MD simulation. Nevertheless, latency and communication between nodes could become a bottleneck in our proposed system.
Our heterogeneous system proved to be suitable for executing an interactive molecular dynamics simulation. Using the DS-CUDA virtualization framework, only the kernels for intensive computation are offloaded to the server side. Mobile devices are not expected to perform intensive computations because of their limited battery life and low-powered CPUs. However, cloud computing and similar systems such as ours are an interesting way to make more computational power available to mobile devices.
Chapter 5 Reducing communication latency through Dynamic Parallelism: rCUDA case
Interactive modeling, such as interactive Molecular Dynamics (MD) simulations [100, 101], enables the artificial acceleration of simulations through manual interaction. Mobile devices are suitable for such simulations because they have touch capability and multiple sensors.
Nevertheless, mobile devices require more computational power to deliver the best user experience for such computationally intensive tasks, because simulations like these are characterized by high frame rates and processor-intensive routines.
Cloud computing provides the ability to connect remotely to other machines and use accelerators such as GPUs. In a cloud environment, virtualization tools such as GVirtuS [102], Shadowfax [103], GPUvm [104], MGP [105], vCUDA [106], GridCuda [107], DS-CUDA [108, 11], and rCUDA [109, 110] have been proposed for using remote GPUs. These tools and frameworks can drive remote GPUs to accelerate applications in a cloud environment. In particular, rCUDA has proven to be a reliable, simple, and up-to-date solution for handling remote GPUs [111, 112, 113, 114]. In the previous chapter, we used DS-CUDA to connect an Android tablet to a remote GPU in a notebook. However, as we will see in Section 5.5, the performance delivered by rCUDA surpasses that of DS-CUDA.
We analyze the computing, rendering, and power efficiency when the rCUDA framework is used to accelerate computations on a mobile device by offloading most of its intensive computations to a notebook equipped with a low-power GPU. Despite their great acceleration and high performance, desktop GPUs are considered non-green computing solutions since they consume around 270 W [115], substantially more than the low-power GPUs present in notebook computers, which are on the order of 170 W. This is because notebook GPUs are designed for energy efficiency and low power use [10, 116].
In contrast to the previous chapter, here we present a performance evaluation of a heterogeneous system composed of a non-CUDA-capable tablet device and a notebook powered by a low-power GPU. We show the effectiveness of using GPGPU techniques such as Dynamic Parallelism (DP) to reduce the kernel call latency. We also investigate the use of a server/client scheme. Moreover, we show the possibility and outcome of increased power efficiency when several clients are used.
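To illustrate why device-side launches can reduce the kernel call latency, the sketch below compares two launch patterns with a simple cost model. It is not the implementation evaluated in this chapter: it assumes that, without DP, the client issues one remote kernel launch per MD step, and that, with DP, a single remotely launched parent kernel issues the per-step child launches on the GPU itself. All names and timing values are hypothetical.

```python
def host_driven_time(steps, t_launch_remote, t_kernel):
    """Total time when the client launches one kernel per MD step over the network."""
    return steps * (t_launch_remote + t_kernel)


def device_driven_time(steps, t_launch_remote, t_launch_device, t_kernel):
    """Total time when one remote parent launch issues `steps` child launches on the GPU."""
    return t_launch_remote + steps * (t_launch_device + t_kernel)


def remote_launches_saved(steps):
    """Remote kernel launches avoided by the device-driven pattern."""
    return steps - 1


if __name__ == "__main__":
    # Hypothetical timings in seconds, chosen only for illustration.
    STEPS = 100
    T_REMOTE, T_DEVICE, T_KERNEL = 1e-3, 1e-5, 1e-4
    host = host_driven_time(STEPS, T_REMOTE, T_KERNEL)
    device = device_driven_time(STEPS, T_REMOTE, T_DEVICE, T_KERNEL)
    print("host-driven:  ", host, "s")
    print("device-driven:", device, "s")
    print("remote launches saved:", remote_launches_saved(STEPS))
```

In this model the remote launch cost is paid once per frame instead of once per step, so the benefit grows with the number of steps per frame and with the network latency.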
Several proposals implement a paradigm similar to ours. Fatica et al. [25] implemented a synthetic aperture radar imaging application using a Tegra K1, which has a CUDA-capable GPU. In both cases, the GPU implementation achieved speed improvements over the CPU implementation. However, the authors did not report any results on performance per watt or battery life. Heungski et al. [28] and Kemp et al. [27] conducted sets of tests similar to ours. The main difference between our approach and that of Heungski et al. is the API used for offloading. They chose OpenCL because it is open source and covers more devices for offloading, whereas we use CUDA because its presence in HPC is clear [99] and rCUDA is able to handle CUDA code. Kemp et al. also used rCUDA to offload intensive computations from mobile devices. Our proposal is related to theirs in the sense that both claim speed gains when heavy parts are offloaded for certain applications. Additionally, both proposals present results on energy usage. However, their study shows that, for an exposure fusion algorithm on images, offloading leads to neither faster execution nor lower power consumption. The main reason is that they used the CPU on the client side (tablet) for image compression in order to reduce the amount of data sent to the remote GPU. We tackled the communication problem in a different way: we implemented Dynamic Parallelism to reduce the number of GPU kernel calls. Also, they consider only client-side power consumption, whereas we include both client- and server-side consumption in our performance-per-watt measurements. Furthermore, we examine the power efficiency for combinations of multiple clients. Another study related to low-power systems is that of Reaño et al. [30], who investigated the performance of rCUDA on a combination of low-power CPUs such as ARM, Atom, and Xeon D.
They used the GROMACS package to conduct MD simulations and concluded that the acceleration and handling of the virtual GPUs by the Xeon D processor was superior to that of the ARM or Atom. However, they did not present any power consumption results. Montella et al. [29] proposed offloading heavy computations from an ARM cluster (client) composed of 3 NVIDIA Jetson TK1 boards to a remote TITAN X GPU (server) using the GVirtuS framework. Although the Jetson TK1 contains an SoC with a CUDA-capable GPU, they offloaded matrix multiplications of several sizes to the server and compared the results against local execution, claiming 54 gain in performance when offloading. Furthermore, they report that latency becomes negligible as the problem size increases. Despite the similarity to our proposal, this study does not tackle a real-time application that includes several memory copy functions or kernel calls. Moreover, they do not include power metrics for the server or the client, even though GPUs such as the TITAN X are very power-hungry, consuming around 250 W. In our study, we selected a low-power client and server because we want to squeeze every Gflop/Watt out of the system.
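As a minimal sketch of the difference between client-only and whole-system power accounting, the function below divides the sustained performance by the combined power draw of the clients and the server. The function name, the assumption that several identical clients share one server, and all numeric values are hypothetical and serve only as an illustration.

```python
def gflops_per_watt(gflops, client_watts, server_watts, n_clients=1):
    """Performance per watt of the whole system (all clients plus the server)."""
    total_watts = n_clients * client_watts + server_watts
    if total_watts <= 0:
        raise ValueError("total power must be positive")
    return gflops / total_watts


if __name__ == "__main__":
    # Hypothetical values: 100 Gflops sustained, 5 W per client, 45 W server.
    GFLOPS, CLIENT_W, SERVER_W = 100.0, 5.0, 45.0
    print("client only:      ", GFLOPS / CLIENT_W, "Gflops/W")
    print("client and server:", gflops_per_watt(GFLOPS, CLIENT_W, SERVER_W), "Gflops/W")
```

Counting only the client power makes the efficiency look much higher than it is for the system as a whole, which is why both sides are included here.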
This chapter is organized as follows. Sections 5.1 and 5.2 provide insight into how the strategy for implementing DP was chosen so that the results can be compared with those of the previous chapter. Section 5.3 provides a brief description of rCUDA, and Section 5.4 describes each component of the test system. In Section 5.5, we present the results obtained from a series of tests. Finally, in Section 5.6, we discuss and summarize the contents.