Accelerating Hilbert-Huang Transform using GPU

全文

(1)Vol.2010-HPC-126 No.3 2010/8/3 IPSJ SIG Technical Report. Accelerating Hilbert-Huang Transform using GPU Pulung Waskito,†1 Shinobu Miwa,†1 Yasue Mitsukura†1 and Hironori Nakajo. EMD. HSA. Input data. †1. IMFs. In this paper, we show parallel implementation of Hilbert-Huang Transform on GPU. This implementation focused on the reducing the computation complexity from O(N ) on a single CPU to O(log(N )) on GPU, as well as the use of shared-global switching method to increase performance. Evaluation results show our single GPU implementation using Tesla C1060 achieves speedup of 27.7 in best case, and a total speedup of 4.7 for all results when compared to a single Intel dual core CPU.. Time-Frequency domain. Fig. 1 Hilbert-Huang transform (HHT). times speedup for all results when compared to a typical general purpose CPU’s execution time. The rest of this paper is organized as follows. Section 2 summarizes the bottleneck parts of HHT. Section 3 describes the details of our parallelization strategy for HHT using CUDA. Experimental results are analyzed in Section 4. Finally, Section 5 summarizes our work and concludes the paper.. 1. Introduction In the field of signal processing, it is common to analyze complex sets of data, e.g: harmonic signals, stock price fluctuation, and yearly average temperature. These complex data are so called non-stationary data, where the frequency changes over time. Traditionally, non-stationary data is analyzed using ShortTime Fourier Transform (STFT). However, because STFT has its own drawback, i.e. there exists a trade-off between frequency and time resolution, an adequate analysis cannot be performed. In recent years, new methods have been introduced to overcome such disadvantage1)2) . One of such methods is Hilbert-Huang Transform (HHT)1) . HHT is designed as an adaptive, empirically based data-analysis method for non-stationary process. On the other hand, HHT needs a large amount of computation. As discussed later, such complexity comes from the algorithm having two nested loops structure with the inner loop having an O(N ) complexity. As a result, it takes hours to analyze even for a short 3 seconds harmonic signal. We take advantage of CUDA to develop a parallel model of HHT. As a result, our implementation has achieved a 27.7 times speedup in best case, and 4.7. 2. The Hilbert-Huang transform As shown in Figure 1, HHT consists of two different algorithms. One is called Hilbert Spectral Analysis (HSA), which computes the instantaneous frequency from the data. However, HSA cannot be used on data where multiple frequencies are mixed. Therefore, the other algorithm, Empirical Mode Decomposition (EMD) acting as a preprocessor is applied to the input data prior to HSA. Although the details are omitted here, the workload of HSA is substantially smaller compared to the workload of EMD. Additionally, the majority of HSA process consists of applying Fast Fourier Transform (FFT) and its inverse to the data over the frequency domain. The parallelization of FFT and its inverse have been studied extensively over the years, and there have been previous works concerning the FFT implementation in CUDA3) 4) . Therefore, this paper focuses only on the parallelization and implementation of EMD in CUDA, where no previous work has been done before. The EMD procedure is explained in more details as follows.. †1 Tokyo University of Agriculture and Technology. 1. c 2010 Information Processing Society of Japan .

(2) Vol.2010-HPC-126 No.3 2010/8/3. IPSJ SIG Technical Report IPSJ SIG Technical Report. Amplitude. 10. start r ← input. Data Envelope Envelope Mean : m1. h←r. 0. h ← h1. -10 200. 250. 300. 350. 400. 450. identify extrema. (a). calculate envelope. (b). 500. Time: second. update residue r ← r – h1. Fig. 2 Envelope and mean calculation. generate protoIMF h1 (c) NO. h1 = IMF?. YES. 2.1 EMD overview The basic idea of the Empirical Mode Decomposition is a simple assumption that any data consists of different simple intrinsic modes of oscillation1) . At any given time, the data may have many different coexisting modes of oscillation one superimposing on the others. The result is the final complicated data. Each of these oscillatory modes is represented by an intrinsic mode function (IMF) with the following definition: • in the whole dataset, the number of extrema and the number of zero-crossings must either equal or differ at most by one, and • at any point, the mean value of the envelopeiFigure 2jdefined by the local maxima and the envelope defined by the local minima is zero. To decompose the input signal f (t) into IMFs, the EMD algorithm is shown as a flowchart in Figure 3. First, identify all the local extrema (step a), then connect all the local maxima by a cubic spline line5) to produce the upper envelope (step b). Repeat the procedure for the local minima to produce the lower envelope (step b). The upper and lower envelopes should cover all the data between them (see Figure 2). Their mean is designated as mi (t), and the difference between the data and mi (t) is calculated as hi (t) = si (t) − mi (t)(step c). Unfortunately, hi (t) does not yet satisfy the definition of an IMF. Such function is called a protoIMF. The same procedure is simply executed to hi (t) during the inner loop process until it reaches the definition of IMF. After repeated sifting, hi (t) becomes an IMF. Overall, hi (t) should contain the. YES. extrema exist?. NO. end Fig. 3 EMD algorithm. finest scale of the signal. Then hi (t) can be separated from the rest of the data by ri (t) = si (t) − hi (t). Since the residue ri (t) still contains longer frequency variations in the data, it is treated as the new data and subjected to the same sifting process as described above. This procedure can be repeated with all the subsequent ri (t), until the residue ri (t) becomes a monotonic function from which no more IMFs can be extracted. By definition, the relation between the input data f (t) and IMFs hi (t) is as follows: f (t) =. n . hi (t) + r(t). (1). i=1. As described below, the complexity of each of the inner loop procedure is shown to be O(N ). 2.2 Extrema identification Each extrema point (maxima and minima) can be identified by comparing the value of input signal si with the previous and behind value in sequence. 2. c 2010 Information Processing Society of Japan .

(3) Vol.2010-HPC-126 No.3 2010/8/3 IPSJ SIG Technical Report 0. 1. 2. 3. 4. 5. 6. 7. 2. 5. 6. 3. 8. 5. 9. 4. max_x. 2. 4. 6. max_y. 6. 8. 9. h. The cubic splines are especially practical because the set of Equation (3) can be arranged in a tridiagonal matrix just as shown below. ⎡ ⎤⎡ ⎤ ⎡ ⎤ b 1 c1 0 y r ⎢a b ⎥ ⎢ 1 ⎥ ⎢ 1 ⎥ c ⎢ 2 ⎥ ⎢y2 ⎥ ⎢ r2 ⎥ 2 2 ⎢ ⎥⎢ ⎥ ⎢ ⎥ .. ⎢ ⎥ ⎢y ⎥ ⎢ r3 ⎥ . (4) a3 b 3 ⎢ ⎥⎢ 3⎥ = ⎢ ⎥ ⎢ ⎥⎢ . ⎥ ⎢ . ⎥ .. .. ⎢ ⎥ ⎣ .. ⎦ ⎣ .. ⎦ . . cn−1 ⎦ ⎣ yn rn 0 an bn These equations can be solved in O(n) operations by the Tridiagonal Matrix Algorithm (TDMA)5) . TDMA is a simpler form of Gaussian elimination method, where if the equations can be represented as a tridiagonal matrix, the solution can be obtained by just one forward sweep and one backward elimination. Once yi are obtained, simply applying Equation (2) brings the resulf of the envelope. 2.4 protoIMF generation After the upper envelope and lower envelope are calculated, finding the mean value hi can be calculated in O(N ) operations. The C code snapshot of the operation is shown below. for( i = 0; i<N; i++ ){ m[i] = (upper_env[i] + lower_env[i])/2.0; h1[i] = h[i]-m[i]; }. Fig. 4 Extrema Padding. i = 0, · · · , N − 1. Since this procedure sequentially compares each values in the data, the complexity can be expected to be O(N ). The x and y value of each maxima (minima) is then stored in new arrays. After this procedure, separate arrays of maxima and minima set are created which are used in the following envelope calculation, shown in Figure 4. This figure shows a case for maxima identification, with the gray part representing the identified maxima. 2.3 Envelope calculation Cubic Spline Interpolation is used to interpolate the interval between extremas creating an envelope. The essential idea of cubic spline interpolation is as follows. Given a set of extremas (xj , yj ), j = 1, · · · , n, the cubic spline curve (ei ) connecting them can be calculated using the following equation, ei = Ayj + Byj+1 + Cyj + Dyj+1 (2) where xj < i < xj+1 , and yj as the second derivatives of each xj . Each A, B, C and D can be calculated using xj and xj+1 respectively6) . In a practical use, since the values of each yi are unknown, they need to be specified before calculating Equation (2). The key idea of a cubic spline requires the first derivative be continuous across the boundary between two intervals (yj from interval xj−1 and xj is the same as yj from interval xj and xj+1 ). After taking the derivatives of Equation (2) and rearranging it, the following equation is obtained and can be used to calculate a unique set of yi values6) .. 3. Parallelization strategy for EMD In this section, the details on our proposed parallelization strategy for EMD are explained. 3.1 Shared memory usage and its problem A common implementation in CUDA is to split the tasks to group of thread blocks, where each block is executed in parallel by each Streaming Multiprocessor (SM) in GPU. Each SM has a small on-chip shared memory that can be accessed very quickly by every thread in the same block. Maximizing the use of this shared memory is the focus in our parallelization strategy. To use the shared memory fully, the tasks are divided such that each task fits in. ayj−1 + byj + cyj+1 =r (3) where a, b, c and r are also calculated using xj , xj−1 and xj+1 respectively5) .. 3. c 2010 Information Processing Society of Japan .

(4) Vol.2010-HPC-126 No.3 2010/8/3 IPSJ SIG Technical Report. each SM’s shared memory. Ideally, each block applies the inner loop procedure as described on the previous section only to the data inside its own shared memory. However, as explained in Section 2.3, during the envelope calculation, the coordinate of every extrema points need to be considered to calculate the second derivative yj in Equation (4). Since each block is isolated and is practically executing independently from each other, considering all the extrema from the input data is practically impossible. Each block cannot access the extrema data outside its own shared memory. Upon calculating the envelope with the limited data, the envelope from each block will not be smooth and continuous across the boundary between blocks. As a result, the resulting IMF may contain error. Therefore, the data is overlapped for some interval around the boundary during the splitting of the input. This behavior is shown in Figure 5. Each block has an extra part of the data around the boundary allowing it to calculate extra part of the envelope. After the inner loop procedure terminates, each block discards the extra part and outputs the result, resulting in one continuous IMF. However, there is a limitation in this shared method. If the number of extrema found in the data inside each block is too small for envelope to be produced. A certain number of extrema in each thread block is needed to maintain one continuous envelope across blocks. To facilitate this, we propose a switching method between shared and global memory explained in the next section. 3.2 Coordination between shared and global memory methods Our proposed coordination method is shown in Figure 6. When the kernel is launched, the shared method is first used to determine the number of extrema in each block (step a). If the number is larger than a threshold, the shared method is continue (step b). However if the number is lower than a threshold, the EMD method is switched to the global method (step c), where EMD is calculated this time using input data in the global memory. In our proposed strategy, the switch to global method is executed inside the kernel, so to facilitate inter-block communication, a GPU lock-based synchronization is used7) . The basic idea is to use a global mutex variable combined with atomicAdd function to count the number of thread blocks which reach the synchronization point. It acts as a barrier function stalling each thread block until every block reaches the same point.. SM0 Block0. SM1. SMM. Parallelize. Fig. 5 Overview of parallel EMD. Since in a CUDA programming model, the execution of a thread block is nonpreemptive, i.e. its execution cannot be interrupted until it is finished, care must be taken when attempting to coordinate a group of thread blocks to avoid deadlocks. Therefore, a one-to-one mapping between thread blocks and SMs is applied. For a GPU with ’M’ number of SMs, no more than ’M’ blocks can be used in the kernel. After completing the first IMF calculation, a new input data is calculated and the procedure described above is repeated. Next the parallelization strategy for each of the inner loop procedure is described with focus on reducing the complexity of each procedure. 3.3 Extrema identification In this procedure, each thread decides whether or not one point is an extrema. Since comparison with the point before and behind that point can be done for each point independently, extrema identification is simple to perform on a CUDA. 4. c 2010 Information Processing Society of Japan .

(5) Vol.2010-HPC-126 No.3 2010/8/3 IPSJ SIG Technical Report. start SHARED split tasks into shared memory update residue r ← r – h1. count extrema in each block. NO perform EMD on shared memory (b) extrema exist?. 1. 2. 3. 4. 5. 6. 7. h. 2. 5. 6. 3. 8. 5. 9. 4. max_id. 0. 0. 1. 0. 1. 0. 1. 0. (a). + + + + + + + +. YES. extrema < threshold?. YES. GLOBAL. 0. max_id perform EMD on (c) global memory. 0. 0. 1. 1. 1. 1. 1. 1. + + + + + + + + max_id 0. 0. 1. 1. 2. 2. 2. 2. + + + + + + + + NO. max_id 0. end. 0. 1. 1. 2. 2. 3. 3. Fig. 6 EMD shared and global method. Fig. 7 Prefix sum behavior. implementation. Each thread is mapped to a single point in the data to perform the comparison in parallel. However, problem arises when trying to store the extrema coordinate into separate arrays. Just as described in Section 2.2, the subsequent envelope calculation needs separate arrays for x and y values as shown in Figure 4. Here prefix sum method is used for the padding procedure to obtain the max x and max y arrays. Figure 7 shows the prefix sum behavior on an array with 8 elements. First the element of each of identified maxima is set to 1, otherwise set to 0. Then apply the prefix sum as shown in the figure, resulting each element to be the sum of all the elements in the array up to its index. Finally using this value as a new array index, the value from the data are copied to the max x and max y arrays shown in Figure 4. Since the prefix sum requires log(N ) operations, the complexity of extrema identification becomes O(log(N )).. 3.4 Envelope calculation Just as described in Section 2.3, during the envelope calculation, the solution for tridiagonal matrix equations is calculated. To solve these tridiagonal matrix equations on GPU, instead of TDMA, cyclic reduction method known for its high parallelism is used8) . In a cyclic reduction method, the Gaussian elimination procedure is performed in parallel, eliminating all of the terms for each row until a matrix which has only one diagonal non zero entries is obtained. First, the first three rows of matrix Equation (4) can be transformed into general form shown in equations below. b1 y1 +c1 y2 = r1 (5) a2 y1 +b2 y2 + c2 y3 = r2 (6) a 3 y 2 + b 3 y 3 + c3 y 4 = r3 (7) Focusing on Equation (6) on the 2nd row, the a2 y1 term can be eliminated by substituting y1 = (r1 − c1 y2 )/b1 from the Equation (5) on the 1st row. In. 5. c 2010 Information Processing Society of Japan .

(6) Vol.2010-HPC-126 No.3 2010/8/3 IPSJ SIG Technical Report. a similar fashion, the c2 y3 term can also be eliminated by substituting y3 = (r3 − a3y2 − c3y4 )/b3 from the Equation (7) on the 3rd row. The key idea of cyclic reduction is to eliminate the aj yj−1 term and cj yj+1 term for each row j using the equations above and below it. Because the elimination for each row can be performed independently, cyclic reduction can easily be calculated in parallel. After the first step, the aj yj−1 and cj yj+1 term for each row j are eliminated and replaced with aj yj−2 and cj yj+2 term. The Equation (5),(6),(7) become, b1 y1 +c1 y3 = r1 (8) b 2 y 2 + c2 y 4 = r2 (9) a3 y1 + b3 y3 + c3 y5 = r3 (10) which can be written⎡back into the matrix form ⎤ ⎡ as, ⎤ ⎡ ⎤ b1 0 c1 0 y1 r1 ⎢ 0 b ⎥ ⎢ r ⎥ ⎥ 0 c2 ⎢ ⎥⎢ 2 y ⎢ 2⎥ 2⎥ ⎢ ⎥⎢ ⎥ ⎢ ⎥ .. ⎢ ⎥⎢ y3 ⎥ = ⎢ r3 ⎥ . (11) ⎢a3 0 b3 ⎥⎢ ⎢ ⎥ ⎢ ⎥⎢ .. ⎥ ⎥ ⎢ .. ⎥ .. .. .. ⎢ ⎥⎢ . . . 0 ⎦⎣ . ⎦ ⎣ . ⎦ ⎣ yn rn 0 an 0 bn Notice on the left side of the equations, each variable aj and cj changes position after each step with aj moving to the left and cj to the right respectively, while variable bj stays on the same position. This procedure is then repeated to the new equations, this time substituting the aj yj−2 and cj yj+2 term from each row j with row j − 2 and⎡j + 2. After the second step, the equations become, ⎤⎡ ⎤ ⎡ ⎤ b1 0 0 c1 0 y1 r1 ⎢ ⎢ ⎥ b2 0 0 c2 ⎥ ⎢ ⎥⎢ y2 ⎥ ⎢ r2 ⎥ ⎥ ⎢ ⎥⎢ ⎢ ⎢ ⎥ . ⎥ ⎢ ⎥ . ⎢ ⎢ ⎥ y r . (12) b3 ⎢ ⎥ ⎢ 3 ⎥ = ⎢ 3⎥ ⎢ ⎥⎢ . ⎥ ⎢ . ⎥ .. .. ⎢ ⎥ ⎣ .. ⎦ ⎣ .. ⎥ ⎦ . . 0⎦ ⎣ y r n n 0 an 0 0 bn with the variable aj and cj moving farther and eliminated completely from the equations. The subsequent procedure is done using row j−4 and j+4 respectively. Finally, after log(n) operations, a matrix which is having one diagonal non zero entries is obtained just as shown in Equation (13). Looking at these matrix equations, it is naturally clear that the solution is trivial.. ⎡ ⎤ ⎡ ⎤ ⎡ ⎤ y1 r1 b1 0 ⎢ ⎥ ⎢y ⎥ ⎢ r ⎥ b ⎢ ⎥ ⎢ 2 ⎥ ⎢ 2⎥ 2 ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢y3 ⎥ = ⎢ r3 ⎥ b3 (13) ⎢ ⎥⎢ ⎥ ⎢ ⎥ ⎢ ⎥ ⎢ .. ⎥ ⎢ .. ⎥ .. . ⎣ ⎦⎣ . ⎦ ⎣ . ⎦ 0 bn yn rn Thus, the complexity of envelope calculation can be reduced to O(log(n)). 3.5 protoIMF generation There exists a data parallelism in the code shown in Section 2.4. Therefore, the protoIMF generation can easily be parallelized on the GPU. Each thread is mapped to each point where they first calculate the mean and then proceed to generate the protoIMF. 4. Evaluation results First the preliminary evaluation of our shared EMD method is shown. Then the evaluation results of our shared-global EMD method are presented. The CPU and GPU environments used to evaluate our proposed parallel EMD are shown in Table 1. We use a single Intel Core 2 Duo 2.53GHz for CPU, and TESLA C1060 with 240 SP (Stream Processor) for the GPU environment. For the input data, the harmonic signal Hotel California.wav sampled at 44.1kHz is chosen. Sampling point from 512 until 512K from the beginning is used as test data sets. For our shared EMD method evaluation, 64 and 128 points are chosen for the overlap interval. Each block of 256 threads group is responsible for the EMD calculation of each 512 sampling points, therefore each thread in a block is in charge for 2 points. Since the shared EMD method is unusable for the subsequent IMF because the number of extrema will have dropped, only the execution time for IMF1 is evaluated. Additionally, the check to decide Table 1 Evaluation environments Processor Clock Speed @ Development Environment. 6. CPU Core 2 Duo 2.53GHz CentOS 5.3 gcc-4.1.2 -O3. GPU TESLA C1060 602MHz CUDA SDK 2.3 nvcc. c 2010 Information Processing Society of Japan .

(7) Vol.2010-HPC-126 No.3 2010/8/3 IPSJ SIG Technical Report 2000. 2000. 1000. 0. 0 -1000. 0. 500. 1000. 1500. 2000. 2500. 3000. 3500. 500. 1000. 1500. 2000. 2500. 3000. 3500. 4000. 0. 500. 1000. 1500. 2000. 2500. 3000. 3500. 4000. 0. 500. 1000. 1500. 2000. 2500. 3000. 3500. 4000. 0. 500. 1000. 1500. 2000. 2500. 3000. 3500. 4000. 0. 500. 1000. 1500. 2000. 2500. 3000. 3500. 4000. 0. 500. 1000. 1500. 2000. 2500. 3000. 3500. 4000. 0. 500. 1000. 1500. 2000. 2500. 3000. 3500. 4000. 0. 500. 1000. 1500. 2000. 2500. 3000. 3500. 4000. 0. -2000 2000. -500 2000. 1000 0 -1000. 0 -2000 500. 4000. 0. 500. 1000. 1500. 2000. 2500. 3000. 3500. 0. 4000. -2000 1500. -2000. Fig. 8 IMF1 calculated using CPU (green) and GPU shared (overlap=64)(blue). 0 -1500 2000. 30. Speedup (x). 25. 0. gpu (overlap=64) gpu (overlap=128). -2000 1000. 20. 0. 15. -1000 600. 10. 0. 5. -600 300 0. 0. -300. Fig. 10 IMF1 to IMF8 by GPU shared-global method (blue is shared, red is global). Sampling Point Fig. 9 Speedup of GPU shared method on IMF1. of SMs in Tesla. Thus each thread in a block will be mapped to multiple points in the data according to the input size. The threshold value for switching is set to 36, a value obtained from our experiments. In Figure 10, IMF1 to IMF8 results are shown. The first three IMFs shown in blue are calculated using the shared method, while the subsequent IMFs shown in red are calculated using the global method. In this case, after IMF3, the number of extrema inside one of the blocks is smaller than the threshold causing the switch to a global method. Figure 11 shows the execution time of both the global and shared-global method when calculating all IMFs. Figure 12 shows the speedup comparison to a single CPU. From these figures, a maximum speedup of 4.7 is achieved for shared-global. whether the protoIMF satisfy the IMF condition is not performed, but instead stopping the inner loop procedure after 1,000 loops. Figure 8 shows the result of IMF1 from both the CPU (green) and our proposed GPU shared implementation (overlap=64) (blue). There are practically no differences on both of the IMF confirming the effectiveness of the overlap interval. Figure 9 shows the speedup comparison when calculating IMF1, achieving a maximum of 27.7 speedup with the shared EMD implementation (overlap=64). For the global-shared EMD method, the thread block is set to 30, the number. 7. c 2010 Information Processing Society of Japan .

(8) Vol.2010-HPC-126 No.3 2010/8/3. IPSJ SIG Technical Report IPSJ SIG Technical Report. cpu. gpu (global). gpu (shared+global). 5. Conclusion. 200 180. This paper presents an evaluation of our initial implementation of the HilbertHuang Transform for CUDA-enabled devices. A brief introduction to the HilbertHuang Transform, especially the Empirical Mode Decomposition has been provided prior to explaining our parallelization strategy. The results show that our proposed parallelization strategy reduces the complexity to O(log(N )) when used on a GPU. The shared method achieves a maximum speedup of 27.7 times for calculation of a single IMF compared to sequential implementation on a CPU. Meanwhile the shared-global method achieves a maximum speedup of 4.7 times for calculation of all IMFs. Additionally, setting up an overlap interval to achieve the same result to reduce error is shown to be effective. As for future work, we will further investigate a new approach to reduce the overhead for a global method and to increase the shared method usage for better performance. Second, we will add our own implementation of parallel Hilbert Spectral Analysis and perform exhaustive comparison on the final result. Acknowledgments This research was partially supported by Japanese MEXT Fund for Promoting Research on Symbiotic Information Technology.. Time (sec). 160 140 120 100 80 60 40 20 0 15360. 30720. 61440. 122880. 245760. 491520. Sampling Point Fig. 11 Execution time using CPU, GPU global, and GPU shared-global method. 5. Speedup (x). 4. gpu (shared+global) gpu (global). 3. References 2. 1) Huang N. E. et al.: ”The empirical mode decomposition and the Hilbert spectrum for nonlinear and non-stationary time series analysis”D Proc. R. Soc. Lond. A454, p.903-995 (1998)D 2) R.M. Rao and A.S. Bopardikar: ”Wavelet Transforms: Introduction to Theory and Applications”. Addison Wesley Longman (1998). 3) Govindaraju N. K. et al.: ”High Performance Discrete Fourier Transforms on Graphics Processors.” Proc. of Supercomputing, p.1-12 (2008). 4) Nukada A. et al.: ”Bandwidth Intensive 3-D FFT Kernel for GPUs Using CUDA. Proc. of Supercomputing (2008). 5) McKinley S. et al.: ”Cubic Spline Interpolation”. College of the Redwoods (1998)D 6) Press W. H. et al.: ”Numerical Recipes in C: The Art of Scientific Computing”D Cambridge University Press; 2nd edition, p.113-116 (1992). 7) Xiao S. and Feng W.C.: ”Inter-Block GPU Communication via Fast Barrier Synchronization”. Proc. of Int. Parallel and Distributed Processing Symposium (2010). 8) Bini D.A. and Meini B.: ”The cyclic reduction algorithm: from Poisson equation to stochastic processes and beyond”. Numer. Algor. vol.31, p.23-60 (2009).. 1. 0 15360. 30720. 61440. 122880. 245760. 491520. Sampling Point Fig. 12 Speedup of the GPU shared-global method on all IMFs. method. From these results, though the speedup for a single IMF calculated using shared method is higher, the speedup of the global method is lower. We believe that this is caused by the communication overhead involving the global memory access and inter-block synchronization. Therefore, we try to reduce this overhead and increase the shared method usage in the future work.. 8. c 2010 Information Processing Society of Japan .

(9)