GPU Task Parallelism for Scalable Ano
maly Detection
Koji Ueno
12Toyotaro Suzumura
1231Tokyo Institute of Technology
2IBM Research – Tokyo 3JST CREST
Overview of Our Scalable Anomaly Detection
System
Car Sensors
Industrial Factories Videos and Images
Real time anomaly detection using GPUs Real time anomaly detection using GPUs
Sensor Data
We propose GPU Task Parallelism to
accelerate many fine-grained small matrix
computations.
We propose GPU Task Parallelism to
accelerate many fine-grained small matrix
computations.
Useful Information
Background
Stream Computing for Real Time Application
Stream computing (or data stream processing)
Realizes real time responses, in contrast to traditional batch process ing where the incoming data is received, then stored, and finally proc essed later.
Many important applications
Latency-critical anomaly detection
Algorithmic trading
Many kinds of middleware have been developed for stream processing.
IBM : System S
Yahoo! : S4
Retrieving valuable data Large amount of streaming data
Large amount of streaming data
Realizing real time application Realizing real time application
Stream Computing in Memory
3/38
Background
Motivating Scenario for Scalable System
Online car health monitoring service
Cars are equipped with health monitoring sensors for various c ar parts.
The central servers gather sensor data from the customers’ ca r and detects anomaly behavior.
Anomaly detection at the industrial factory
Sensors are attached to various parts of machines.
The central server gathers sensor data from the machines and d etects anomaly behavior.
In both cases, the number of sensors may be more than 10
K – 1M.
Processing anomaly detection algorithm will be the bottleneck for these applications.
Background
Anomaly Detection Algorithm SST
We focus on the anomaly det
ection from time series dat
a like sensor data.
SST
(Singular Spectrum Transforma tion)* is one of change-Point detection algorithms, which can be used fo r anomaly detection.
can be used for various situatio ns with a few user defined param eters.
The most dominant part of SST is S VD (Singular Value Decomposition).
* Tsuyoshi Ide, et al. Knowledge Discovery from Heterogeneous Dynamic
Systems using Change-Point Correlations. 2005. SIAM International Conference on Data Mining (SDM 05).
SVD
Ul = [u1; u2; …; ul] β(t)
z(t) = β(t)
T U l Tβ(t)
||Ul Tβ(t)||
Detecting change point with SST Detecting change point with SST
H(t)T G(t)T
time Windows size
5/38
Our System for Scalable Anomaly Detection
Event streams (sensor data) are distributed into multiple compute node s.
Each event stream can be processed independently.
Each compute node has GPUs for accelerating anomaly detection with SS T.
Gathering processed data
We used IBM
System S, which is
a middleware for
distributed stream
computing.
Sensor Data Sensor Data
GPU
Accelerating Anomaly Detection (SST) with G
PGPU
7/38
The most dominant part of SST is SVD (Singular Value Decomp
osition).
For accelerating SVD, we used CULA.
CULA is a commercial library for accelerated linear algebra inclu ding SVD.
CULA is fast for large matrices (>500).
0 10 20 30 40 50 60
GPU CPU
Window Size (= Matrix Size N x N)
Time in Second
Processing Time of SST for a Single Change Point Score Processing Time of SST for a Single Change Point Score
better
Issues on Accelerating SST with GPGPU
However…
GPU is slow for small matrice
s.
SST requires the small matrix co
mputations for the most applicatio
ns.
Moreover, CULA does not supp
ort computing multiple matrices
(i.e. multiple sensor data) in par
allel.
When you compute multiple sens
or data, processing time will incre
ase linearly.
0 10 20 30 40 50
60 GPU
CPU
Window Size (= Matrix Size N x N)
Time in Second
0 1 2 3 4
5 GPU
CPU
Window Size (= Matrix Size N x N)
Execution Time in sec
GPU is slow for small matrices.
better
Issues on Computing Small Matrix with GPUs
Why is the GPU slow for small matrices?
The larger matrix computation can fully utilize GPU cores, howe
ver the smaller matrix computation can not utilize GPU cores.
Large Matrix
Large Matrix Small MatrixSmall Matrix
Many Idle cores Many Idle
cores Utilization
GPU GPU
9/38
GPU Task Parallelism (GTP)
Processing multiple small matrices in parallel
小さい行列 小さい行列
小さい行列 小さい行列
小さい行列 小さい行列
小さい行列 小さい行列
Prior Arts GPU Task Parallelism
Multiple small matrices
Multiple small matrices
Small Matrix Small Matrix
Small Matrix Small Matrix
Methods for GPU Task Parallelism
There are 2 methods for realizing GPU Task Parallelism.
(A): Using concurrent kernel execution
Concurrent kernel execution is the feature of NVIDIA CUDA.
which can execute multiple kernels (the maximum is 16 kernels) in parallel on a singl e GPU.
The maximum number of kernels executed in parallel is 16, which may limit the performance if matrices are very small.
(B): Processing multiple matrices in a single kernel
In this method, each CUDA thread-block compute single matrix.
Since CUDA can execute many thread-block in parallel, single kernel can proces s multiple matrices.
Since each CUDA thread-block is independently executed on a GPU, this metho d is also flexible.
However, this method may require more development cost than method (A).
We used method (B) for accelerating SVD with small matrices.
11/38
Processing Flow of SVD with GPU Task Paralleli
sm
Data Transfer
CPU GPU
①Input Matrix
②Bidiagonalization
③Bidiagonal Values
④Computation of singular values(∑)
⑤Processing Data
⑥Computation of singular vectors(U, V)
⑦Singular Vectors
Intermediate matrices of singular vectors
All our GPU kernels can compute
multiple matrices in parallel !!
All our GPU kernels can compute
multiple matrices in parallel !!
Performance Evaluation
13/38
We evaluated the 2 cases, processing 64 matrices in
parallel and 256 matrices in parallel and compared w
ith CPU version.
CPU version uses LAPACK and ATLAS.
Results are including data transfer time.
All computation is done with single precision floating point
values.
Environment
CPU: AMD Phenom X4 9850 (4 cores)
GPU: Tesla C1060
MEM:4GB, OS:Cent OS 5.4, CUDA 3.2
32 96 160 224 288 352 416 480 GPU/CPU 1 core (task=64)
GPU/CPU 1 core (task=256) GPU/CPU 4 cores (task=64)
Matrix Size (N x N)
Speedup
Performance Evaluation of our SVD with GPU Task
Parallelism
The number of “task” means the number of matrices processing in parallel.
GPU Compared with 1 CPU coreGPU is 3-4x faster!
GPU is 3-4x faster!
GPU Compared with 4 CPU cores
Small Matrix
CPU:AMD Phenom X4 9850 Machine Spec:
Performance Comparison with CULA
For 256 x 256 matrix, CULA is slower than CPU but GPU with GTP is faster th
an CPU.
GPU CPU 参考 :CULA
0 0.2 0.4 0.6 0.8 1 1.2
0.05
0.37
s r E m ea tr S nt ve pe Pr e im T ng si es oc
1.06better
CPU:AMD Phenom X4 9850 GPU:Tesla C1060
Matrix:
256x256
Matrix:
256x256
15/38
Machine Spec:
GPU with GTP GPU with CULA
Previous approach seconds
1 coreCPU
Evaluation of the Total System Performance
We evaluated maximum throughput of our system.
We used 4 nodes at maximum.
Each node has 1 socket of CPU and 4 GPU boards.
CPU:AMD Phenom X4 9850(4 cores)
GPU:GeForce 8800 GTS 512 x 4
We used System S for distributed processing.
Each node has 4 GPUs
Sensor Data from Disk Sensor Data from Disk
0 4 8 12 16 0
50 100 150 200 250 300 350
Number of GPUs/CPU cores
Throughput (# of scores/sec)
Evaluation of the Total System Performance
CPU:AMD Phenom X4 9850(4 cores) GPU:GeForce 8800 GTS 512 x 4
7x speed up
comparedwith CPU
Matrix:
320x320
Matrix:320x320
17/38
Machine Spec:
Related Work
Accelerating SVD with GPGPU
Sheetal Lahabar, P. J. Narayanan. Singular value decomposition on G
PU using CUDA. IPDPS. 2009.
Their target is large matrices, not small matrices.
Task parallelism with CUDA
Marisabel Guevara, et al. Enabling Task Parallelism in the CUDA Scheduler. P MEA, pp.69-76, September 2009.
With their method, several CUDA kernel can be launched as a single kernel. C urrent CUDA supports the similar feature as concurrent kernel executions.
Real-time processing with GPGPUs
Sangjin Han, et al. PacketShader: a GPU-Accelerated Software Rout
er, SIGCOMM 2010, New Delhi, India, Aug 30 - Sep 3, 2010.
They proposed a novel framework for the GPU-accelerated softwar
e router.
Conclusions and Future Work
Conclusions
We proposed scalable anomaly detection with GPGPU.
There is a problem on accelerating SST with GPGPU.
GPU is not efficient for small sized matrices.
We solved this problem by processing multiple matrices in parallel.
We call this approach “GPU Task Parallelism”.
We developed GPU accelerated SST using GPU task parallelism.
Our performance evaluation shows that we succeeded to accelerate SST w
ith GPGPU.
Future Work
We need to evaluate the cost of implementing GPU task parallelized CUDA
kernel.
We implemented our CUDA kernel by scratch. This is very high cost.
19/38