StreamGPU プロジェクト鈴村研究室大規模データ処理・ストリームコンピューティング StreamGPU HiPC2012

(1)

GPU Task Parallelism for Scalable Ano

maly Detection

Koji Ueno

¹²

Toyotaro Suzumura

¹²³

1Tokyo Institute of Technology

2IBM Research – Tokyo ³JST CREST

(2)

Overview of Our Scalable Anomaly Detection

System

Car Sensors

Industrial Factories Videos and Images

Real time anomaly detection using GPUs Real time anomaly detection using GPUs

Sensor Data

We propose GPU Task Parallelism to

accelerate many fine-grained small matrix

computations.

We propose GPU Task Parallelism to

accelerate many fine-grained small matrix

computations.

Useful Information

(3)

Background

Stream Computing for Real Time Application

 Stream computing (or data stream processing)

 Realizes real time responses, in contrast to traditional batch process ing where the incoming data is received, then stored, and finally proc essed later.

 Many important applications

 Latency-critical anomaly detection

 Algorithmic trading

 Many kinds of middleware have been developed for stream processing.

 IBM : System S

 Yahoo! : S4

Retrieving valuable data Large amount of streaming data

Large amount of streaming data

Realizing real time application Realizing real time application

Stream Computing in Memory

3/38

(4)

Background

Motivating Scenario for Scalable System



Online car health monitoring service

 Cars are equipped with health monitoring sensors for various c ar parts.

 The central servers gather sensor data from the customers’ ca r and detects anomaly behavior.



Anomaly detection at the industrial factory

 Sensors are attached to various parts of machines.

 The central server gathers sensor data from the machines and d etects anomaly behavior.



In both cases, the number of sensors may be more than 10

K – 1M.

 Processing anomaly detection algorithm will be the bottleneck for these applications.

(5)

Background

Anomaly Detection Algorithm SST



We focus on the anomaly det

ection from time series dat

a like sensor data.



_SST

(Singular Spectrum Transforma tion)*

 is one of change-Point detection algorithms, which can be used fo r anomaly detection.

 can be used for various situatio ns with a few user defined param eters.

 The most dominant part of SST is S VD (Singular Value Decomposition).

* Tsuyoshi Ide, et al. Knowledge Discovery from Heterogeneous Dynamic

Systems using Change-Point Correlations. 2005. SIAM International Conference on Data Mining (SDM 05).

SVD

U_l= [u₁; u₂; …; u_l] β(t)

z(t) = ^β(t)

T_U l ^T^β(t)

||U_l^Tβ(t)||

Detecting change point with SST Detecting change point with SST

H(t)^T G(t)^T

time Windows size

5/38

(6)

Our System for Scalable Anomaly Detection

 Event streams (sensor data) are distributed into multiple compute node s.

 Each event stream can be processed independently.

 Each compute node has GPUs for accelerating anomaly detection with SS T.

Gathering processed data



We used IBM

System S, which is

a middleware for

distributed stream

computing.

Sensor Data Sensor Data

GPU

(7)

Accelerating Anomaly Detection (SST) with G

PGPU

7/38



The most dominant part of SST is SVD (Singular Value Decomp

osition).



For accelerating SVD, we used CULA.

 CULA is a commercial library for accelerated linear algebra inclu ding SVD.



CULA is fast for large matrices (>500).

0 10 20 30 40 50 60

GPU CPU

Window Size (= Matrix Size N x N)

Time in Second

Processing Time of SST for a Single Change Point Score Processing Time of SST for a Single Change Point Score

better

(8)

Issues on Accelerating SST with GPGPU



_However…



GPU is slow for small matrice

s.



SST requires the small matrix co

mputations for the most applicatio

ns.



Moreover, CULA does not supp

ort computing multiple matrices

(i.e. multiple sensor data) in par

allel.



When you compute multiple sens

or data, processing time will incre

ase linearly.

0 10 20 30 40 50

60 GPU

CPU

Time in Second

0 1 2 3 4

5 GPU

CPU

Execution Time in sec

GPU is slow for small matrices.

better

(9)

Issues on Computing Small Matrix with GPUs



Why is the GPU slow for small matrices?



The larger matrix computation can fully utilize GPU cores, howe

ver the smaller matrix computation can not utilize GPU cores.

Large Matrix

Large Matrix Small MatrixSmall Matrix

Many Idle cores Many Idle

cores Utilization

GPU GPU

9/38

(10)

GPU Task Parallelism (GTP)



Processing multiple small matrices in parallel

小さい行列小さい行列

Prior Arts GPU Task Parallelism

Multiple small matrices

Small Matrix Small Matrix

(11)

Methods for GPU Task Parallelism



There are 2 methods for realizing GPU Task Parallelism.



(A): Using concurrent kernel execution

 Concurrent kernel execution is the feature of NVIDIA CUDA.

 which can execute multiple kernels (the maximum is 16 kernels) in parallel on a singl e GPU.

 The maximum number of kernels executed in parallel is 16, which may limit the performance if matrices are very small.



(B): Processing multiple matrices in a single kernel

 In this method, each CUDA thread-block compute single matrix.

 Since CUDA can execute many thread-block in parallel, single kernel can proces s multiple matrices.

 Since each CUDA thread-block is independently executed on a GPU, this metho d is also flexible.

 However, this method may require more development cost than method (A).



We used method (B) for accelerating SVD with small matrices.

11/38

(12)

Processing Flow of SVD with GPU Task Paralleli

sm

Data Transfer

CPU _GPU

①Input Matrix

②Bidiagonalization

③Bidiagonal Values

④Computation of singular values(∑)

⑤Processing Data

⑥Computation of singular vectors(U, V)

⑦Singular Vectors

Intermediate matrices of singular vectors

All our GPU kernels can compute

multiple matrices in parallel !!

All our GPU kernels can compute

multiple matrices in parallel !!

(13)

Performance Evaluation

13/38



We evaluated the 2 cases, processing 64 matrices in

parallel and 256 matrices in parallel and compared w

ith CPU version.



CPU version uses LAPACK and ATLAS.



Results are including data transfer time.



All computation is done with single precision floating point

values.



Environment



CPU: AMD Phenom X4 9850 (4 cores)



GPU: Tesla C1060



MEM:4GB, OS:Cent OS 5.4, CUDA 3.2

(14)

32 96 160 224 288 352 416 480 GPU/CPU 1 core (task=64)

GPU/CPU 1 core (task=256) GPU/CPU 4 cores (task=64)

Matrix Size (N x N)

Speedup

Performance Evaluation of our SVD with GPU Task

Parallelism



The number of “task” means the number of matrices processing in parallel.

GPU Compared with 1 CPU core

GPU is 3-4x faster!

GPU Compared with 4 CPU cores

Small Matrix

CPU:AMD Phenom X4 9850 Machine Spec:

(15)

Performance Comparison with CULA



For 256 x 256 matrix, CULA is slower than CPU but GPU with GTP is faster th

an CPU.

GPU CPU _{参考 :CULA}

0 0.2 0.4 0.6 0.8 1 1.2

0.05

0.37

s r E m ea tr S nt ve pe Pr e im T ng si es oc

1.06

better

CPU:AMD Phenom X4 9850 GPU:Tesla C1060

Matrix:

256x256

Matrix:

256x256

15/38

Machine Spec:

GPU with GTP GPU with CULA

Previous approach seconds

1 coreCPU

(16)

Evaluation of the Total System Performance



We evaluated maximum throughput of our system.



We used 4 nodes at maximum.



Each node has 1 socket of CPU and 4 GPU boards.

 CPU:AMD Phenom X4 9850(4 cores)

 GPU:GeForce 8800 GTS 512 x 4



We used System S for distributed processing.

Each node has 4 GPUs

Sensor Data from Disk Sensor Data from Disk

(17)

0 4 8 12 16 0

50 100 150 200 250 300 350

Number of GPUs/CPU cores

Throughput (# of scores/sec)

Evaluation of the Total System Performance

CPU:AMD Phenom X4 9850(4 cores) GPU:GeForce 8800 GTS 512 x 4

7x speed up

^compared

with CPU

Matrix:

_320x320

Matrix:

_320x320

17/38

Machine Spec:

(18)

Related Work



Accelerating SVD with GPGPU



Sheetal Lahabar, P. J. Narayanan. Singular value decomposition on G

PU using CUDA. IPDPS. 2009.



Their target is large matrices, not small matrices.



Task parallelism with CUDA

 Marisabel Guevara, et al. Enabling Task Parallelism in the CUDA Scheduler. P MEA, pp.69-76, September 2009.

 With their method, several CUDA kernel can be launched as a single kernel. C urrent CUDA supports the similar feature as concurrent kernel executions.



Real-time processing with GPGPUs



Sangjin Han, et al. PacketShader: a GPU-Accelerated Software Rout

er, SIGCOMM 2010, New Delhi, India, Aug 30 - Sep 3, 2010.



They proposed a novel framework for the GPU-accelerated softwar

e router.

(19)

Conclusions and Future Work



Conclusions



We proposed scalable anomaly detection with GPGPU.



There is a problem on accelerating SST with GPGPU.

 GPU is not efficient for small sized matrices.



We solved this problem by processing multiple matrices in parallel.

 We call this approach “GPU Task Parallelism”.



We developed GPU accelerated SST using GPU task parallelism.



Our performance evaluation shows that we succeeded to accelerate SST w

ith GPGPU.



Future Work



We need to evaluate the cost of implementing GPU task parallelized CUDA

kernel.

 We implemented our CUDA kernel by scratch. This is very high cost.

19/38

StreamGPU プロジェクト 鈴村研究室 大規模データ処理・ストリームコンピューティング StreamGPU HiPC2012

GPU Task Parallelism for Scalable Ano

maly Detection

Koji Ueno

Toyotaro Suzumura

Overview of Our Scalable Anomaly Detection

System

We propose GPU Task Parallelism to

accelerate many fine-grained small matrix

computations.

We propose GPU Task Parallelism to

accelerate many fine-grained small matrix

computations.

Background

Stream Computing for Real Time Application

Background

Motivating Scenario for Scalable System

Online car health monitoring service

Anomaly detection at the industrial factory

In both cases, the number of sensors may be more than 10

K – 1M.

Background

Anomaly Detection Algorithm SST

We focus on the anomaly det

ection from time series dat

a like sensor data.

SST

SVD

Our System for Scalable Anomaly Detection

Gathering processed data

We used IBM

System S, which is

a middleware for

distributed stream

computing.

GPU

Accelerating Anomaly Detection (SST) with G

PGPU

The most dominant part of SST is SVD (Singular Value Decomp

osition).

For accelerating SVD, we used CULA.

CULA is fast for large matrices (>500).

Issues on Accelerating SST with GPGPU

However…

GPU is slow for small matrice

s.

SST requires the small matrix co

mputations for the most applicatio

ns.

Moreover, CULA does not supp

ort computing multiple matrices

(i.e. multiple sensor data) in par

allel.

When you compute multiple sens

or data, processing time will incre

ase linearly.

Issues on Computing Small Matrix with GPUs

Why is the GPU slow for small matrices?

The larger matrix computation can fully utilize GPU cores, howe

ver the smaller matrix computation can not utilize GPU cores.

GPU Task Parallelism (GTP)

Processing multiple small matrices in parallel

Prior Arts GPU Task Parallelism

Multiple small matrices

Multiple small matrices

Methods for GPU Task Parallelism

There are 2 methods for realizing GPU Task Parallelism.

(A): Using concurrent kernel execution

(B): Processing multiple matrices in a single kernel

We used method (B) for accelerating SVD with small matrices.

Processing Flow of SVD with GPU Task Paralleli

sm

All our GPU kernels can compute

multiple matrices in parallel !!

All our GPU kernels can compute

multiple matrices in parallel !!

Performance Evaluation

We evaluated the 2 cases, processing 64 matrices in

parallel and 256 matrices in parallel and compared w

ith CPU version.

StreamGPU プロジェクト鈴村研究室大規模データ処理・ストリームコンピューティング StreamGPU HiPC2012

_SST

_However…

_320x320

_320x320