• 検索結果がありません。

StreamGPU プロジェクト 鈴村研究室 大規模データ処理・ストリームコンピューティング StreamGPU HiPC2012

N/A
N/A
Protected

Academic year: 2018

シェア "StreamGPU プロジェクト 鈴村研究室 大規模データ処理・ストリームコンピューティング StreamGPU HiPC2012"

Copied!
19
0
0

読み込み中.... (全文を見る)

全文

(1)

GPU Task Parallelism for Scalable Ano

maly Detection

Koji Ueno

12

Toyotaro Suzumura

123

1Tokyo Institute of Technology

2IBM Research – Tokyo 3JST CREST

(2)

Overview of Our Scalable Anomaly Detection

System

Car Sensors

Industrial Factories Videos and Images

Real time anomaly detection using GPUs Real time anomaly detection using GPUs

Sensor Data

We propose GPU Task Parallelism to

accelerate many fine-grained small matrix

computations.

We propose GPU Task Parallelism to

accelerate many fine-grained small matrix

computations.

Useful Information

(3)

Background

Stream Computing for Real Time Application

Stream computing (or data stream processing)

Realizes real time responses, in contrast to traditional batch process ing where the incoming data is received, then stored, and finally proc essed later.

Many important applications

Latency-critical anomaly detection

Algorithmic trading

Many kinds of middleware have been developed for stream processing.

IBM : System S

Yahoo! : S4

Retrieving valuable data Large amount of streaming data

Large amount of streaming data

Realizing real time application Realizing real time application

Stream Computing in Memory

3/38

(4)

Background

Motivating Scenario for Scalable System

Online car health monitoring service

Cars are equipped with health monitoring sensors for various c ar parts.

The central servers gather sensor data from the customers’ ca r and detects anomaly behavior.

Anomaly detection at the industrial factory

Sensors are attached to various parts of machines.

The central server gathers sensor data from the machines and d etects anomaly behavior.

In both cases, the number of sensors may be more than 10

K – 1M.

Processing anomaly detection algorithm will be the bottleneck for these applications.

(5)

Background

Anomaly Detection Algorithm SST

We focus on the anomaly det

ection from time series dat

a like sensor data.

SST

(Singular Spectrum Transforma tion)*

is one of change-Point detection algorithms, which can be used fo r anomaly detection.

can be used for various situatio ns with a few user defined param eters.

The most dominant part of SST is S VD (Singular Value Decomposition).

* Tsuyoshi Ide, et al. Knowledge Discovery from Heterogeneous Dynamic

Systems using Change-Point Correlations. 2005. SIAM International Conference on Data Mining (SDM 05).

SVD

Ul = [u1; u2; …; ul] β(t)

z(t) = β(t)

T U l Tβ(t)

||Ul Tβ(t)||

Detecting change point with SST Detecting change point with SST

H(t)T G(t)T

time Windows size

5/38

(6)

Our System for Scalable Anomaly Detection

Event streams (sensor data) are distributed into multiple compute node s.

Each event stream can be processed independently.

Each compute node has GPUs for accelerating anomaly detection with SS T.

Gathering processed data

We used IBM

System S, which is

a middleware for

distributed stream

computing.

Sensor Data Sensor Data

GPU

(7)

Accelerating Anomaly Detection (SST) with G

PGPU

7/38

The most dominant part of SST is SVD (Singular Value Decomp

osition).

For accelerating SVD, we used CULA.

CULA is a commercial library for accelerated linear algebra inclu ding SVD.

CULA is fast for large matrices (>500).

0 10 20 30 40 50 60

GPU CPU

Window Size (= Matrix Size N x N)

Time in Second

Processing Time of SST for a Single Change Point Score Processing Time of SST for a Single Change Point Score

better

(8)

Issues on Accelerating SST with GPGPU

However…

GPU is slow for small matrice

s.

SST requires the small matrix co

mputations for the most applicatio

ns.

Moreover, CULA does not supp

ort computing multiple matrices

(i.e. multiple sensor data) in par

allel.

When you compute multiple sens

or data, processing time will incre

ase linearly.

0 10 20 30 40 50

60 GPU

CPU

Window Size (= Matrix Size N x N)

Time in Second

0 1 2 3 4

5 GPU

CPU

Window Size (= Matrix Size N x N)

Execution Time in sec

GPU is slow for small matrices.

better

(9)

Issues on Computing Small Matrix with GPUs

Why is the GPU slow for small matrices?

The larger matrix computation can fully utilize GPU cores, howe

ver the smaller matrix computation can not utilize GPU cores.

Large Matrix

Large Matrix Small MatrixSmall Matrix

Many Idle cores Many Idle

cores Utilization

GPU GPU

9/38

(10)

GPU Task Parallelism (GTP)

Processing multiple small matrices in parallel

小さい行列 小さい行列

小さい行列 小さい行列

小さい行列 小さい行列

小さい行列 小さい行列

Prior Arts GPU Task Parallelism

Multiple small matrices

Multiple small matrices

Small Matrix Small Matrix

Small Matrix Small Matrix

(11)

Methods for GPU Task Parallelism

There are 2 methods for realizing GPU Task Parallelism.

(A): Using concurrent kernel execution

Concurrent kernel execution is the feature of NVIDIA CUDA.

which can execute multiple kernels (the maximum is 16 kernels) in parallel on a singl e GPU.

The maximum number of kernels executed in parallel is 16, which may limit the performance if matrices are very small.

(B): Processing multiple matrices in a single kernel

In this method, each CUDA thread-block compute single matrix.

Since CUDA can execute many thread-block in parallel, single kernel can proces s multiple matrices.

Since each CUDA thread-block is independently executed on a GPU, this metho d is also flexible.

However, this method may require more development cost than method (A).

We used method (B) for accelerating SVD with small matrices.

11/38

(12)

Processing Flow of SVD with GPU Task Paralleli

sm

Data Transfer

CPU GPU

①Input Matrix

②Bidiagonalization

③Bidiagonal Values

④Computation of singular values(∑)

⑤Processing Data

⑥Computation of singular vectors(U, V)

⑦Singular Vectors

Intermediate matrices of singular vectors

All our GPU kernels can compute

multiple matrices in parallel !!

All our GPU kernels can compute

multiple matrices in parallel !!

(13)

Performance Evaluation

13/38

We evaluated the 2 cases, processing 64 matrices in

parallel and 256 matrices in parallel and compared w

ith CPU version.

CPU version uses LAPACK and ATLAS.

Results are including data transfer time.

All computation is done with single precision floating point

values.

Environment

CPU: AMD Phenom X4 9850 (4 cores)

GPU: Tesla C1060

MEM:4GB, OS:Cent OS 5.4, CUDA 3.2

(14)

32 96 160 224 288 352 416 480 GPU/CPU 1 core (task=64)

GPU/CPU 1 core (task=256) GPU/CPU 4 cores (task=64)

Matrix Size (N x N)

Speedup

Performance Evaluation of our SVD with GPU Task

Parallelism

The number of “task” means the number of matrices processing in parallel.

GPU Compared with 1 CPU core

GPU is 3-4x faster!

GPU is 3-4x faster!

GPU Compared with 4 CPU cores

Small Matrix

CPU:AMD Phenom X4 9850 Machine Spec:

(15)

Performance Comparison with CULA

For 256 x 256 matrix, CULA is slower than CPU but GPU with GTP is faster th

an CPU.

GPU CPU 参考 :CULA

0 0.2 0.4 0.6 0.8 1 1.2

0.05

0.37

s r E m ea tr S nt ve pe Pr e im T ng si es oc

1.06

better

CPU:AMD Phenom X4 9850 GPU:Tesla C1060

Matrix:

256x256

Matrix:

256x256

15/38

Machine Spec:

GPU with GTP GPU with CULA

Previous approach seconds

1 coreCPU

(16)

Evaluation of the Total System Performance

We evaluated maximum throughput of our system.

We used 4 nodes at maximum.

Each node has 1 socket of CPU and 4 GPU boards.

CPU:AMD Phenom X4 9850(4 cores)

GPU:GeForce 8800 GTS 512 x 4

We used System S for distributed processing.

Each node has 4 GPUs

Sensor Data from Disk Sensor Data from Disk

(17)

0 4 8 12 16 0

50 100 150 200 250 300 350

Number of GPUs/CPU cores

Throughput (# of scores/sec)

Evaluation of the Total System Performance

CPU:AMD Phenom X4 9850(4 cores) GPU:GeForce 8800 GTS 512 x 4

7x speed up

compared

with CPU

Matrix:

320x320

Matrix:

320x320

17/38

Machine Spec:

(18)

Related Work

Accelerating SVD with GPGPU

Sheetal Lahabar, P. J. Narayanan. Singular value decomposition on G

PU using CUDA. IPDPS. 2009.

Their target is large matrices, not small matrices.

Task parallelism with CUDA

Marisabel Guevara, et al. Enabling Task Parallelism in the CUDA Scheduler. P MEA, pp.69-76, September 2009.

With their method, several CUDA kernel can be launched as a single kernel. C urrent CUDA supports the similar feature as concurrent kernel executions.

Real-time processing with GPGPUs

Sangjin Han, et al. PacketShader: a GPU-Accelerated Software Rout

er, SIGCOMM 2010, New Delhi, India, Aug 30 - Sep 3, 2010.

They proposed a novel framework for the GPU-accelerated softwar

e router.

(19)

Conclusions and Future Work

Conclusions

We proposed scalable anomaly detection with GPGPU.

There is a problem on accelerating SST with GPGPU.

GPU is not efficient for small sized matrices.

We solved this problem by processing multiple matrices in parallel.

We call this approach “GPU Task Parallelism”.

We developed GPU accelerated SST using GPU task parallelism.

Our performance evaluation shows that we succeeded to accelerate SST w

ith GPGPU.

Future Work

We need to evaluate the cost of implementing GPU task parallelized CUDA

kernel.

We implemented our CUDA kernel by scratch. This is very high cost.

19/38

参照

関連したドキュメント

Since all shift morphisms are bounded sliding block codes in the finite alphabet case, this implies that if K is any field, and if E and F are finite graphs with no sinks and

Then, since S 3 does not contain a punctured lens space with non-trivial fundamental group, we see that A 1 is boundary parallel in V 2 by Lemma C-3 (see the proof of Claim 1 in Case

(Furthermore, a bound on the number of elementary matrices can be found that depends only on n, and is universal for all fields.) In the case of fields, this can easily be

Minimum rank, Symmetric matrix, Finite field, Projective geometry, Polarity graph, Bilinear symmetric form.. AMS

In this way, many properties of lattices (as modularity, distributivity, ... and so on) can be characterized by properties of some hyperstructures associated to them.. Since E

Since each convexity ideal in question is σ -generated by closed sets, and there are exactly continuum many closed subsets of any perfect Polish space, each of these ideals

A set S of lines is universal for drawing planar graphs with n vertices if every planar graph G with n vertices can be drawn on S such that each vertex of G is drawn as a point on

Rose, “The index and the Drazin inverse of block triangular matrices,” SIAM Journal on Applied Mathematics, vol. Wang, “The reverse order law for the Drazin inverses of multiple