(1)

R on super-computers at ISM

Junji Nakano
The Institute of Statistical Mathematics

Ei-ji Nakama
COM-ONE Inc.

(2)

Abstract

• The Institute of Statistical Mathematics (ISM)
  - is a (national) research institute for statistical sciences
  - provides several super-computer systems for statistical research by Japanese statisticians.
• We use R on these super-computers and have made several parallel computing functions available on it. These functions include
  - parallel BLAS replacements such as ATLAS and GOTO library
  - snow package for implementing parallel R functions using MPI and other distributed computing techniques.
• We also implement a Web environment to use parallel R easily.

(3)

R and super-computers (1)

• R is widely used for developing statistical theory and for real data analysis
• R sometimes requires a huge amount of calculation, such as
  - Huge matrix manipulations
  - Bootstrap or MCMC calculation
  - Data mining
• Super-computers may be useful for this purpose

(4)

R and super-computers (2)

• Current super-computers are equipped with
  - Many CPUs
    - However, a single CPU of a super-computer is not so different from the ones in personal computers
  - Large memory
• Almost all super-computers at present use parallel computing techniques
• We use R on our super-computers to utilize these facilities

(5)

Parallel computing systems (1)

• Traditional 1 CPU system
• Shared memory system
• Distributed memory system

[Figure: block diagrams of the three systems: one CPU with one memory; several CPUs attached to one shared memory; several CPUs, each with its own memory]

(6)

Parallel computing systems (2)

[Figure: execution over time of a program's serial section and parallelizable sections on a single computer, a shared memory system, and a distributed memory system, each drawn with its CPU and memory layout]
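The split into a serial section and parallelizable sections can be put into numbers with Amdahl's law. The slides do not name this law; it is added here as a standard illustration, and the parallel fraction p = 0.9 is an arbitrary example value, not a measurement from ISM.

```r
# Amdahl's law: speedup on n CPUs when a fraction p of the single-CPU
# run time can be parallelized and the remaining 1 - p stays serial.
amdahl_speedup <- function(p, n) {
  stopifnot(p >= 0, p <= 1, n >= 1)
  1 / ((1 - p) + p / n)
}

cpus <- c(1, 2, 4, 8, 16, 32, 64)
speedup <- amdahl_speedup(p = 0.9, n = cpus)  # p = 0.9 is illustrative only
print(data.frame(cpus = cpus, speedup = round(speedup, 2)))
```

However many CPUs are added, the speedup stays below 1 / (1 - p), so the serial section limits what a parallel machine can gain.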

(7)

Parallel computing systems (3)

• Software tools for parallel computing
  - Shared memory system
    - Process fork, thread
    - OpenMP
  - Distributed memory system
    - Message passing
      - PVM (Parallel Virtual Machine)
      - MPI (Message Passing Interface) (mainly)
    - HPF (High Performance Fortran)

(8)

Super-computers at ISM (1): ismaltx

• SGI Altix3700 Super Cluster
  - CPU: Intel Itanium2 1.3GHz
    - 64 x 4 = 256 CPUs (Total max speed: 1331.2 GFLOPS)
  - Memory: total 1920 GB
    - 512 GB x 3 + 384 GB = 1920 GB
  - OS: RedHat Enterprise Linux v.3 (for SGI)
  - Physical random number generator boards (16 sets)
• Frontend computers for the parallel computer
  - SGI Altix3300
  - 8 CPUs (Intel Itanium2 1.3GHz)
  - 32GB memory

(9)

Super-computers at ISM (2): ismsr

• HITACHI SR11000 Model H1
  - CPU: IBM Power4+ 1.7GHz
    - 16 x 4 = 64 (Total max speed: 435.2 GFLOPS)
  - Memory: total 128 GB
    - 32 GB x 4 = 128 GB
  - OS: (IBM) AIX 5L V5.2

(10)

Super-computers at ISM (3): ismxc

• HP XC4000 System
  - CPU: AMD Opteron 2.6GHz
    - 2 x 128 = 256 CPUs (Total max speed: 1331.2 GFLOPS)
  - Memory: total 640 GB
    - 4 GB x 96 + 8 GB x 32 = 640 GB
  - OS: RedHat Enterprise Linux v.3 (for HP)
  - Physical random number generator boards (8 sets)
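The "total max speed" figures on the three hardware slides can be reproduced as CPUs x clock x floating-point operations per cycle. In the sketch below the CPU counts, clock rates, and stated totals come from the slides; the flops_per_cycle values are an assumption, chosen because they reproduce the stated totals, and are not given in the slides.

```r
# Peak speed (GFLOPS) = number of CPUs x clock (GHz) x flops per cycle.
systems <- data.frame(
  name            = c("ismaltx", "ismsr", "ismxc"),
  cpus            = c(256, 64, 256),
  clock_ghz       = c(1.3, 1.7, 2.6),
  flops_per_cycle = c(4, 4, 2),          # assumed, not stated on the slides
  stated_gflops   = c(1331.2, 435.2, 1331.2)
)
systems$peak_gflops <- with(systems, cpus * clock_ghz * flops_per_cycle)
print(systems)

# The memory totals on the slides can be checked the same way (GB).
memory_gb <- c(ismaltx = 512 * 3 + 384, ismsr = 32 * 4, ismxc = 4 * 96 + 8 * 32)
print(memory_gb)
```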

(11)

BLAS

• BLAS: Basic Linear Algebra Subprograms
  - high quality "building block" routines for performing basic vector and matrix operations
    - Level 1 BLAS do vector-vector operations
    - Level 2 BLAS do matrix-vector operations
    - Level 3 BLAS do matrix-matrix operations
  - Because BLAS are efficient, portable, and widely available, they are commonly used in the development of high quality linear algebra software, LINPACK and LAPACK for example
• R uses BLAS for executing linear algebra calculations
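As a rough illustration of the three levels, the following base R expressions are of the vector-vector, matrix-vector, and matrix-matrix kind. Which BLAS routine R actually dispatches to for each expression depends on the R version and build, so the level labels in the comments describe the kind of operation, not a guaranteed call.

```r
set.seed(1)
n <- 200
x <- rnorm(n)
y <- rnorm(n)
A <- matrix(rnorm(n * n), n, n)
B <- matrix(rnorm(n * n), n, n)

dot <- sum(x * y)     # vector-vector operation (the Level 1 kind)
Ax  <- A %*% x        # matrix-vector operation (the Level 2 kind)
AB  <- A %*% B        # matrix-matrix operation (the Level 3 kind)
AtA <- crossprod(A)   # t(A) %*% A, also a matrix-matrix operation

print(c(dot = dot, rows_AB = nrow(AB), cols_AB = ncol(AB)))
```

The matrix-matrix case does about n^3 multiply-add pairs on n^2 data, which is why replacing the BLAS matters most for operations like A %*% B.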

(12)

ATLAS (1)

• ATLAS: Automatically Tuned Linear Algebra Software
  - achieves performance on par with machine-specific tuned libraries
  - provides C and Fortran77 interfaces to a portably efficient BLAS implementation, as well as a few routines from LAPACK
  - Threads are available
• ATLAS FAQ says:

(13)

ATLAS (2)

• ATLAS is used, or is planned to be used, in the following PSEs:
  - MAPLE (v7 and higher)
  - MATLAB (v6.0 and higher)
  - Mathematica (forthcoming)
  - Octave
• ATLAS is also included in:
  - Absoft Pro Fortran
• ATLAS may be optionally used by almost any project requiring the BLAS. Here are some projects that we have seen providing the option for using ATLAS:
  - GSL
  - HPL
  - LAPACK
  - MPB
  - The R Project
  - Scientific Python
  - Scilab
  - PWSCF

(14)

GOTO library

• Kazushige Goto's library (libgoto) is a highly tuned BLAS library for various architectures. It is based on minimising TLB misses. These BLAS libraries contain the following functions
  - Level 1
  - Level 1 extensions
  - Level 2
  - Level 3
  - a part of LAPACK: slaswp, sgetf2, sgetrf, sgetrs, sgesv, dlaswp, dgetf2, dgetrf, dgetrs, dgesv, claswp, cgetf2, cgetrf, cgetrs, cgesv, zlaswp, zgetf2, zgetrf, zgetrs, zgesv.
• The library can be downloaded from Goto's web pages
• Threads are available.

(15)

Execute ATLAS R on ismaltx (1)

• Log in to ismaltx.ism.ac.jp by telnet
  - Many Rs are stored in /usr/local1/ .

    {ismaltx 3}/usr/local1% ls
    bin        lib             R-1.9.1.intel8       R-2.0.1.gccgoto
    dust       log             R-1.9.1.intel8atlas  R-2.0.1.gccgotop
    etc        man             R-1.9.1.intel8goto   R-2.0.1.intel8
    example    R-1.9.1         R-1.9.1.intel8gotop  R-2.0.1.intel8atlas
    gcc-3.4.1  R-1.9.1.gcc     R-1.9.1.intel8scs    R-2.0.1.intel8atlasp
    gcc-3.4.2  R-1.9.1.gccatlas  R-2.0.1.gcc        R-2.0.1.intel8goto
    include    R-1.9.1.gccgoto   R-2.0.1.gccatlas   R-2.0.1.intel8gotop
    info       R-1.9.1.gccscs    R-2.0.1.gccatlasp  share
    {ismaltx 4}/usr/local1%

• Simple example of R program (ex1.R)

    # Time one m x m matrix product and return the elapsed seconds.
    matrixcalc <- function(m)
    {
      A <- array(rnorm(m^2), dim=c(m,m))
      B <- array(rnorm(m^2), dim=c(m,m))
      st <- proc.time()
      A %*% B
      et <- proc.time()
      return((et-st)[3])
    }
    print(matrixcalc(2048))
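The benchmark figures later in the slides are plotted in MFLOPS, while matrixcalc() returns seconds. A minimal sketch of the conversion follows, assuming an m x m by m x m product costs about 2 * m^3 floating-point operations (a common convention that the slides do not state). The names mflops_from_time and matrix_mflops are hypothetical helpers, not part of ex1.R.

```r
# Convert an elapsed time for an m x m matrix product into MFLOPS,
# assuming about 2 * m^3 floating-point operations per product.
mflops_from_time <- function(m, elapsed_sec) {
  2 * m^3 / elapsed_sec / 1e6
}

# Same timing idea as matrixcalc(), returning both seconds and MFLOPS.
matrix_mflops <- function(m) {
  A <- matrix(rnorm(m^2), m, m)
  B <- matrix(rnorm(m^2), m, m)
  elapsed <- system.time(A %*% B)[["elapsed"]]
  c(elapsed_sec = elapsed, mflops = mflops_from_time(m, elapsed))
}

print(matrix_mflops(512))  # small size so it runs quickly on a desktop
```

The result depends on which BLAS the running R is linked against, which is the point of the comparison on these slides.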

(16)

Execute ATLAS R on ismaltx (2)

• Shell file to choose a specific R (R.sh)

    #!/bin/sh
    RVER=2.0.1
    COMP=intel8
    BLAS=atlasp
    export ATLAS_NUM_THREADS=8
    export LD_LIBRARY_PATH=/usr/local1/lib/atlas/atlas${ATLAS_NUM_THREADS}:${LD_LIBRARY_PATH}
    /usr/local1/R-${RVER}.${COMP}${BLAS}/bin/R -q --no-save --args < $1

• Write qsub script for batch job (ex1.sh)

    #!/bin/sh
    #PBS -q q8
    #PBS -j oe
    #PBS -l ncpus=8
    cd /home0/nakanoj/itanium/work
    umask 022
    sh ./R.sh ex1.R

(17)

Execute ATLAS R on ismaltx (3)

• Execute them on ismaltx
• ATLAS_NUM_THREADS=1

    {ismaltx 150}/home0/nakanoj/itanium/work% qsub ex1.sh
    27404.ismaltx
    {ismaltx 151}/home0/nakanoj/itanium/work% cat ex1.sh.o27404
    > matrixcalc <- function(m)
    + { A<-array(rnorm(m^2),dim=c(m,m))
    + B<-array(rnorm(m^2),dim=c(m,m))
    + st<-proc.time()
    + A%*%B
    + et<-proc.time()
    + return((et-st)[3]) }
    > print(matrixcalc(2048))
    [1] 1.039062
    >
    >
    {ismaltx 152}/home0/nakanoj/itanium/work%
    {ismaltx 153}/home0/nakanoj/itanium/work% qsub ex1.sh
    27405.ismaltx
    {ismaltx 154}/home0/nakanoj/itanium/work% cat ex1.sh.o27405
    > matrixcalc <- function(m)
    + { A<-array(rnorm(m^2),dim=c(m,m))
      [...]
    + return((et-st)[3]) }

(18)

ATLAS and GOTO on ismaltx

[Figure: four plots of time (sec) against matrix size. "BLAS on ISMALTX" (matrix sizes 256 to 16384) and "BLAS on ISMALTX 256−4096" (matrix sizes 256 to 4096) have the legend: ATLAS thread, ATLAS, goto. "GOTO on ISMALTX" and "GOTO on ISMALTX 256−4096" have the legend: 1, 2, 4, 8, 16, and 32 threads.]

(19)

[Figure: "R−2.2.1 + GotoBLAS−1.08 on IA64 Linux". MFLOPS (0 to 140000) against N (0 to 30000). Legend: ismaltx4 gcc_GOTO_1, ismaltx3 gcc_GOTO_2, ismaltx3 gcc_GOTO_4, ismaltx3 gcc_GOTO_8, ismaltx4 gcc_GOTO_16, ismaltx2 gcc_GOTO_32, ismaltx2 gcc_GOTO_48, ismaltx2 gcc_GOTO_64.]

(20)

[Figure: "R−2.2.1 + GotoBLAS−1.02 on Power4+ AIX5.2L". MFLOPS (0 to 70000) against N (0 to 15000). Legend: ismsr gcc_GOTO_1, srnd02 gcc_GOTO_2, ismsr gcc_GOTO_4, ismsr gcc_GOTO_8, ismsr gcc_GOTO_16.]

(21)

MPI

• De facto message passing library specification
  - extended message-passing model
  - not a language or compiler specification
  - not a specific implementation or product
• For parallel computers, clusters, and (mainly) homogeneous networks
• Designed to provide access to advanced parallel hardware for end users, library writers, and tool developers

(22)

MPI is Simple

• Many parallel programs can be written using just these six functions, only two of which are non-trivial:
  - MPI_INIT
  - MPI_FINALIZE
  - MPI_COMM_SIZE
  - MPI_COMM_RANK
  - MPI_SEND
  - MPI_RECV

(23)

ScaLAPACK

• ScaLAPACK (Scalable LAPACK) library
  - a subset of LAPACK routines redesigned for distributed memory parallel computers
  - It is portable on any computer that supports MPI or PVM.
  - Like LAPACK, the ScaLAPACK routines are based on block-partitioned algorithms in order to minimize the frequency of data movement between different levels of the memory hierarchy
  - The fundamental building blocks of the ScaLAPACK library are distributed memory versions (PBLAS) of the Level 1, 2 and 3 BLAS, and a set of Basic Linear Algebra Communication Subprograms (BLACS) for communication tasks that arise frequently in parallel linear algebra computations
  - In the ScaLAPACK routines, all interprocessor communication occurs within the PBLAS and the BLACS

(24)

Execute RScaLAPACK on ismaltx

• Simple R example (ex2.R)

    library(RScaLAPACK)
    m <- 2048
    num_thread <- 4
    matrixcalc<-function(m,thread)
    {
      A<-array(rnorm(m^2),dim=c(m,m))
      B<-array(rnorm(m^2),dim=c(m,m))
      AB<-A%*%B
      st<-proc.time()
      sla.solve(A,AB,thread)
      et<-proc.time()
      return((et-st)[3])
    }
    print(matrixcalc(m,num_thread))

• Script for qsub (ex2.sh)

    #!/bin/sh
    #PBS -q q32
    #PBS -j oe
    #PBS -l ncpus=32
    cd /home0/nakano/itanium/work
    umask 022
    export LD_LIBRARY_PATH=/usr/local1/lib/atlas/atlas1:$LD_LIBRARY_PATH
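ex2.R times the solution of A X = AB, whose exact answer is B. The sketch below is a serial analogue that uses base R's solve() in place of sla.solve(), so it runs without RScaLAPACK or MPI; it shows the problem being benchmarked and how the result could be checked, not the distributed implementation. The matrix size 200 is an arbitrary small value.

```r
set.seed(42)
m <- 200                                   # small illustrative size
A <- array(rnorm(m^2), dim = c(m, m))
B <- array(rnorm(m^2), dim = c(m, m))
AB <- A %*% B

# Solve A X = AB serially; X should recover B up to rounding error.
elapsed <- system.time(X <- solve(A, AB))[["elapsed"]]
max_err <- max(abs(X - B))
cat("elapsed:", elapsed, "sec; max abs error:", max_err, "\n")
```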

(25)

RScaLAPACK on ismaltx

[Figure: two plots of time (sec) against CPU (nomp, 1, 2, 4, 8, 16, 32, 64). "RScaLAPACK on ISMALTX all" shows matrix sizes 1024^2, 2048^2, 4096^2, 6144^2, 8192^2, and 10240^2; "RScaLAPACK on ISMALTX 1024 − 6144" shows 1024^2, 2048^2, 4096^2, and 6144^2.]

(26)

Simple Network of Workstations (snow) for R (1)

• The package snow (Simple Network of Workstations) implements a simple mechanism for using a workstation cluster for “embarrassingly parallel” computations in R
• Three low level interfaces have been implemented, one based on sockets, one using PVM via the rpvm package and one using MPI, via the Rmpi package
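A minimal sketch of the socket interface on a single machine follows. It uses the parallel package that ships with current R, whose cluster functions derive from snow, rather than snow itself; the worker count of 2 and the bootstrap-style task are illustrative assumptions, not taken from the slides.

```r
library(parallel)  # bundled with R; cluster API derived from snow

cl <- makeCluster(2)                       # two local socket workers (assumed count)

# Worker i receives i and computes i + 3.
res <- clusterApply(cl, 1:2, get("+"), 3)
print(unlist(res))

# An embarrassingly parallel task: independent resampled means.
clusterSetRNGStream(cl, 123)               # reproducible per-worker random streams
boot_means <- parSapply(cl, 1:8, function(i) mean(sample(1:100, replace = TRUE)))
print(length(boot_means))

stopCluster(cl)
```

Each task runs without talking to the others, which is what makes the computation embarrassingly parallel.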

(27)

Simple Network of Workstations (snow) for R (2)

    > cl <- makeCluster(c("itasca", "owasso"), type = "SOCK")
    > system.time(nuke.boot <-
    +   boot(nuke.data, nuke.fun, R=999, m=1,
    +     fit.pred=new.fit, x.pred=new.data))
    [1] 26.32 0.71 27.02 0.00 0.00
    > clusterEvalQ(cl, library(boot))
    > system.time(cl.nuke.boot <-
    +   clusterCall(cl,boot,nuke.data, nuke.fun, R=500, m=1,
    +     fit.pred=new.fit, x.pred=new.data))
    [1] 0.01 0.00 14.27 0.00 0.00
    > clusterApply(cl, 1:2, get("+"), 3)
    [[1]]
    [1] 4

    [[2]]

(28)

Simple Network of Workstations (snow) for R (2)

• Simple example R program using snow (ex3.R)

    require(snow)
    N<-1024
    cl <- makeCluster(getClusterOption("spec"))
    A<-array(rnorm(N*N),dim=c(N,N))
    B<-array(rnorm(N*N),dim=c(N,N))
    C<-parMM(cl,A,B)
    stopCluster(cl)

• qsub script (ex3.sh)

    #!/bin/csh
    #PBS -q q8
    #PBS -l ncpus=8
    #PBS -j oe
    cd /home0/nakanoj/itanium/work
    source /usr/local1/bin/env_local1.csh
    mpiexec -boot N R CMD BATCH --no-save ex3.R
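The idea behind a parallel matrix product such as parMM(cl, A, B) can be sketched without a cluster: split A into row blocks, multiply each block by B independently, and stack the results. As far as we understand, snow's parMM follows this row-block scheme, but the code below is a serial illustration under that assumption, with lapply standing in for the cluster call and n_workers as a hypothetical worker count.

```r
set.seed(1)
N <- 64
A <- array(rnorm(N * N), dim = c(N, N))
B <- array(rnorm(N * N), dim = c(N, N))
n_workers <- 4   # hypothetical number of workers

# Assign each row of A to one of n_workers contiguous blocks.
block_id <- cut(seq_len(N), breaks = n_workers, labels = FALSE)
row_blocks <- lapply(split(seq_len(N), block_id),
                     function(i) A[i, , drop = FALSE])

# Each block product is independent of the others; on a cluster,
# clusterApply(cl, row_blocks, get("%*%"), B) would replace this lapply.
partial <- lapply(row_blocks, function(Ai) Ai %*% B)

# Stack the partial results in block order to rebuild the full product.
C <- do.call(rbind, partial)
print(isTRUE(all.equal(C, A %*% B)))
```

Every worker needs all of B, so communication grows with the matrix size even though the arithmetic divides evenly.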

(29)

[Figure: "R−2.2.1 snow SOCK BLAS is GOTO_2". Vertical axis t(MFLOPS), 0 to 5000; horizontal axis 1000 to 6000. Legend: SOCK GOTO_2 2, SOCK GOTO_2 4, SOCK GOTO_2 8, SOCK GOTO_2 16, SOCK GOTO_2 32.]

(30)
(31)
(32)

Web interface for Parallel R (1)

• Demo …
• Purpose
  - Easy use without thinking about the job scheduler
  - Same interface for three super-computers
• Implementation
  - Web interface by Perl
  - Generating shell script for execution
  - Daemon for communicating with the job scheduler by C

(33)

Web interface for Parallel R (2)

• Difficulties that we faced
  - Weak support for large memory in 64-bit R
  - Security
  - Portability of R's C source code across various compilers
  - GNU Fortran for IA64 is too slow!
