R on super
R on super - - computers at computers at ISM ISM
Junji
Junji Nakano Nakano
The Institute of Statistical Mathematics The Institute of Statistical Mathematics
Ei Ei - - ji ji Nakama Nakama
COM COM - - ONE Inc. ONE Inc.
Abstract Abstract
The institute of statistical mathematics (ISM) The institute of statistical mathematics (ISM)
is a (national) research institute for statistical sciences is a (national) research institute for statistical sciences
provides several super provides several super - - computer systems for statistical computer systems for statistical research by Japanese statisticians.
research by Japanese statisticians.
We use R on these super We use R on these super - - computers and have made computers and have made several parallel computing functions available on it.
several parallel computing functions available on it.
These functions include These functions include
parallel BLAS replacements such as ATLAS and GOTO parallel BLAS replacements such as ATLAS and GOTO library
library
snow package for implementing parallel R functions using snow package for implementing parallel R functions using MPI and other distributed computing techniques.
MPI and other distributed computing techniques.
We also implement a Web environment to use parallel R We also implement a Web environment to use parallel R easily.
easily.
R and super
R and super - - computers (1) computers (1)
R is used widely for developing statistical R is used widely for developing statistical theory and real data analysis
theory and real data analysis
R sometimes requires huge amount of R sometimes requires huge amount of calculation such as
calculation such as
Huge matrix manipulations Huge matrix manipulations
Bootstrap or MCMC calculation Bootstrap or MCMC calculation
Data mining Data mining
Super Super - - computers may be useful for this computers may be useful for this purpose
purpose
R and super
R and super - - computers (2) computers (2)
Current super Current super - - computers equip computers equip
Many CPUs Many CPUs
However, single CPU of super computers is not so However, single CPU of super computers is not so different from ones in personal computers
different from ones in personal computers
Large memory Large memory
Almost all super Almost all super - - computers at present use computers at present use parallel computing techniques
parallel computing techniques
We use R on our super We use R on our super - - computers for computers for utilizing these facilities
utilizing these facilities
Parallel computing systems (1) Parallel computing systems (1)
Traditional 1 CPU Traditional 1 CPU system
system
Shared memory system Shared memory system
Distributed memory Distributed memory system
system
Memory CPU
Memory CPU
Memory CPU
Memory CPU
Memory
CPU CPU CPU
Parallel computing systems (2) Parallel computing systems (2)
Memory CPU
Memory
CPU CPU CPU
Memory CPU
Memory CPU
Memory CPU
Time
Serial Section
Parallelizable Sections
Single Computer Shared Memory Distributed Memory
Parallel computing systems (3) Parallel computing systems (3)
Software tools for parallel computing Software tools for parallel computing
Shared memory system Shared memory system
Process fork, Process fork, thread thread
OpenMP OpenMP
Distributed memory system Distributed memory system
Message Passing Message Passing
PVM (Parallel virtual machine) PVM (Parallel virtual machine)
MPI MPI (Messing passing interface) (mainly) (Messing passing interface) (mainly)
HPF (High performance Fortran) HPF (High performance Fortran)
Super
Super - - computers at ISM (1): computers at ISM (1): ismaltx ismaltx
SGI ALtix3700 Super Cluster SGI ALtix3700 Super Cluster
CPU: Intel Itanium2 1.3GHz CPU: Intel Itanium2 1.3GHz
64 x 4 = 256 CPUs (Total max speed: 64 x 4 = 256 CPUs (Total max speed:
1331.2 GFLOPS) 1331.2 GFLOPS)
Memory: total 1920 GB Memory: total 1920 GB
512 GB x 3 + 384 GB = 1920 GB 512 GB x 3 + 384 GB = 1920 GB
OS: OS: RedHat RedHat Enterprise Linux v.3 (for Enterprise Linux v.3 (for SGI) SGI)
Physical random number generator Physical random number generator boards (16 set)
boards (16 set)
Frontend Frontend Computers For Parallel Computers For Parallel Computer
Computer
SGI Altix3300 SGI Altix3300
8 CPU (Intel Itanium2 1.3GHz) 8 CPU (Intel Itanium2 1.3GHz)
32GB Memory 32GB Memory
Super
Super - - computers at ISM (2): computers at ISM (2): ismsr ismsr
HITACHI SR11000 HITACHI SR11000 Model H1
Model H1
CPU: IBM Power4+ CPU: IBM Power4+
1.7GHz 1.7GHz
16 x 4 = 64 (Total max 16 x 4 = 64 (Total max speed: 435.2 GFLOPS) speed: 435.2 GFLOPS)
Memory: total 128 GB Memory: total 128 GB
32 GB x 4 = 128 GB 32 GB x 4 = 128 GB
OS: (IBM) AIX 5L V5.2 OS: (IBM) AIX 5L V5.2
Super
Super - - computers at ISM (3): computers at ISM (3): ismxc ismxc
HP XC4000 System HP XC4000 System
CPU: AMD CPU: AMD Opteron Opteron 2.6GHz
2.6GHz
2 x 128 = 256 CPUs (Total 2 x 128 = 256 CPUs (Total max speed: 1331.2
max speed: 1331.2 GFLOPS)
GFLOPS)
Memory: total 640 GB Memory: total 640 GB
4 GB x 96 + 8 GB x 32 = 4 GB x 96 + 8 GB x 32 = 640 GB
640 GB
OS: RedHat OS: RedHat Enterprise Enterprise Linux v.3 (for HP)
Linux v.3 (for HP)
Physical random number Physical random number
generator boards (8 set)
generator boards (8 set)
BLAS BLAS
BLAS: Basic Linear Algebra Subprograms BLAS: Basic Linear Algebra Subprograms
high quality "building block" routines for performing high quality "building block" routines for performing basic vector and matrix operations
basic vector and matrix operations
Level 1 BLAS do vector Level 1 BLAS do vector - - vector operations vector operations
Level 2 BLAS do matrix Level 2 BLAS do matrix - - vector operations vector operations
Level 3 BLAS do matrix- Level 3 BLAS do matrix - matrix operations matrix operations
BLAS are efficient, portable, and widely available, BLAS are efficient, portable, and widely available, they're commonly used in
they're commonly used in the development of high quality linear algebra software, LINPACK and LAPACK for example
R uses BLAS for executing linear calculations R uses BLAS for executing linear calculations
ATLAS (1) ATLAS (1)
ATLAS: Automatically Tuned Linear ATLAS: Automatically Tuned Linear Algebra Software
Algebra Software
achieve performance on par with machine achieve performance on par with machine - - specific tuned libraries
specific tuned libraries
provides C and Fortran77 interfaces to a provides C and Fortran77 interfaces to a portably efficient
portably efficient BLAS BLAS implementation, as implementation, as well as a few routines from
well as a few routines from LAPACK LAPACK
Threads are available Threads are available
ATLAS FAQ says: ATLAS FAQ says:
ATLAS (2) ATLAS (2)
ATLAS is used, or is planned to be used, in the following PSEs ATLAS is used, or is planned to be used, in the following PSEs: :
MAPLE(v7 and higher) MAPLE (v7 and higher)
MATLAB(v6.0 and higher) MATLAB (v6.0 and higher)
Mathematica Mathematica (forthcoming) (forthcoming)
MAPLE(v7 and higher) MAPLE (v7 and higher)
Octave Octave
ATLAS is also included in: ATLAS is also included in:
Absoft Absoft Pro Fortran Pro Fortran
ATLAS may be optionally used by almost any project requiring the ATLAS may be optionally used by almost any project requiring the BLAS. BLAS.
Here are some projects that we have seen providing the option fo
Here are some projects that we have seen providing the option for using r using ATLAS:
ATLAS:
GSL GSL
HPL HPL
LAPACK LAPACK
MPB MPB
The R Project The R Project
Scientific Python Scientific Python
Scilab Scilab
PWSCF PWSCF
GOTO library GOTO library
Kazushige Goto's Kazushige Goto's library (libgoto library ( libgoto) is a highly tuned ) is a highly tuned blas blas library for various architectures. It is based on
library for various architectures. It is based on minimising minimising TLB misses. These BLAS libraries contains following
TLB misses. These BLAS libraries contains following functions
functions
Level 1 Level 1
Level 1 extentions Level 1 extentions
Level 2 Level 2
Level 3 Level 3
a part of LAPACK slaswp a part of LAPACK slaswp, sgetf2, , sgetf2, sgetrf sgetrf, , sgetrs sgetrs , , sgesv, sgesv , dlaswp dlaswp , , dgetf2,
dgetf2, dgetrf dgetrf, , dgetrs, dgetrs , dgesv dgesv, , claswp, cgetf2, claswp , cgetf2, cgetrf cgetrf, , cgetrs cgetrs, , cgesv cgesv, , zlaswp
zlaswp, zgetf2, , zgetf2, zgetrf zgetrf , , zgetrs, zgetrs , zgesv. zgesv .
The library can be downloaded from The library can be downloaded from Goto's Goto's web web pages pages
Threads are available. Threads are available .
Execute ATLAS R on
Execute ATLAS R on ismaltx ismaltx (1) (1)
Login to Login to ismaltx.ism.ac.jp ismaltx.ism.ac.jp by telnet by telnet
Many Rs Many Rs are stored in /usr/local1/ . are stored in /usr/local1/ .
Simple example of R program (ex1.R)Simple example of R program (ex1.R) {ismaltx 3}/usr/local1% ls
bin lib R-1.9.1.intel8 R-2.0.1.gccgoto dust log R-1.9.1.intel8atlas R-2.0.1.gccgotop etc man R-1.9.1.intel8goto R-2.0.1.intel8
example R-1.9.1 R-1.9.1.intel8gotop R-2.0.1.intel8atlas gcc-3.4.1 R-1.9.1.gcc R-1.9.1.intel8scs R-2.0.1.intel8atlasp gcc-3.4.2 R-1.9.1.gccatlas R-2.0.1.gcc R-2.0.1.intel8goto include R-1.9.1.gccgoto R-2.0.1.gccatlas R-2.0.1.intel8gotop info R-1.9.1.gccscs R-2.0.1.gccatlasp share
{ismaltx 4}/usr/local1%
matrixcalc <- function(m)
{ A<-array(rnorm(m^2),dim=c(m,m)) B<-array(rnorm(m^2),dim=c(m,m)) st<-proc.time()
A%*%B
et<-proc.time()
return((et-st)[3]) }
Execute ATLAS R on
Execute ATLAS R on ismaltx ismaltx (2) (2)
Shell file to choose a specific R (R.sh Shell file to choose a specific R ( R.sh) )
Write qsub Write qsub script for batch job (ex1.sh) script for batch job (ex1.sh)
#!/bin/sh RVER=2.0.1 COMP=intel8 BLAS=atlasp
export ATLAS_NUM_THREADS=8
export LD_LIBRARY_PATH=/usr/local1/lib/atlas
/atlas${ATLAS_NUM_THREADS}:${LD_LIBRARY_PATH}
/usr/local1/R-${RVER}.${COMP}${BLAS}/bin/R -q --no-save --args < $1
#!/bin/sh
#PBS -q q8
#PBS -j oe
#PBS -l ncpus=8
cd /home0/nakanoj/itanium/work umask 022
sh ./R.sh ex1.R
Execute ATLAS R on
Execute ATLAS R on
Execute them on ismaltx Execute them on ismaltx ismaltx ismaltx (3) (3)
ATLAS_NUM_THREADS=1 ATLAS_NUM_THREADS=1
{ismaltx 150}/home0/nakanoj/itanium/work% qsub ex1.sh 27404.ismaltx
{ismaltx 151}/home0/nakanoj/itanium/work% cat ex1.sh.o27404
> matrixcalc <- function(m)
+ { A<-array(rnorm(m^2),dim=c(m,m)) + B<-array(rnorm(m^2),dim=c(m,m)) + st<-proc.time()
+ A%*%B
+ et<-proc.time()
+ return((et-st)[3]) }
> print(matrixcalc(2048)) [1] 1.039062
>
>
{ismaltx 152}/home0/nakanoj/itanium/work%
{ismaltx 153}/home0/nakanoj/itanium/work% qsub ex1.sh 27405.ismaltx
{ismaltx 154}/home0/nakanoj/itanium/work% cat ex1.sh.o27405
> matrixcalc <- function(m)
+ { A<-array(rnorm(m^2),dim=c(m,m)) + return((et-st)[3]) }
ATLAS and GOTO on
ATLAS and GOTO on ismaltx ismaltx
0500100015002000
BLAS on ISMALTX
Matrix size
time(sec) 256 512 768 1024 1536 2048 2560 3072 3584 4096 5120 6144 7168 8192 10240 12288 14336 16384
ATLAS thread ATLAS goto
051015202530
BLAS on ISMALTX 256−4096
Matrix size
time(sec) 256 512 768 1024 1536 2048 2560 3072 3584 4096
ATLAS thread ATLAS goto
050010001500
GOTO on ISMALTX
time(sec)
1 thread 2 thread 4 thread 8 thread 16 thread 32 thread
0510152025
GOTO on ISMALTX 256−4096
time(sec)
1 thread 2 thread 4 thread 8 thread 16 thread 32 thread
0 5000 10000 15000 20000 25000 30000
020000400006000080000100000120000140000
R−2.2.1 + GotoBLAS−1.08 on IA64 Linux
N
MFLOPS
0 5000 10000 15000 20000 25000 30000
020000400006000080000100000120000140000
ismaltx4 gcc_GOTO_1 ismaltx3 gcc_GOTO_2 ismaltx3 gcc_GOTO_4 ismaltx3 gcc_GOTO_8 ismaltx4 gcc_GOTO_16 ismaltx2 gcc_GOTO_32 ismaltx2 gcc_GOTO_48 ismaltx2 gcc_GOTO_64
0 5000 10000 15000
010000200003000040000500006000070000
R−2.2.1 + GotoBLAS−1.02 on Power4+ AIX5.2L
N
MFLOPS
0 5000 10000 15000
010000200003000040000500006000070000
ismsr gcc_GOTO_1 srnd02 gcc_GOTO_2 ismsr gcc_GOTO_4 ismsr gcc_GOTO_8 ismsr gcc_GOTO_16
MPI MPI
De facto message passing library specification De facto message passing library specification
extended message extended message - - passing model passing model
not a language or compiler specification not a language or compiler specification
not a specific implementation or product not a specific implementation or product
For parallel computers, clusters, and (mainly) For parallel computers, clusters, and (mainly) homogeneous networks
homogeneous networks
Designed to provide access to advanced Designed to provide access to advanced
parallel hardware for end users, library writers, parallel hardware for end users, library writers,
and tool developers
and tool developers
MPI is Simple MPI is Simple
Many parallel programs can be written using just these Many parallel programs can be written using just these six functions, only two of which are non
six functions, only two of which are non- - trivial: trivial:
MPI_INIT MPI_INIT
MPI_FINALIZE MPI_FINALIZE
MPI_COMM_SIZE MPI_COMM_SIZE
MPI_COMM_RANK MPI_COMM_RANK
MPI_SEND MPI_SEND
MPI_RECV MPI_RECV
ScaLAPACK ScaLAPACK
ScaLAPACK ScaLAPACK (Scalable LAPACK) library (Scalable LAPACK) library
a subset of LAPACK a subset of LAPACK routines redesigned for distributed memory parallel routines redesigned for distributed memory parallel computers
computers
It is portable on any computer that supports It is portable on any computer that supports MPI MPI or PVM. or PVM.
Like Like LAPACK LAPACK , the ScaLAPACK , the ScaLAPACK routines are based on block- routines are based on block - partitioned partitioned algorithms in order to minimize the frequency of data movement b
algorithms in order to minimize the frequency of data movement between etween different levels of the memory hierarchy
different levels of the memory hierarchy
The fundamental building blocks of the ScaLAPACK The fundamental building blocks of the ScaLAPACK library are library are distributed memory versions (
distributed memory versions (PBLAS PBLAS) of the ) of the Level 1, 2 and 3 BLAS Level 1, 2 and 3 BLAS, and , and a set of Basic Linear Algebra Communication Subprograms (
a set of Basic Linear Algebra Communication Subprograms (BLACS BLACS ) for ) for communication tasks that arise frequently in parallel linear alg
communication tasks that arise frequently in parallel linear alg ebra ebra computations
computations
In the ScaLAPACK In the ScaLAPACK routines, all interprocessor routines, all interprocessor communication occurs communication occurs within the
within the PBLAS PBLAS and the BLACS and the BLACS
Execute
Execute RScaLAPACK RScaLAPACK on on ismaltx ismaltx
library(RScaLAPACK) m <- 2048
num_thread <- 4
matrixcalc<-function(m,thread) {
A<-array(rnorm(m^2),dim=c(m,m)) B<-array(rnorm(m^2),dim=c(m,m)) AB<-A%*%B
st<-proc.time()
sla.solve(A,AB,thread) et<-proc.time()
return((et-st)[3]) }
print(matrixcalc(m,num_thread))
Simple R example (ex2.R) Simple R example (ex2.R)
Script for qsub Script for qsub (ex2.sh) (ex2.sh)
#!/bin/sh
#PBS -q q32
#PBS -j oe
#PBS -l ncpus=32
cd /home0/nakano/itanium/work umask 022
export LD_LIBRARY_PATH=/usr/local1/lib/atlas/atlas1:$LD_LIBRARY_PATH
RScaLapack
RScaLapack on on ismaltx ismaltx
020040060080010001200
RScaLAPACK on ISMALTX all
CPU
time(sec)
nomp 1 2 4 8 16 32 64
1024 ^ 2 2048 ^ 2 4096 ^ 2 6144 ^ 2 8192 ^ 2 10240 ^ 2
050100150200250
RScaLAPACK on ISMALTX 1024 − 6144
CPU
time(sec)
nomp 1 2 4 8 16 32 64
1024 ^ 2 2048 ^ 2 4096 ^ 2 6144 ^ 2
Simple Network of Workstations Simple Network of Workstations
(snow) for R (1) (snow) for R (1)
The package The package snow snow (Simple Network of (Simple Network of
Workstations) implements a simple mechanism Workstations) implements a simple mechanism
for using a workstation cluster for for using a workstation cluster for
“ “ embarrassingly parallel embarrassingly parallel ” ” computations in R computations in R
Three low level interfaces have been Three low level interfaces have been
implemented, one based on sockets, one using implemented, one based on sockets, one using
PVM via the
PVM via the rpvm rpvm package and one using MPI, package and one using MPI, via the
via the Rmpi Rmpi package package
Simple Network of Workstations Simple Network of Workstations
(snow) for R (2) (snow) for R (2)
> cl <- makeCluster(c("itasca", "owasso"), type = "SOCK")
> system.time(nuke.boot <-
+ boot(nuke.data, nuke.fun, R=999, m=1, + fit.pred=new.fit, x.pred=new.data)) [1] 26.32 0.71 27.02 0.00 0.00
> clusterEvalQ(cl, library(boot))
> system.time(cl.nuke.boot <-
+ clusterCall(cl,boot,nuke.data, nuke.fun, R=500, m=1, + fit.pred=new.fit, x.pred=new.data))
[1] 0.01 0.00 14.27 0.00 0.00
> clusterApply(cl, 1:2, get("+"), 3) [[1]]
[1] 4
[[2]]
Simple Network of Workstations Simple Network of Workstations
(snow) for R (2) (snow) for R (2)
require(snow) N<-1024
cl <- makeCluster(getClusterOption("spec")) A<-array(rnorm(N*N),dim=c(N,N))
B<-array(rnorm(N*N),dim=c(N,N)) C<-parMM(cl,A,B)
stopCluster(cl)
Simple example R program using snow (ex3.R) Simple example R program using snow (ex3.R)
qsub qsub script (ex3.sh) script (ex3.sh)
#!/bin/csh
#PBS -q q8
#PBS -l ncpus=8
#PBS -j oe
cd /home0/nakanoj/itanium/work
source /usr/local1/bin/env_local1.csh
mpiexec -boot N R CMD BATCH --no-save ex3.R
1000 2000 3000 4000 5000 6000
010002000300040005000
R−2.2.1 snow SOCK BLAS is GOTO_2
t(MFLOPS)
1000 2000 3000 4000 5000 6000
010002000300040005000
SOCK GOTO_2 2 SOCK GOTO_2 4 SOCK GOTO_2 8 SOCK GOTO_2 16 SOCK GOTO_2 32