Two specific optimization techniques for speed and data management
Stefano M. Iacus
University of Milan - Italy
Two specific “not-to-loose performance”
techniques for speed and data management
Stefano M. Iacus
University of Milan - Italy
Japanese R Users’ Meeting
ISM - Tokyo, 8-12-2006
It is common knowledge that in R this sort of things
> x < - 0
> for ( i in 1 : 1 0 0 0 0 ) {
> x < - x + i
> }
> x
[1] 5 0 0 0 5 0 0 0
should be replaced by things like
> x < - sum ( 1 : 1 0 0 0 0 )
> x
[1] 5 0 0 0 5 0 0 0
Few R users knows why.
One of the main reasons for this is that R language uses a pass-by-value semantics
instead of a
pass-by-reference semantics
In a pass-by-reference semantics (C, etc.) arguments are passed to functions as pointers
In a pass-by-value semantics (S, R) arguments are passed to func- tions as copies
In a pass-by-reference semantics only an address (the pointer) is passed to the function and it is responsibility of the developer to correctly handle the memory corresponding to the object
v o i d m y F u n c ( d o u b l e * vec ){
if ( vec ! = N U L L )
f r e e ( vec ); / * agh ! * / }
v o i d m a i n ( v o i d ){
d o u b l e * m y v e c ;
m y v e c = m a l l o c (10 * s i z e o f ( d o u b l e ));
m y F u n c ( m y v e c );
m y v e c [2] = 1 . 0 ; / * p o t e n t i a l m e m o r y c o r r u p t i o n * / }
In a pass-by-value semantics a complete copy of each argument is passed to the function (this may be a big amount of memory) but potentially no risk to alter the original object
> m y F u n c < - f u n c t i o n ( vec ){
+ rm ( vec ) # d o e s not a f f e c t the o r i g i n a l m y v e c + }
>
> m y v e c < - n u m e r i c ( 1 0 )
> m y F u n c ( m y v e c )
> m y v e c [2] < - 1.0 # m y v e c is s t i l l a l i v e
> m y v e c
[1] 0 1 0 0 0 0 0 0 0 0
The main point behind the pass-by-value semantics is that S/R is an interactive language, thought mainly to experiment with data by incremental adjustments (development of new statistical models).
Indeed, if you have some data to analyze, you don’t want them to be altered by some subroutine! You may want to use them for further analyses.
This is why R is F-OOP (function based Object Oriented Language) and not C-OOP (class based OOP).
And yes, OOP it is not a synonym for C++ or Java.
Coming back to our original problem x < - 0
for ( i in 1 : 1 0 0 0 0 ) { x < - x + i
}
the statement x <- x + i involves the following steps:
• copy x and i
• pass these copies to the operator +, i.e. the function +(·,·)
• add this two values
• send back the result
• assign this result to x
• release the allocated copies of i and x.
then take a breath and iterate this again 9999 times!
There are cases in which instead of two numbers x and i you want to move a big object, like a data frame or a distance matrix, or when vectorization or the use ∗apply functions are not a solution and true for loops are required.
These are the cases of the sde and rrp packages which we will use as driving examples to describe two R programming techniques.
The sde package is about the simulation of solutions of Stochastic Differential Equations (SDE).
Solution to SDE’s are stochastic processes which are used in many contexts to describe stochastic evolutions with respect to time (fi- nancial time series, population dynamics, etc).
They looks like ordinary differential equations df(t) = g(f(t))dt plus a stochastic term
dX(t) = µ(X(t))dt + σ(X(t))dW(t)
where dW(t) is the “variation” of the additional stochastic noise
In the simplest case, it is assumed that dW(t) ∼ N(0,dt) so the following SDE
dX(t) = µ(X(t))dt + σ(X(t))dW(t) is discretized as follows
X(ti + ∆) = µ(X(ti))∆ + σ(X(ti))√
∆ ∗ Zi with Zi i.i.d. N(0,1). The R counterpart looks like
X [ i +1] < - X [ i ] + mu ( X [ i ]) * D e l t a + s i g m a ( X [ i ]) * s q r t ( D e l t a ) * Z [ i ]
using the notation Xi = X(ti)
Discretization schemes for SDE a true iterative algorithms, so they require pure for loops
Let’s try to force vectorization: consider the very simple SDE dXt = (θ1 − θ2Xt)dt + θ3dWt
Its discretization, after some manipulation, can be written as X(k) = X(1)· (1−θ2∆)k−1 +
k−1 X
j=1
(θ3 ∗√
∆ ∗Z(j))∗(1 −θ2∆)(k−j−1) Using linear algebra, this one can be vectorized as follows
t h e t a < - c (0 , 5 , 3 . 5 ) N < - 100
T < - 1 x < - 10
Z < - r n o r m ( N ) Dt < - 1 / N
A < - t h e t a [3] * s q r t ( Dt ) * Z
P < - (1 - t h e t a [2] * Dt ) ^ ( 0 : ( N - 1 ) ) X0 < - x
#
# and we o n l y n e e d one ‘ sapply ’ f u n c t i o n i n s t e a d of a for ()
#
X < - s a p p l y ( 2 : ( N +1) , f u n c t i o n ( x ) X0 * (1 - t h e t a [2] * Dt )^( x -1) + A [ 1 : ( x - 1 ) ] % * % P [( x - 1 ) : 1 ] )
The for loop version is as follows N < - 100
T < - 1 x < - 10
t h e t a < - c (0 , 5 , 3 . 5 ) Dt < - 1 / N
X < - n u m e r i c ( N +1) X [1] < - x
Z < - r n o r m ( N ) for ( i in 1: N )
X [ i +1] < - X [ i ] + ( t h e t a [1] - t h e t a [2] * X [ i ]) * Dt + t h e t a [3] * s q r t ( Dt ) * Z [ i ]
But, the one-line-hyper-vectorized version (OU.vec) is much slower than the standard for loop version (OU)
# d u m m y for - l o o p
> s y s t e m . t i m e ( OU (1 0 , 5 , 3 . 5) )
[1] 0 . 0 3 7 0 . 0 0 1 0 . 0 4 4 0 . 0 0 0 0 . 0 0 0
# one - line - hyper - v e c t o r i z e d
> s y s t e m . t i m e ( OU . vec (1 0 , 5 , 3 . 5) ) [1] 0 . 1 9 8 0 . 0 2 4 0 . 2 6 1 0 . 0 0 0 0 . 0 0 0
So maybe sometime the bottleneck is the user and not R!
Still vectorization is one of the key features of R if correctly used.
In particular, vectorization is effective only when the number of elementary calculations is not increased because of vectrorization (which was the case of the previous example)
Next is an example of good vectorization for the simulation of a path of the standard Brownian motion
BM .1 < - f u n c t i o n ( N = 1 0 0 0 0 ) { # b r u t a l c o d e W < - N U L L
for ( i in 2:( N + 1 ) ) # t e r r i b l e ! N e v e r do the f o l l o w i n g W < - c ( W , r n o r m (1) / s q r t ( N ))
W }
BM .2 < - f u n c t i o n ( N = 1 0 0 0 0 ) { # smarter , we a l l o c a t e f i r s t W < - n u m e r i c ( N +1)
Z < - r n o r m ( N )
for ( i in 2:( N + 1 ) )
W [ i ] < - W [ i -1] + Z [ i -1] / s q r t ( N ) W
}
BM . vec < - f u n c t i o n ( N = 1 0 0 0 0 ) # A W E S O M E ! c (0 , c u m s u m ( r n o r m ( N ) / s q r t ( N )))
This is the result
> set . s e e d ( 1 2 3 )
> s y s t e m . t i m e ( BM . 1 ( ) )
[1] 1 . 3 5 4 1 . 7 0 8 3 . 8 5 6 0 . 0 0 0 0 . 0 0 0
> set . s e e d ( 1 2 3 )
> s y s t e m . t i m e ( BM . 2 ( ) )
[1] 0 . 2 8 1 0 . 0 1 1 0 . 3 4 7 0 . 0 0 0 0 . 0 0 0
> set . s e e d ( 1 2 3 )
> s y s t e m . t i m e ( BM . vec ())
[1] 0 . 0 0 8 0 . 0 0 1 0 . 0 1 0 0 . 0 0 0 0 . 0 0 0
In usual Monte Carlo experiments for SDE we need to simulate paths longer than N = 100 steps and run at least 10000 replications.
So R for loops might make the analysis too slow.
It is easy in R to go at C level using .Call primitives and .Call do not pass copies of the object but pointers to the R object. There is still the need to allow for a generic specification of the model
dX(t) = µ(X(t))dt + σ(X(t))dW(t)
which means that we want to pass the user specified R functions µ and σ to the C subroutine.
go back to R
- -
C subroutine
execute a C for loop{
}
eval mu and sigma in R
R R
> sigma <- function(x) 1/x
> mu <- function(x) 3*x
> .Call(mu, sigma, ..)
> mu <- function(x) 3*x
> sigma <- function(x) 1/x
> .Call(mu, sigma, ..)
>
From the R side, the code should look like the following mu sde . sim < - f u n c t i o n ( X0 , Dt , N , drift , d i f f ){
r e t u r n ( . C a l l ( " sde _ sim " , X0 , Dt , as . i n t e g e r ( N ) , drift , diff , . G l o b a l E n v , P A C K A G E = " sde " ) )
}
where drift and diff are two user defined R function which are passed to C to be evaluated.
From the user point of view, this is the way to use it mu < - f u n c t i o n ( x ) -3 * x
s i g m a < - f u n c t i o n ( x ) s q r t ( x )
sde . sim (1 , 0.01 , 100 , mu , s i g m a ) mu1 < - f u n c t i o n ( x ) - x
s i g m a 1 < - f u n c t i o n ( x ) s q r t (1+ x ^2) sde . sim (1 , 0.01 , 100 , mu1 , s i g m a 1 )
The interesting part is the C side.
S E X P sde _ sim ( S E X P x0 , S E X P delta , S E X P N , S E X P d , S E X P s , S E X P rho ) {
/ / ( . . . ) d e t a i l s are s k i p p e d
/ / we n e e d to p r o t e c t our o b j e c t s a g a i n s t GC P R O T E C T ( x0 = AS _ N U M E R I C ( x0 ));
P R O T E C T ( d e l t a = AS _ N U M E R I C ( d e l t a ));
P R O T E C T ( c o r r = AS _ L O G I C A L ( c o r r ));
n = * I N T E G E R _ P O I N T E R ( N ); / / we a c c e s s to N as a p o i n t e r to i n t e g e r P R O T E C T ( X = NEW _ N U M E R I C ( n + 1 ) ) ; / / t h i s is a l l o c a t e d as R o b j e c t
rX = R E A L ( X ); / / now we can a c c e s s it as if it w e r e a malloc - ed C v e c t o r G e t R N G s t a t e (); / / we n e e d t h i s c a l l s t a r t s i m u l a t i n g
for ( i =1; i < n +1; i + + ) { T1 = T1 + D E L T A ;
Z = r n o r m (0 , sdt );
tmp = rX [ i - 1];
rX [ i ] = tmp + f e v a l ( T1 , tmp , d , rho ) * D E L T A + f e v a l ( T1 , tmp , s , rho ) * Z ; }
P u t R N G s t a t e (); / / w h e n d o n e we r e l e a s e it U N P R O T E C T ( 9 ) ; / / we u n p r o t e c t all
The C function feval is inteded to make it possible to evaluate true R functions call into R
d o u b l e f e v a l ( d o u b l e t , d o u b l e x , S E X P f , S E X P rho ) {
d o u b l e val = 0 . 0 ;
S E X P R _ fcall , tpar , x p a r ;
P R O T E C T ( t p a r = a l l o c V e c t o r ( REALSXP , 1 ) ) ; P R O T E C T ( x p a r = a l l o c V e c t o r ( REALSXP , 1 ) ) ; R E A L ( t p a r ) [ 0 ] = t ;
R E A L ( x p a r ) [ 0 ] = x ;
P R O T E C T ( R _ f c a l l = a l l o c L i s t ( 3 ) ) ; / / we a l l o c a t e a l i s t of l e n g t h 3 S E T C A R ( R _ fcall , f ); / / the f i r s t e l e m e n t is the f u n c t i o n to e v a l u a t e SET _ T Y P E O F ( R _ fcall , L A N G S X P ); / / we t e l l R t h a t t h i s is an
/ / ‘ e x p r e s s i o n ’ / ‘ l a n g u a g e ’ o b j e c t S E T C A D R ( R _ fcall , t p a r ); / / we set the s e c o n d : i . e . f ( t , ? ? ?
S E T C A D D R ( R _ fcall , x p a r ); / / and t h i r d e l e m e n t : f ( t , x ) < - c o m p l e t e now val = * N U M E R I C _ P O I N T E R ( e v a l ( R _ fcall , rho )); / / R , p l e a s e e v a l u a t e it
/ / in the e n v i r o n m e n t ‘ rho ’ U N P R O T E C T ( 3 ) ;
r e t u r n ( val );
}
and remind that we use it something like feval(T1, tmp,d,rho) in the previous sde sim C code.
Managing big objects
As said, when (big) objects are passed from one function to another they are copied. If you have a 32bit machine and your object is as big as 2.5∗ GB of memory, you can’t essentially work with it (in a finite time or at all) unless you work properly.
We have seen how to make use of R functions from C to keep our code flexible but fast (and the above is faster than its pure R coun- terpart).
But there are cases in which we want to work in the other direction.
For example, we want to allocate an object in C, say a big matrix or a very long vector, and we want to update or access it from R
R objects are not static and are garbage collected when their are no longer in use (so the user don’t need to take care of releasing them).
The only way to treat R as “temporary” static object is to go along the sequence .Call, PROTECT, etc.
The only exception are environment, which are not copied during function calls but passed by reference (but when an object of an environment is passed to function it gets copied anyway)
To get an idea. Consider the following trivial case: a matrix of 10000x10000 (i.e. a vector of length 1e8) is allocated. It is as small as 380MB of memory. This can be a distance matrix of a data set of 10000 observations (maybe not efficiently stored, but just to show the issue).
> gc ()
u s e d ( Mb ) gc t r i g g e r ( Mb ) max u s e d ( Mb ) N c e l l s 1 7 8 0 3 0 4.8 4 0 7 5 0 0 1 0 . 9 3 5 0 0 0 0 9.4 V c e l l s 7 3 9 6 9 0.6 7 8 6 4 3 2 6.0 3 3 6 6 7 9 2.6
> x < - 1 : ( 1 e8 )
> gc ()
u s e d ( Mb ) gc t r i g g e r ( Mb ) max u s e d ( Mb ) N c e l l s 1 7 8 2 1 2 4.8 4 0 7 5 0 0 1 0 . 9 3 5 0 0 0 0 9.4 V c e l l s 5 0 0 7 3 9 7 7 3 8 2 . 1 5 0 4 3 2 0 2 9 3 8 4 . 8 5 0 0 7 3 9 9 7 3 8 2 . 1
Now we add 1 to all the elements of x
> s y s t e m . t i m e ( x < - x +1)
[1] 2 . 8 5 7 2 . 7 3 9 5 . 6 0 7 0 . 0 0 0 0 . 0 0 0
> gc ()
u s e d ( Mb ) gc t r i g g e r ( Mb ) max u s e d ( Mb )
N c e l l s 1 7 8 2 9 1 4.8 4 0 7 5 0 0 1 0 . 9 3 5 0 0 0 0 9.4
V c e l l s 1 0 0 0 7 3 9 9 8 7 6 3 . 6 2 5 0 3 3 1 6 7 4 1 9 0 9 . 9 2 5 0 0 7 4 0 3 0 1 9 0 8 . 0
> 2 5 0 3 3 1 6 7 4 / 5 0 4 3 2 0 2 9 [1] 4 . 9 6 3 7 4 4
as you see, the size of x has been doubled. Moreover, this “simple”
summation required more than 5 seconds on a 6 GB, Dual 2.5GhZ machine on which R is the only active task!
Also notice how much memory is needed in this simple case Can you explain where this comes from?
Consider the following simple manipulation of vectors
> gc ()
u s e d ( Mb ) gc t r i g g e r ( Mb ) max u s e d ( Mb ) N c e l l s 1 7 8 0 3 0 4.8 4 0 7 5 0 0 1 0 . 9 3 5 0 0 0 0 9.4 V c e l l s 7 3 9 6 9 0.6 7 8 6 4 3 2 6.0 3 3 6 6 7 9 2.6
> x < - 1 : ( 1 e8 )
> gc ()
u s e d ( Mb ) gc t r i g g e r ( Mb ) max u s e d ( Mb ) N c e l l s 1 7 8 2 1 2 4.8 4 0 7 5 0 0 1 0 . 9 3 5 0 0 0 0 9.4 V c e l l s 5 0 0 7 3 9 7 7 3 8 2 . 1 5 0 4 3 2 0 2 9 3 8 4 . 8 5 0 0 7 3 9 9 7 3 8 2 . 1
> y < - 1+ x
> gc ()
u s e d ( Mb ) gc t r i g g e r ( Mb ) max u s e d ( Mb ) N c e l l s 1 7 8 2 1 6 4.8 4 0 7 5 0 0 1 0 . 9 3 5 0 0 0 0 9.4 V c e l l s 1 5 0 0 7 3 9 7 7 1 1 4 5 . 0 2 5 0 3 3 1 6 5 3 1 9 0 9 . 9 2 5 0 0 7 3 9 9 6 1 9 0 8 . 0
> z < - 1+ y
> z < - x + y
R ( 1 9 1 9 3 ) m a l l o c : * * * vm _ a l l o c a t e ( s i z e = 8 0 0 0 0 2 0 4 8 ) f a i l e d ( e r r o r c o d e =3) R ( 1 9 1 9 3 ) m a l l o c : * * * e r r o r : can ’ t a l l o c a t e r e g i o n
R ( 1 9 1 9 3 ) m a l l o c : * * * set a b r e a k p o i n t in s z o n e _ e r r o r to d e b u g
R ( 1 9 1 9 3 ) m a l l o c : * * * vm _ a l l o c a t e ( s i z e = 8 0 0 0 0 2 0 4 8 ) f a i l e d ( e r r o r c o d e =3) R ( 1 9 1 9 3 ) m a l l o c : * * * e r r o r : can ’ t a l l o c a t e r e g i o n
Although there are several ways to handle big objects not in memory, for example storing them in DB servers or using the new package SQLiteDF, for the reason of speed or because your procedure cannot work if not on the whole data set some different approach should be taken.
To show a concrete application of the proposed approach we intro- duce a motivating example: the rrp package.
The rrp package implements the Random Recursive Partitioning al- gorithm which is essentially a method to obtain a proximity measure between observations in a data set using random partitioning of the set of observations.
The initial proximity is a 0-matrix. In each random partition, if two observations i and j are in the same cell, the proximity is incremented by 1. After R replications, the final proximity is divided by R.
−2 −1 0 1 2
0 2 4 6 810
−1.5
−1.0
−0.5 0.0
0.5 1.0
1.5 2.0
x2
x1
y
●
●
●
●
●
●
●
●
●
●
x2>=0.5996|
x1>=1.01
x1< 0.2951
x1>=−0.8554
x2< −0.1811 2
4
5.5 7.5
9
10
●
● ●
●
●
●
●
●
●
●
x1
x2
−0.8554 0.2951 1.01
−0.18110.5996
2 4 6 810
−1.0
−0.5 0.0
0.5 1.0
1.5 2.0 x1
y
●
●
●
●
●
●
●
●
● ●
x1< −0.6237|
x1>=0.0999
x1< 1.01 x2< −0.1811
1.5
●
● ●
●
●
●
●
●
●
x2 −0.1811
−2 −1 0 1 2
0 2 4 6 810
−1.5
−1.0
−0.5 0.0
0.5 1.0
1.5 2.0
x2
x1
y
●
●
●
●
●
●
●
●
●
●
x1>=−0.6237|
x1< 1.01
x2< −0.5143
x2>=−0.1811
x1< −0.3953 1
2 4
7
7.5 9
●
● ●
●
●
●
●
●
●
●
x1
x2
−0.6237 1.01
−0.5143
To store a dissimilarity/proximity/distance matrix for n observations (because of symmetry) we can just retain the upper/lower triangular matrix, a whole of n∗(n−1)/2 observations. These are usually stored in memory as vectors with some attributes.
The problem is that the matrix has to be updated in a for loop and the update may lead to the problem of allocation already mentioned.
To avoid such problem one strategy is to allocate the object at C level, pass the reference of this object back to R, do the slicing and then turn back to C for the update.
But how? One way to do this is to use External Pointer objects externalPtr are are R objects which contain just the address of the underlying raw data object allocated at C level.
It is not possible to change their value(s) or retrieve them without using accessor functions
But they are constructed so that R can protect and garbage collect then as needed
Of course, they can be manipulated quite efficiently at C level
S E X P n e w X P t r ( S E X P n , S E X P k ){
int i , N , * np ;
S E X P distC , distR , dim ; d o u b l e K , * kp , * d ;
if ( ! i s N u m e r i c ( k )) e r r o r ( " ‘k ’ m u s t be n u m e r i c " );
if ( ! i s I n t e g e r ( n )) e r r o r ( " ‘n ’ m u s t be an i n t e g e r " );
np = I N T E G E R _ P O I N T E R ( n );
kp = N U M E R I C _ P O I N T E R ( k );
N = * np ; K = * kp ;
P R O T E C T ( d i s t C = a l l o c V e c t o r ( REALSXP , N * ( N -1) / 2 ) ) ; d = R E A L ( d i s t C );
for ( i =0; i < N * ( N -1) / 2; i ++) d [ i ] = K ;
d i s t R = R _ M a k e E x t e r n a l P t r ( d , RRP _ d i s t _ tag , d i s t C );
U N P R O T E C T ( 1 ) ; r e t u r n ( d i s t R );
}
The C function newXPtr uses R MakeExternalPtr function to make the
Let us have a closer look
S E X P R _ M a k e E x t e r n a l P t r ( v o i d * p , S E X P tag , S E X P p r o t );
p : the C pointer
tag : a tag, a sort of signature for this object prot : the protected SEXP
and in the code we have S E X P distC , d i s t R ;
P R O T E C T ( d i s t C = a l l o c V e c t o r ( REALSXP , N * ( N -1) / 2 ) ) ; d = R E A L ( d i s t C );
d i s t R = R _ M a k e E x t e r n a l P t r ( d , RRP _ d i s t _ tag , d i s t C );
U N P R O T E C T ( 1 ) ; r e t u r n ( d i s t R );
What we see from the R side?
The function newXPtr creates an object of size n ∗(n−1)/2 and fills it up with a constant k
n e w X P t r < - f u n c t i o n ( n , k =0) {
tmp < - . C a l l ( " n e w X P t r " , as . i n t e g e r ( n ) , as . n u m e r i c ( k ) , P A C K A G E = " rrp " )
a t t r i b u t e s ( tmp ) < - l i s t ( c l a s s = c ( c l a s s ( tmp ) , " X P t r " ) , S i z e = n ) r e t u r n ( i n v i s i b l e ( tmp ))
}
> a < - n e w X P t r (10 , 1)
> a
< p o i n t e r : 0 x 9 1 7 d 7 d 8 >
a t t r ( , " c l a s s " )
[1] " e x t e r n a l p t r " " X P t r "
So XPtr object from the rrp package are just pointers to C objects which are essentially equivalent to R dist objects, with the difference that dist objects are always copied.
> a < - n e w X P t r (10 , 1)
> a
< p o i n t e r : 0 x 9 1 7 d 7 d 8 >
a t t r ( , " c l a s s " )
[1] " e x t e r n a l p t r " " X P t r "
a t t r ( , " S i z e " ) [1] 10
To access or manipulated externalPtr we need to create accessor functions. For example, like the following XPtrToDist
> ( X P t r T o D i s t ( a ))
1 2 3 4 5 6 7 8 9
2 1
3 1 1
4 1 1 1
5 1 1 1 1
6 1 1 1 1 1
7 1 1 1 1 1 1
8 1 1 1 1 1 1 1
9 1 1 1 1 1 1 1 1
10 1 1 1 1 1 1 1 1 1
This function is made of two parts
X P t r T o D i s t < - f u n c t i o n ( d ) {
tmp < - . C a l l ( " X P t r T o N u m e r i c " , d , P A C K A G E = " rrp " )
a t t r i b u t e s ( tmp ) < - l i s t ( S i z e = a t t r ( d , " S i z e " ) , c l a s s = " d i s t " ) r e t u r n ( i n v i s i b l e ( tmp ))
}
and
S E X P X P t r T o N u m e r i c ( S E X P X P t r ) {
S E X P val ;
C H E C K _ RRP _ O B J E C T ( X P t r );
P R O T E C T ( val = R _ E x t e r n a l P t r P r o t e c t e d ( X P t r ));
U N P R O T E C T ( 1 ) ; r e t u r n ( val );
}
Then, we can assign the result somewhere and R will think this is a dist object which might be used, for example, in cluster analysis
> x < - X P t r T o D i s t ( a )
> str ( x )
C l a s s ’ d i s t ’ a t o m i c [ 1 : 4 5 ] 1 1 1 1 1 1 1 1 1 1 ...
.. - a t t r ( * , " S i z e " )= num 10
> x
1 2 3 4 5 6 7 8 9
2 1
3 1 1
4 1 1 1
5 1 1 1 1
6 1 1 1 1 1
7 1 1 1 1 1 1
8 1 1 1 1 1 1 1
9 1 1 1 1 1 1 1 1
10 1 1 1 1 1 1 1 1 1
> as . d i s t ( m a t r i x (1 ,10 ,10)) 1 2 3 4 5 6 7 8 9
2 1
3 1 1
4 1 1 1
5 1 1 1 1
6 1 1 1 1 1
7 1 1 1 1 1 1
8 1 1 1 1 1 1 1
9 1 1 1 1 1 1 1 1
10 1 1 1 1 1 1 1 1 1
In R, dist objects to do not have easy indexing as matrix, because they are just vector even if they are represented as triangular matrix.
In the rrp package, such a dist (XPtr) object has to be updated for groups of observations, so I wanted something similar to this
> M < - m a t r i x (0 ,5 ,5)
> M [ 1 : 3 , 1 : 3 ] < - M [ 1 : 3 , 1 : 3 ] - 1
> M [ 4 : 5 , 4 : 5 ] < - M [ 4 : 5 , 4 : 5 ] + 1
>
> M
[ ,1] [ ,2] [ ,3] [ ,4] [ ,5]
[1 ,] -1 -1 -1 0 0
[2 ,] -1 -1 -1 0 0
[3 ,] -1 -1 -1 0 0
[4 ,] 0 0 0 1 1
[5 ,] 0 0 0 1 1
> as . d i s t ( M )
1 2 3 4
This is done using some accessor function addXPtr in which a list of indexes and a list of values is passed to the function
> d < - n e w X P t r (5 ,0)
> x < - l i s t (1:3 , 4 : 5 )
>
> a d d X P t r ( d , x , c ( -1 ,+1))
>
> ( X P t r T o D i s t ( d ))
1 2 3 4
2 -1 3 -1 -1
4 0 0 0
5 0 0 0 1
But this way is very efficient and fast because in
> M [ 1 : 3 , 1 : 3 ] < - M [ 1 : 3 , 1 : 3 ] - 1
the “copy” effect arises. Again, imagine to do this on dist objects or matrixes of dimension 20000x20000
Internally, addXPtr does the dirty job of accessing the current posi- tion of the vector from matrix-like notation.
S E X P a d d X P t r ( S E X P XPtr , S E X P x , S E X P k ) {
d = R _ E x t e r n a l P t r A d d r ( X P t r ); # get the p o i n t e r P R O T E C T ( x = AS _ L I S T ( x ));
P R O T E C T ( k = AS _ N U M E R I C ( k ));
len = L E N G T H ( R _ E x t e r n a l P t r P r o t e c t e d ( X P t r )); # get the ‘ prot ’ for ( h =0; h < n2 ; h + + ) {
n1 = L E N G T H ( V E C T O R _ ELT ( x , h ));
x I d x = I N T E G E R _ P O I N T E R ( AS _ I N T E G E R ( V E C T O R _ ELT ( x , h ) ) ) ; for ( i =0; i < n1 -1; i + + ) {
for ( j = i +1; j < n1 ; j + + ) {
pos = ( x I d x [ i ] -1) * (2 * N - x I d x [ i ]) / 2 + x I d x [ j ] - x I d x [ i ];
d [ pos -1] = d [ pos -1] + k a p p a [ h ];
} }
As said, dist objects have no natural indexing in R. Some solution exists in the Bioconductor package, but they are not as fast or as flexible as in our approach.
The current solution permitted to increase the dimension of the data set which can be handled by the RRP algorithm. It also made the algorithm about 10 times faster!
Concluding remarks
We have seen that to obtain the best performance out of R some knowledge on how the language works and is designed are needed.
We have shown a couple of applications where the need was to go from R to C or vice versa.
R is good in many tasks, so the average user will probably never need to go at C level and the good developer do not need to go at C level as well, unless strictly needed.
The knowledge on when this is a real needed is the knowledge about R itself.