
Accelerate R Applications with CUDA

R is a free software environment for statistical computing and graphics that provides a programming language and built-in libraries of mathematical operations for statistics, data analysis, machine learning and much more. Many domain experts and researchers use the R platform and contribute R software, resulting in a large ecosystem of free software packages available through CRAN (the Comprehensive R Archive Network).

However, R, like many other high-level languages, is not performance competitive out of the box with lower-level languages like C++, especially for highly data- and computation-intensive applications. R programs tend to process large amounts of data, and often have significant independent data and task parallelism, so R applications stand to benefit from GPU acceleration. This way, R users can keep R’s high-level, user-friendly interface while achieving high performance.

In this post, I will introduce the computation model of R with GPU acceleration, focusing on three topics:

  • accelerating R computations using CUDA libraries;
  • calling your own parallel algorithms written in CUDA C/C++ or CUDA Fortran from R; and
  • profiling GPU-accelerated R applications using the CUDA Profiler.

The GPU-Accelerated R Software Stack

Figure 1 shows that there are two ways to apply the computational power of GPUs in R:

  1. use R GPU packages from CRAN; or
  2. access the GPU through CUDA libraries and/or CUDA-accelerated programming languages, including C, C++ and Fortran.
Figure 1: The R + GPU software stack.

The first approach is to use existing GPU-accelerated R packages listed under High-Performance and Parallel Computing with R on the CRAN site. Examples include gputools and cudaBayesreg. These packages are very easy to install and use. On the other hand, the number of GPU packages is currently limited, quality varies, and only a few domains are covered. This will improve with time.
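
For instance, here is a minimal sketch of the package approach, assuming gputools installs cleanly against your CUDA toolkit; gpuMatMult() is its GPU matrix-multiply routine.

library(gputools)

A <- matrix(rnorm(1000 * 1000), nrow = 1000)
B <- matrix(rnorm(1000 * 1000), nrow = 1000)

C_cpu <- A %*% B             # built-in (CPU) matrix multiply
C_gpu <- gpuMatMult(A, B)    # GPU matrix multiply from gputools

# allow for floating-point differences between the two results
all.equal(C_cpu, C_gpu, tolerance = 1e-4)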

The second approach is to use the GPU through CUDA directly. While the CUDA ecosystem provides many ways to accelerate applications, R cannot directly call CUDA libraries or launch CUDA kernel functions. To solve this problem, we need to build an interface to bridge R and CUDA, as the development layer of Figure 1 shows. For productivity, we should also encapsulate these interface functions in the R environment, so that the technical details of switching between the CPU and GPU are transparent to the R user. (Note that this wrapper is optional; the user can also call the interface function directly from R.)

Accelerate R using CUDA Libraries

As you know, there are many GPU-accelerated libraries (from NVIDIA as well as third-party and open-source libraries) that provide excellent usability, portability and performance. Using GPU-accelerated libraries reduces development effort and risk, while providing support for many NVIDIA GPU devices with high performance. Thus, CUDA libraries are a quick way to speed up applications, without requiring the R user to understand GPU programming.

I will show you step-by-step how to use CUDA libraries in R on the Linux platform. My example for this post uses cuFFT (version 6.5), but it is easy to use other libraries in your application with the same development flow.

The FFT Target Function

For this example, we will implement an accelerated version of R’s built-in 1D FFT. Here is the description of R’s fft function.

Fast Discrete Fourier Transform
Description
    Performs the Fast Fourier Transform of an array.
Usage
    fft(z, inverse = FALSE)
Arguments
z       :   a real or complex array containing the values to be 
            transformed.
inverse :   if TRUE, the unnormalized inverse transform is computed 
            (the inverse has a + in the exponent of e, but here, we 
            do not divide by 1/length(x)).
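
To make the normalization caveat concrete, here is a quick round trip with the built-in fft: because the inverse transform is unnormalized, we divide by length(z) to recover the input.

z  <- complex(real = rnorm(8), imaginary = rnorm(8))
zf <- fft(z)
max(Mod(fft(zf, inverse = TRUE) / length(z) - z))  # ~1e-16, i.e. machine precision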

Writing an Interface Function

First of all, we need to implement the interface code between R and C. The functions we write with (CUDA) C need to satisfy several properties.

  1. C functions called by R must have a void return type, and should return results via the function arguments.
  2. R passes all arguments to C functions by reference (using pointers), so we must dereference the pointers in C (even scalar input arguments).
  3. Each file containing C functions to be called from R should include R.h and the header files for any CUDA libraries used by the function.

The following code defines our interface function.

#include <R.h>
#include <cufft.h>

/* This function is written for R to compute 1D FFT.
   n            - [IN] the number of complex values to transform
   inverse      - [IN] set to 1 to compute the inverse transform
   h_idata_re   - [IN] input data from host (R, real part)
   h_idata_im   - [IN] input data from host (R, imaginary part)
   h_odata_re   - [OUT] results (real) allocated by caller
   h_odata_im   - [OUT] results (imaginary) allocated by caller
*/
extern "C"
void cufft(int *n, int *inverse, double *h_idata_re,
           double *h_idata_im, double *h_odata_re, double *h_odata_im)
{
  cufftHandle plan;
  cufftComplex *d_data, *h_data;
  cudaMalloc((void**)&d_data, sizeof(cufftComplex)*(*n));
  h_data = (cufftComplex *) malloc(sizeof(cufftComplex) * (*n));

  // Convert the input data to the cufftComplex type
  for(int i=0; i< *n; i++) {
    h_data[i].x = h_idata_re[i];
    h_data[i].y = h_idata_im[i];
  }

  cudaMemcpy(d_data, h_data, sizeof(cufftComplex) * (*n), 
             cudaMemcpyHostToDevice);

  // Use the CUFFT plan to transform the signal in place.
  cufftPlan1d(&plan, *n, CUFFT_C2C, 1);

  if (!*inverse) {
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
  } else {
    cufftExecC2C(plan, d_data, d_data, CUFFT_INVERSE);
  }

  cudaMemcpy(h_data, d_data, sizeof(cufftComplex) * (*n), 
             cudaMemcpyDeviceToHost);

  // Split the cufftComplex results back into double arrays
  for(int i=0; i<*n; i++) {
    h_odata_re[i] = h_data[i].x;
    h_odata_im[i] = h_data[i].y;
  }

  /* Destroy the CUFFT plan and free memory. */
  cufftDestroy(plan);
  cudaFree(d_data);
  free(h_data);
}

This file includes two header files, “R.h” and “cufft.h”. We write a C function called cufft, and declare it extern "C" so that we can call it from R. The first four arguments are input arguments and the last two are output arguments, which we use to pass computation results back to R. Because of data type differences between R and cuFFT, we pass the real and imaginary parts separately and combine them to fill the array of cufftComplex structures. We then copy the complex array to the device.

We call cufftExecC2C() to perform the 1D FFT, choosing CUFFT_FORWARD or CUFFT_INVERSE depending on the value of the inverse argument. Then we copy the result back from device to host and assign it to the output arguments h_odata_re and h_odata_im. Finally, we destroy the cuFFT plan and free the allocated memory. Note that cufftComplex holds single-precision values, so converting R’s double-precision vectors to and from it loses some precision; if you need full double precision, use cufftDoubleComplex with a CUFFT_Z2Z plan and cufftExecZ2Z() instead.

Compile and Link a Shared Object

Now that we’ve written our interface function, we need to compile and link it into a shared object file (.so). We compile the source code to an object file with the optimization option -O3, and then link the object file with libcufft and libR into a shared object file. (NOTE: if you don’t find libR.so in your R lib directory, you can rebuild R with the --enable-R-shlib configure option.)

nvcc -O3 -arch=sm_35 -I/usr/local/cuda/r65/include \
    -I/home/patricz/tools/R-3.0.2/include/ \
    -L/home/patricz/tools/R/lib64/R/lib -lR \
    -L/usr/local/cuda/r65/lib64 -lcufft \
    --shared -Xcompiler -fPIC -o cufft.so cufft-R.cu

Load the Shared Object in an R Wrapper

Now that we have our interface function in a shared object, we just need to load cufft.so and call our new API. We load external C code in R using the function dyn.load(), and we can use the is.loaded() function to check whether our API is available in the R environment.
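
For example:

dyn.load("cufft.so")   # load the shared object into the R session
is.loaded("cufft")     # TRUE once the symbol is visible to .C()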

Once our function is loaded, we call it using R’s .C() function, passing our function name (cufft) as the first argument, followed by the arguments for cufft(). Importantly, we must tell R the explicit type of each argument, using as.<type>(). But it’s tedious and error-prone to call .C() every time we need our custom function, so the best practice is to write an interface wrapper in R. As the following code shows, the wrapper loads our shared object, calls cufft() via .C(), and then converts the output into the complex data type. Now we can use this wrapper routine in the same way as R’s built-in FFT.

cufft1D <- function(x, inverse=FALSE)
{
  if(!is.loaded("cufft")) {
    dyn.load("cufft.so")
  }
  n   <- length(x)
  rst <- .C("cufft",
               as.integer(n),
               as.integer(inverse),
               as.double(Re(x)),
               as.double(Im(x)),
               re=double(length=n),
               im=double(length=n))
  rst <- complex(real = rst[["re"]], imaginary = rst[["im"]])
  return(rst)
}

Execution and Testing

So far, so good! Let’s try our new cufft1D function in R. You can see that we can now use the GPU-accelerated FFT as easily as R’s built-in FFT, and that both versions compute the same result.

> source("wrap.R")
> num <- 4
> z <- complex(real = stats::rnorm(num), imaginary = stats::rnorm(num))
> (cpu <- fft(z))
[1]  1.140821-1.352756i -3.782445-5.243686i  1.315927+1.712350i -0.249490+1.470354i
> (gpu <- cufft1D(z))
[1]  1.140821-1.352756i -3.782445-5.243686i  1.315927+1.712350i -0.249490+1.470354i
> (cpu <- fft(z, inverse=T))
[1]  1.140821-1.352756i -0.249490+1.470354i  1.315927+1.712350i -3.782445-5.243686i
> (gpu <- cufft1D(z, inverse=T))
[1]  1.140821-1.352756i -0.249490+1.470354i  1.315927+1.712350i -3.782445-5.243686i
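
Because the interface converts the data to single precision on the device, the GPU result can differ from R’s double-precision FFT in the last digits. A tolerance-based comparison is therefore the safer way to validate:

# allow for the single-precision arithmetic on the device
all.equal(fft(z), cufft1D(z), tolerance = 1e-5)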

Now, let’s see what performance we can gain from GPU acceleration. We tested on an 8-core Intel® Xeon® CPU (E5-2609 @ 2.40GHz with 64GB RAM) and an NVIDIA Tesla K20Xm GPU with 6GB of device memory. As Figure 3 shows, cuFFT provides a 3x-8x speedup compared with R’s built-in FFT.

Figure 3: Performance improvement from cuFFT in R.
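
To reproduce a rough comparison on your own hardware, you can time both versions directly. This is a minimal sketch; note that the first GPU call also pays a one-time CUDA initialization cost, so warm up before timing.

num <- 2^24
z   <- complex(real = rnorm(num), imaginary = rnorm(num))
invisible(cufft1D(z))    # warm-up: CUDA context creation and module load
system.time(fft(z))      # CPU
system.time(cufft1D(z))  # GPU, including host-device transfers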

Accelerate R using CUDA C/C++/Fortran

When R GPU packages and CUDA libraries don’t offer the functionality you need, you can write custom GPU-accelerated code using CUDA. The process is very similar to our previous example of a CUDA library call; the only difference is that you need to write a parallel function yourself.

In the following example code, we implement a vector addition kernel, vectorAdd, and launch it from the interface function gvectorAdd. In practice you can implement any algorithm this way and call it from an interface function. The remaining steps (building a shared object and writing an R wrapper) are similar to my previous example; a sketch of the R wrapper follows the code below.

extern "C" void  gvectorAdd(double *A, double *B, double *C, int *n);

__global__ void
vectorAdd(const double *A, const double *B, double *C, int numElements)
{
  int i = blockDim.x * blockIdx.x + threadIdx.x;

  if(i < numElements)
  {
    C[i] = A[i] + B[i];
  }
}

void gvectorAdd(double *A, double *B, double *C, int *n) 
{
  // Device Memory
  double *d_A, *d_B, *d_C;

  // Define the execution configuration
  dim3 blockSize(256,1,1);
  dim3 gridSize(1,1,1);

  gridSize.x = (*n + blockSize.x - 1) / blockSize.x;

  // Allocate device memory
  cudaMalloc((void**)&d_A, *n * sizeof(double));
  cudaMalloc((void**)&d_B, *n * sizeof(double));
  cudaMalloc((void**)&d_C, *n * sizeof(double));

  // copy data to device
  cudaMemcpy(d_A, A, *n * sizeof(double), cudaMemcpyHostToDevice);
  cudaMemcpy(d_B, B, *n * sizeof(double), cudaMemcpyHostToDevice);

  // GPU vector add
  vectorAdd<<<gridSize, blockSize>>>(d_A, d_B, d_C, *n);

  // Copy output
  cudaMemcpy(C, d_C, *n * sizeof(double), cudaMemcpyDeviceToHost);

  cudaFree(d_A);
  cudaFree(d_B);
  cudaFree(d_C);
}
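
To complete the example, here is a sketch of the corresponding R wrapper. It assumes the code above has been compiled into a shared object named vectorAdd.so (using an nvcc command analogous to the cuFFT build above); the object name and the wrapper name gpuVectorAdd are placeholders you can change.

gpuVectorAdd <- function(a, b)
{
  if(!is.loaded("gvectorAdd")) {
    dyn.load("vectorAdd.so")   # assumed shared object name
  }
  n   <- length(a)             # assumes length(a) == length(b)
  rst <- .C("gvectorAdd",
               as.double(a),
               as.double(b),
               c=double(length=n),
               as.integer(n))
  return(rst[["c"]])
}

# Usage: gpuVectorAdd(c(1, 2, 3), c(4, 5, 6)) should return 5 7 9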

CUDA Performance Profiling

When developing GPU code or using GPU packages in R, you may encounter functional and performance problems. To solve these problems efficiently, you can use CUDA developer tools such as cuda-gdb, Nsight, and the NVIDIA Visual Profiler or nvprof. For this example, I will show you how to profile our cuFFT example above using nvprof, the command-line profiler included with the CUDA Toolkit (check out the post about how to use nvprof to profile any CUDA program). My testing environment is R 3.0.2 on a 12-core Intel® Xeon® CPU (E5645 @ 2.40GHz with 24GB RAM) combined with an NVIDIA Tesla K20Xm GPU with 6GB of device memory.

There are two approaches to launching nvprof with R.

  1. Use nvprof as a wrapper to launch R by typing nvprof R, then run the GPU solver and exit; or use nvprof to launch R with a batch script: nvprof R CMD BATCH script.R (a minimal script.R sketch follows this list).
  2. Launch nvprof --profile-all-processes in a separate console. R executes the GPU solver and exits as normal, and nvprof in the other console captures the GPU behavior and prints the profile.
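
For the batch approach, a minimal script.R might look like the following (a sketch, assuming the wrap.R wrapper from the cuFFT example above):

# script.R: a small GPU workload for nvprof to capture
source("wrap.R")
num <- 2^26
z <- complex(real = rnorm(num), imaginary = rnorm(num))
invisible(cufft1D(z))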

Here is the nvprof output for our FFT wrapper function, for a signal of 2^26 elements:

==18067== Profiling application: /home/patricz/tools/R/lib64/R/bin/exec/R
==18067== Profiling result:
Time(%)      Time       Calls   Avg       Min       Max     Name
 55.41%  277.50ms         1  277.50ms  277.50ms  277.50ms  [CUDA memcpy HtoD]
 39.30%  196.82ms         1  196.82ms  196.82ms  196.82ms  [CUDA memcpy DtoH]
  1.35%  6.7531ms         1  6.7531ms  6.7531ms  6.7531ms  void spRadix0128B::kernel1Mem(Complex*, Complex const *, unsigned int, unsigned int, unsigned int, divisor_t, Complex const *, Complex const *, coordDivisors_t, Coord, Coord, unsigned int, unsigned int, float, int, int)
  1.34%  6.7084ms         1  6.7084ms  6.7084ms  6.7084ms  void spRadix0032B::kernel1Mem(Complex*, Complex const *, unsigned int, unsigned int, unsigned int, divisor_t, Complex const *, Complex const *, coordDivisors_t, Coord, Coord, unsigned int, unsigned int, float, int, int)
  1.30%  6.5350ms         1  6.5350ms  6.5350ms  6.5350ms  void spRadix0128C::kernel1Mem(Complex*, Complex const *, unsigned int, unsigned int, unsigned int, divisor_t, Complex const *, Complex const *, coordDivisors_t, Coord, Coord, unsigned int, unsigned int, float, int, int)
  1.30%  6.4979ms         1  6.4979ms  6.4979ms  6.4979ms  void spRadix0128B::kernel3Mem(Complex*, Complex const *, unsigned int, unsigned int, divisor_t, coordDivisors_t, Coord, unsigned int)
==14905== API calls:
Time(%)  Time     Calls    Avg        Min        Max       Name
52.35%   502.32ms   2    251.16ms   223.84ms   278.49ms   cudaMemcpy
24.08%   231.04ms   3    77.012ms   491.74us   229.92ms   cudaFree
23.02%   220.89ms   2    110.44ms   580.53us   220.30ms   cudaMalloc
0.41%    3.9122ms   664  5.8910us   278ns      232.60us   cuDeviceGetAttribute
0.07%    625.78us   8    78.222us   67.720us   97.719us   cuDeviceTotalMem
0.05%    483.86us   8    60.482us   50.428us   106.37us   cuDeviceGetName
0.01%    81.052us   4    20.263us   13.612us   36.908us   cudaLaunch
0.01%    79.330us   1    79.330us   79.330us   79.330us   cudaGetDeviceProperties
0.00%    24.473us   56   437ns      300ns      1.0670us   cudaSetupArgument
0.00%    20.558us   7    2.9360us   538ns      9.2310us   cudaGetDevice
0.00%    14.040us   4    3.5100us   2.0570us   7.1580us   cudaFuncSetCacheConfig
0.00%    7.4470us   1    7.4470us   7.4470us   7.4470us   cudaDeviceSynchronize
0.00%    4.9390us   12   411ns      283ns      597ns      cuDeviceGet
0.00%    3.7660us   3    1.2550us   417ns      2.5800us   cuDeviceGetCount
0.00%    3.2650us   4    816ns      568ns      1.4930us   cudaConfigureCall
0.00%    2.6450us   4    661ns      314ns      1.3870us   cudaPeekAtLastError
0.00%    1.9140us   4    478ns      402ns      548ns      cudaGetLastError
0.00%    1.1380us   1    1.1380us   1.1380us   1.1380us   cuDriverGetVersion
0.00%    1.1310us   1    1.1310us   1.1310us   1.1310us   cuInit
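
The profile makes the bottleneck clear: the two cudaMemcpy transfers account for well over 90% of the GPU time, while the FFT kernels themselves take only a few milliseconds each. For a workload like this, minimizing host-device data movement is the first optimization to consider.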

Summary

In this post, I introduced the computation model of R with GPU acceleration, showed how to use the CUDA ecosystem to accelerate your R applications, and demonstrated how to profile GPU performance from within R.

Try CUDA with your R applications now, and have fun!


About Patric Zhao

Patric is a senior GPU architect in the HPC field at NVIDIA. He has seven years of experience developing scientific and engineering applications and is experienced in parallelization, performance modeling and architecture-specific tuning. Patric is currently working on molecular dynamics and big data projects. Before joining NVIDIA, Patric worked on distributed processing and algorithm optimization for EDA software at Cadence. Follow @PatricZhao on Twitter
  • Patrick Reily

    Thanks Patric, great overview!
    We work in a Windows environment and have found very few compiled R applications that can take advantage of CUDA. The matrix operations we do are highly divisible (perfect for CUDA), but most of what we do multi-threads over CPU cores.
    Can you recommend an R solution that will allow us to parallel process with CUDA in a Windows environment?
    Thanks
    Pat

  • Christian Sagmeister

    Hi Patric,

    Thank you very much for this nice tutorial!
    I have a question especially regarding the FFT.

    I was using your example:

    num <- 4
    set.seed(5)
    z <- complex(real = stats::rnorm(num), imaginary = stats::rnorm(num))
    cpu <- fft(z)
    gpu <- cufft1D(z)
    > sum(cpu)
    [1] -3.363422+6.845763i
    > sum(gpu)
    [1] -3.363422+6.845764i
    It does not look severe, but since I am using FFT on a very precise level, I found that this numerical error is multiplying itself rather severely. I am using CUDA 6.5.12 for my analysis.

    Do you have any suggestion what I could do?