
CUDA Pro Tip: How to Call Batched cuBLAS Routines from CUDA Fortran

When dealing with small arrays and matrices, one way to expose parallelism on the GPU is to execute the same cuBLAS call on multiple independent systems simultaneously. While you can do this manually by issuing multiple cuBLAS calls across multiple CUDA streams, batched cuBLAS routines enable such parallelism automatically for certain operations (GEMM, GETRF, GETRI, and TRSM). In this post I’ll show you how to leverage these batched routines from CUDA Fortran.

The batched cuBLAS functions in the C interface take an array of pointers as one of their arguments, where each pointer in the array points to an independent matrix. This poses a problem for Fortran, which does not allow arrays of pointers. To accommodate this argument, we can make use of the data types declared in the ISO_C_BINDING module, in particular the c_devptr type. Let’s illustrate this with code that calls the batched SGETRF cuBLAS routine.
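To make the array-of-pointers layout concrete, here is a minimal host-only sketch in C. The helper name build_pointer_array and the choice of one contiguous buffer are assumptions made for illustration; nothing here calls cuBLAS. In the real batched API, both the matrices and the pointer array itself reside in device memory.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of cuBLAS): point Aarray[b] at the b-th
   n x n matrix stored inside one contiguous buffer, so that a batched
   routine can be handed a single array of pointers. */
void build_pointer_array(float *storage, int n, int batchSize, float **Aarray)
{
    assert(storage != NULL && Aarray != NULL && n > 0 && batchSize >= 0);
    for (int b = 0; b < batchSize; ++b)
        Aarray[b] = storage + (size_t)b * n * n;  /* start of matrix b */
}
```

Each entry of Aarray is an ordinary address, which is the role c_devptr plays on the Fortran side.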

Writing Interfaces to Batched cuBLAS Routines

At the time of writing this post, the batched cuBLAS routines are not in the CUDA Fortran cublas module, so we first need to define the interface to the cublasSgetrfBatched() call:

interface 
  integer(c_int) function &
      cublasSgetrfBatched(h,n,Aarray,lda,ipvt,info,batchSize) &
      bind(c,name='cublasSgetrfBatched') 
    use iso_c_binding 
    use cublas 
    type(cublasHandle), value :: h 
    integer(c_int), value :: n 
    type(c_devptr), device :: Aarray(*) 
    integer(c_int), value :: lda
    integer(c_int), device :: ipvt(*) 
    integer(c_int), device :: info(*) 
    integer(c_int), value :: batchSize 
  end function cublasSgetrfBatched
end interface
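As a rough guide to what each argument carries, the following is a host-side reference sketch in C of a batched LU factorization with partial pivoting. It is not the cuBLAS implementation: the function name sgetrf_batched_ref is hypothetical, and the one-based pivot indices and the meaning of info are modeled on the LAPACK SGETRF convention, which is assumed here to match the batched routine.

```c
#include <assert.h>

static float absf(float v) { return v < 0.0f ? -v : v; }

/* Hypothetical host reference, not the cuBLAS code path.
   Factors each n x n column-major matrix in place as P*A = L*U.
   ipvt holds n one-based pivot rows per matrix; info[b] is 0 on success,
   or k (one-based) if the k-th pivot is exactly zero. */
void sgetrf_batched_ref(int n, float *const Aarray[], int lda,
                        int *ipvt, int *info, int batchSize)
{
    assert(n > 0 && lda >= n && batchSize >= 0);
    for (int b = 0; b < batchSize; ++b) {
        float *A = Aarray[b];
        int *piv = ipvt + b * n;
        info[b] = 0;
        for (int k = 0; k < n; ++k) {
            /* Partial pivoting: largest magnitude in column k, rows k..n-1. */
            int p = k;
            for (int i = k + 1; i < n; ++i)
                if (absf(A[i + k * lda]) > absf(A[p + k * lda]))
                    p = i;
            piv[k] = p + 1;
            if (A[p + k * lda] == 0.0f) {
                if (info[b] == 0) info[b] = k + 1;  /* record first zero pivot */
                continue;
            }
            if (p != k) {  /* swap rows k and p across all columns */
                for (int j = 0; j < n; ++j) {
                    float t = A[k + j * lda];
                    A[k + j * lda] = A[p + j * lda];
                    A[p + j * lda] = t;
                }
            }
            for (int i = k + 1; i < n; ++i) {
                A[i + k * lda] /= A[k + k * lda];  /* multiplier, stored in L */
                for (int j = k + 1; j < n; ++j)
                    A[i + j * lda] -= A[i + k * lda] * A[k + j * lda];
            }
        }
    }
}
```

The loop over b is the part the GPU routine can run in parallel, since every matrix in the batch is factored independently.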

The arguments of cublasSgetrfBatched() are:

Six Ways to SAXPY

This post is a GPU program chrestomathy. What’s a chrestomathy, you ask?

In computer programming, a program chrestomathy is a collection of similar programs written in various programming languages, for the purpose of demonstrating differences in syntax, semantics and idioms for each language. [Wikipedia]

There are several good examples of program chrestomathies on the web, including Rosetta Code and NBabel, which demonstrates gravitational N-body simulation in multiple programming languages. In this post I demonstrate six ways to implement a simple SAXPY computation on the CUDA platform. Why is this interesting? Because it demonstrates the breadth of options you have today for programming NVIDIA GPUs, and it covers the three main approaches to GPU computing: GPU-accelerated libraries, GPU compiler directives, and GPU programming languages.

SAXPY stands for “Single-Precision A·X Plus Y”. It is a function in the standard Basic Linear Algebra Subprograms (BLAS) library. SAXPY is a combination of scalar multiplication and vector addition, and it’s very simple: it takes as input two vectors of 32-bit floats X and Y with N elements each, and a scalar value A. It multiplies each element X[i] by A and adds the result to Y[i]. A simple C implementation looks like this.

void saxpy(int n, float a, float *x, float *y)
{
  for (int i = 0; i < n; ++i)
      y[i] = a*x[i] + y[i];
}

// Perform SAXPY on 1M elements
saxpy(1<<20, 2.0, x, y);
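
The call above assumes that x and y already exist. A minimal sketch of the surrounding host code might look like the following; the driver name saxpy_demo, the fill values 1.0 and 2.0, and the mismatch count it returns are assumptions added for illustration.

```c
#include <assert.h>
#include <stdlib.h>

void saxpy(int n, float a, const float *x, float *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

/* Hypothetical driver: allocates x and y, runs SAXPY with a = 2, and returns
   the number of elements that differ from the expected value 2*1 + 2 = 4,
   or -1 if allocation fails. */
int saxpy_demo(int n)
{
    assert(n > 0);
    float *x = malloc((size_t)n * sizeof *x);
    float *y = malloc((size_t)n * sizeof *y);
    int mismatches = 0;

    if (x == NULL || y == NULL) {
        free(x);
        free(y);
        return -1;
    }
    for (int i = 0; i < n; ++i) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }

    saxpy(n, 2.0f, x, y);

    for (int i = 0; i < n; ++i)
        if (y[i] != 4.0f)
            ++mismatches;

    free(x);
    free(y);
    return mismatches;
}
```

Calling saxpy_demo(1 << 20) exercises the same 1M-element case as the snippet above.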
