CUDA Pro Tip: How to Call Batched cuBLAS routines from CUDA Fortran

When dealing with small arrays and matrices, one method of exposing parallelism on the GPU is to execute the same cuBLAS call on multiple independent systems simultaneously. While you can do this manually by calling multiple cuBLAS kernels across multiple CUDA streams, batched cuBLAS routines enable such parallelism automatically for certain operations (GEMM, GETRF, GETRI, and TRSM).  In this post I’ll show you how to leverage these batched routines from CUDA Fortran.

The batched cuBLAS functions in the C interface use an array of pointers as one of their arguments, where each pointer in the array points to an independent matrix. This poses a problem for Fortran, which does not allow arrays of pointers. To accommodate this argument, we can make use of the data types declared in the ISO_C_BINDING module, in particular the c_devptr type.  Let’s illustrate this with a code that calls the batched SGETRF cuBLAS routine.

Writing Interfaces to Batched cuBLAS Routines

At the time of writing this post, the batched cuBLAS routines are not in the CUDA Fortran cublas module, so we first need to define the interface to the cublasSgetrfBatched() call:

interface 
  integer(c_int) function &
      cublasSgetrfBatched(h,n,Aarray,lda,ipvt,info,batchSize) &
      bind(c,name='cublasSgetrfBatched') 
    use iso_c_binding 
    use cublas 
    type(cublasHandle), value :: h 
    integer(c_int), value :: n 
    type(c_devptr), device :: Aarray(*) 
    integer(c_int), value :: lda
    integer(c_int), device :: ipvt(*) 
    integer(c_int), device :: info(*) 
    integer(c_int), value :: batchSize 
  end function cublasSgetrfBatched
end interface

The arguments of cublasSgetrfBatched() are:

  • h, the cuBLAS handle obtained from cublasCreate();
  • n, the size of the n×n matrices;
  • Aarray, the array of pointers to the matrices;
  • lda, the leading dimension of the matrices;
  • ipvt, an output array containing pivot information;
  • info, an output array containing the factorization status of each matrix;
  • and batchSize, the number of independent n×n matrices on which to perform factorization.

Of particular importance in this interface is the use of the variable attributes device and value. The cuBLAS handle h, the matrix size n, the leading dimension lda, and the number of matrices batchSize are all scalar parameters declared in host memory and are therefore passed by value. The integer arrays ipvt and info are output arrays allocated on the device using the device attribute. Of special note is the declaration of the array of pointers Aarray:

  type(c_devptr), device :: Aarray(*)

which is a device array of device pointers.  Contrast this to a declaration such as:

  type(c_devptr) :: Aarray(*)

which is a host array of device pointers. The batched cuBLAS calls need the former declaration because cuBLAS distributes matrices to kernel threads on the device.

Calling Batched cuBLAS from Host Code

Having defined our interface to cublasSgetrfBatched(), we can now turn to the host code that calls this function.  The complete code follows.

program testgetrfBatched
  use cudafor 
  use cublas 
  use iso_c_binding  
  implicit none 
  integer, parameter :: n=2, nbatch=3, lda=n
  real :: a(n,n,nbatch) 
  real, device :: a_d(n,n,nbatch)
  type(c_devptr) :: devPtrA(nbatch) 
  type(c_devptr), device :: devPtrA_d(nbatch) 
  type(cublasHandle) :: h1 
  integer  :: ipvt(n*nbatch), info(nbatch) 
  integer, device  :: ipvt_d(n*nbatch), info_d(nbatch)
  integer :: i, k, istat

  interface 
    integer(c_int) function cublasSgetrfBatched(h, n, Aarray, lda, ipvt, info, batchSize) &
      bind(c,name='cublasSgetrfBatched')
      use iso_c_binding 
      use cublas 
      type(cublasHandle), value :: h 
      integer(c_int), value :: n 
      type(c_devptr), device :: Aarray(*) 
      integer(c_int), value :: lda
      integer(c_int), device :: ipvt(*)
      integer(c_int), device :: info(*)
      integer(c_int), value :: batchSize
    end function cublasSgetrfBatched
  end interface

  ! initialize arrays and transfer to device
  do k = 1, nbatch
    a(1,1,k) = 6.0*k
    a(2,1,k) = 4.0*k
    a(1,2,k) = 3.0*k
    a(2,2,k) = 3.0*k
  end do
  a_d = a

  write(*,"(/,'Input:')")
  do k = 1, nbatch
    write(*,"(2x,'Matrix: ', i0)") k
    do i=1, n
      write(*,*) a(i,:,k)
    enddo
  enddo

  ! build an array of pointers

  do k = 1, nbatch
    devPtrA(k) = c_devloc(a_d(1,1,k))
  end do
  devPtrA_d = devPtrA

  ! create handle, call cublasSgetrfBatched, and destroy handle

  istat = cublasCreate(h1)
  if (istat /= CUBLAS_STATUS_SUCCESS) write(*,*) 'cublasCreate failed'
  istat = cublasSgetrfBatched(h1, n, devPtrA_d, lda, ipvt_d, info_d, nbatch)
  if (istat /= CUBLAS_STATUS_SUCCESS) write(*,*) 'cublasSgetrfBatched failed: ', istat
  istat = cublasDestroy(h1)
  if (istat /= CUBLAS_STATUS_SUCCESS) write(*,*) 'cublasDestroy failed'

  a = a_d
  write(*,"(/, 'LU Factorization:')")
  do k = 1, nbatch
    write(*,"(2x,'Matrix: ', i0)") k
    do i = 1, n
      write(*,*) a(i,:,k)
    enddo
  enddo
end program testgetrfBatched
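
If you want to build and run the example yourself, a compile line along the following lines should work with the PGI CUDA Fortran compiler (the source-file name is hypothetical; CUDA Fortran is enabled automatically for .cuf files):

  pgfortran -Mcuda -o testgetrfBatched testgetrfBatched.cuf -lcublas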

While most of this code is straightforward, the code that assembles the device array of device pointers, devPtrA_d, merits some discussion.

  do k = 1, nbatch
    devPtrA(k) = c_devloc(a_d(1,1,k))
  end do
  devPtrA_d = devPtrA

The host function c_devloc() determines the device address of each of the nbatch matrices stored in the three-dimensional array a_d. In general the matrices can be distributed across multiple variables, but we use a single three-dimensional array for convenience. The results from c_devloc() are first stored in devPtrA, a host array of device pointers, and then transferred to the device array devPtrA_d, which is used a few lines later in the call to cublasSgetrfBatched():

  istat = cublasSgetrfBatched(h1, n, devPtrA_d, lda, ipvt_d, info_d, nbatch)
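
The example above does not inspect the output arrays, but a real application would. For cublasSgetrfBatched(), info(k) is zero when the factorization of matrix k succeeds and, per the cuBLAS documentation, a positive value reports a zero pivot; the pivot indices are packed into ipvt with n entries per matrix. A minimal sketch of such a check, reusing the host arrays ipvt and info already declared in the program above, might look like this:

  ! copy the status and pivot arrays back to the host and inspect them
  info = info_d
  ipvt = ipvt_d
  do k = 1, nbatch
     if (info(k) /= 0) then
        ! nonzero info reports a zero pivot, i.e. a singular matrix
        write(*,*) 'Factorization of matrix ', k, ' failed, info = ', info(k)
     else
        ! pivot indices for matrix k occupy ipvt((k-1)*n+1 : k*n)
        write(*,*) 'Matrix ', k, ' pivots: ', ipvt((k-1)*n+1:k*n)
     end if
  end do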

Other Batched Routines

We can use a similar approach for the other batched cuBLAS routines: cublas*getriBatched(), cublas*gemmBatched(), and cublas*trsmBatched(). Note that in cublas*gemmBatched() and cublas*trsmBatched(), the parameters alpha and beta are scalar values passed by reference, which can reside in either host or device memory depending on the cuBLAS pointer mode. In such cases it is helpful to write a generic interface such as the following.

  interface cublasSgemmBatched
     integer(c_int) function &
       cublasSgemmBatched_hpm(h, transa, transb, &
         m, n, k, alpha, Aarray, lda, Barray, ldb, &
         beta, Carray, ldc, batchCount) &
         bind(c,name='cublasSgemmBatched') 
       use iso_c_binding 
       use cublas 
       type(cublasHandle), value :: h 
       integer(c_int), value :: transa 
       integer(c_int), value :: transb
       integer(c_int), value :: m, n, k
       real :: alpha 
       type(c_devptr), device :: Aarray(*) 
       integer(c_int), value :: lda
       type(c_devptr), device :: Barray(*) 
       integer(c_int), value :: ldb
       real :: beta
       type(c_devptr), device :: Carray(*)
       integer(c_int), value :: ldc
       integer(c_int), value :: batchCount 
     end function cublasSgemmBatched_hpm

     integer(c_int) function &
       cublasSgemmBatched_dpm(h, transa, transb, &
         m, n, k, alpha, Aarray, lda, Barray, ldb, &
         beta, Carray, ldc, batchCount) &
         bind(c,name='cublasSgemmBatched') 
       use iso_c_binding 
       use cublas 
       type(cublasHandle), value :: h 
       integer(c_int), value :: transa 
       integer(c_int), value :: transb
       integer(c_int), value :: m, n, k
       real, device :: alpha 
       type(c_devptr), device :: Aarray(*) 
       integer(c_int), value :: lda
       type(c_devptr), device :: Barray(*) 
       integer(c_int), value :: ldb
       real, device :: beta
       type(c_devptr), device :: Carray(*)
       integer(c_int), value :: ldc
       integer(c_int), value :: batchCount 
     end function cublasSgemmBatched_dpm
  end interface cublasSgemmBatched

In this interface the *_hpm and *_dpm routines are for the host and device pointer modes, respectively; their interfaces differ only by the absence or presence of the device variable attribute on the alpha and beta arguments. You could further write a wrapper function that calls cublasSetPointerMode() appropriately before executing the batched cuBLAS routine.
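
As a rough sketch, and assuming the cublas module provides cublasGetPointerMode(), cublasSetPointerMode(), and the CUBLAS_POINTER_MODE_HOST constant (as recent CUDA Fortran releases do), such a wrapper for the host pointer mode might look like the following; the name sgemmBatchedHost is hypothetical.

  ! force host pointer mode for the batched call, then restore the previous mode
  integer function sgemmBatchedHost(h, transa, transb, m, n, k, alpha, &
       Aarray, lda, Barray, ldb, beta, Carray, ldc, batchCount) result(istat)
    use iso_c_binding
    use cublas
    ! assumes the cublasSgemmBatched generic interface above is accessible here,
    ! for example because both live in the same module
    type(cublasHandle) :: h
    integer(c_int) :: transa, transb, m, n, k, lda, ldb, ldc, batchCount
    real :: alpha, beta                        ! host-resident scalars
    type(c_devptr), device :: Aarray(*), Barray(*), Carray(*)
    integer :: oldMode, tmp

    tmp   = cublasGetPointerMode(h, oldMode)   ! remember the current mode
    tmp   = cublasSetPointerMode(h, CUBLAS_POINTER_MODE_HOST)
    istat = cublasSgemmBatched(h, transa, transb, m, n, k, alpha, &
         Aarray, lda, Barray, ldb, beta, Carray, ldc, batchCount)
    tmp   = cublasSetPointerMode(h, oldMode)   ! restore the previous mode
  end function sgemmBatchedHost

Restoring the previous pointer mode keeps the wrapper from surprising other cuBLAS calls that share the same handle.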

Check out more Parallel Forall posts about CUDA Fortran, and more CUDA Pro Tips.

∥∀

About Greg Ruetsch

Greg Ruetsch is a Senior Applied Engineer at NVIDIA, where he works on CUDA Fortran and performance optimization of HPC codes. He holds a Bachelor’s degree in mechanical and aerospace engineering from Rutgers University and a Ph.D. in applied mathematics from Brown University. Prior to joining NVIDIA he held research positions at Stanford University’s Center for Turbulence Research and Sun Microsystems Laboratories.
  • Austin H

    Great article! Just one question; is this possible without PGI’s CUDA Fortran? I’ve been trying to implement these batched routines from Fortran, but I don’t have access to the PGI compiler. I’ve been able to make some progress with Fortran interfaces to CUDA/cuBLAS library routines, but I can’t get the batched routines to work.

    • Greg Ruetsch

      To use batched CUBLAS routines from regular Fortran, you’d need to write CUDA C code to manage the array of device pointers and then write some C stubs callable from Fortran.