CUDA Pro Tip: How to Call Batched cuBLAS routines from CUDA Fortran

When dealing with small arrays and matrices, one method of exposing parallelism on the GPU is to execute the same cuBLAS call on multiple independent systems simultaneously. While you can do this manually by calling multiple cuBLAS kernels across multiple CUDA streams, batched cuBLAS routines enable such parallelism automatically for certain operations (GEMM, GETRF, GETRI, and TRSM).  In this post I’ll show you how to leverage these batched routines from CUDA Fortran.

The batched cuBLAS functions in the C interface use an array of pointers as one of their arguments, where each pointer in the array points to an independent matrix. This poses a problem for Fortran, which does not allow arrays of pointers. To accommodate this argument, we can make use of the data types declared in the ISO_C_BINDING module, in particular the c_devptr type.  Let’s illustrate this with a code that calls the batched SGETRF cuBLAS routine.

Writing Interfaces to Batched cuBLAS Routines

At the time of writing this post, the batched cuBLAS routines are not in the CUDA Fortran cublas module, so we first need to define the interface to the cublasSgetrfBatched() call:

interface 
  integer(c_int) function &
      cublasSgetrfBatched(h,n,Aarray,lda,ipvt,info,batchSize) &
      bind(c,name='cublasSgetrfBatched') 
    use iso_c_binding 
    use cublas 
    type(cublasHandle), value :: h 
    integer(c_int), value :: n 
    type(c_devptr), device :: Aarray(*) 
    integer(c_int), value :: lda
    integer(c_int), device :: ipvt(*) 
    integer(c_int), device :: info(*) 
    integer(c_int), value :: batchSize 
  end function cublasSgetrfBatched
end interface

The arguments of cublasSgetrfBatched() are:

  • h, the cuBLAS handle obtained from cublasCreate();
  • n, the size of the n×n matrices;
  • Aarray, the array of pointers to the matrices;
  • lda, the leading dimension of the matrices;
  • ipvt, an output array containing pivot information;
  • info, an output array containing the factorization status of each matrix;
  • and batchSize, the number of independent n×n matrices on which to perform factorization.

Of particular importance in this interface is the use of the variable attributes device and value. The cuBLAS handle h, the matrix size n, the leading dimension lda, and the number of matrices batchSize are all scalar parameters declared in host memory and are therefore passed by value. The integer arrays ipvt and info are output arrays allocated on the device using the device attribute. Of special note is the declaration of the array of pointers Aarray:

  type(c_devptr), device :: Aarray(*)

which is a device array of device pointers.  Contrast this to a declaration such as:

  type(c_devptr) :: Aarray(*)

which is a host array of device pointers. The batched cuBLAS calls need the former declaration because cuBLAS distributes matrices to kernel threads on the device.

Calling Batched cuBLAS from Host Code

Having defined our interface to cublasSgetrfBatched(), we can now turn to the host code that calls this function.  The complete code follows.

program testgetrfBatched
  use cudafor 
  use cublas 
  use iso_c_binding  
  implicit none 
  integer, parameter :: n=2, nbatch=3, lda=n
  real :: a(n,n,nbatch) 
  real, device :: a_d(n,n,nbatch)
  type(c_devptr) :: devPtrA(nbatch) 
  type(c_devptr), device :: devPtrA_d(nbatch) 
  type(cublasHandle) :: h1 
  integer  :: ipvt(n*nbatch), info(nbatch) 
  integer, device  :: ipvt_d(n*nbatch), info_d(nbatch)
  integer :: i, k, istat

  interface 
    integer(c_int) function cublasSgetrfBatched(h, n, Aarray, lda, ipvt, info, batchSize) &
      bind(c,name='cublasSgetrfBatched')
      use iso_c_binding 
      use cublas 
      type(cublasHandle), value :: h 
      integer(c_int), value :: n 
      type(c_devptr), device :: Aarray(*) 
      integer(c_int), value :: lda
      integer(c_int), device :: ipvt(*)
      integer(c_int), device :: info(*)
      integer(c_int), value :: batchSize
    end function cublasSgetrfBatched
  end interface

  ! initialize arrays and transfer to device
  do k = 1, nbatch
    a(1,1,k) = 6.0*k
    a(2,1,k) = 4.0*k
    a(1,2,k) = 3.0*k
    a(2,2,k) = 3.0*k
  end do
  a_d = a

  write(*,"(/,'Input:')")
  do k = 1, nbatch
    write(*,"(2x,'Matrix: ', i0)") k
    do i=1, n
      write(*,*) a(i,:,k)
    enddo
  enddo

  ! build an array of pointers

  do k = 1, nbatch
    devPtrA(k) = c_devloc(a_d(1,1,k))
  end do
  devPtrA_d = devPtrA

  ! create handle, call cublasSgetrfBatched, and destroy handle

  istat = cublasCreate(h1)
  if (istat /= CUBLAS_STATUS_SUCCESS) write(*,*) 'cublasCreate failed'
  istat = cublasSgetrfBatched(h1, n, devPtrA_d, lda, ipvt_d, info_d, nbatch)
  if (istat /= CUBLAS_STATUS_SUCCESS) write(*,*) 'cublasSgetrfBatched failed: ', istat
  istat = cublasDestroy(h1)
  if (istat /= CUBLAS_STATUS_SUCCESS) write(*,*) 'cublasDestroy failed'

  a = a_d
  write(*,"(/, 'LU Factorization:')")
  do k = 1, nbatch
    write(*,"(2x,'Matrix: ', i0)") k
    do i = 1, n
      write(*,*) a(i,:,k)
    enddo
  enddo
end program testgetrfBatched
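
If you want to build and run the example yourself, a compile line along the following lines should work with the PGI CUDA Fortran compiler (the source-file name is hypothetical; CUDA Fortran is enabled automatically for .cuf files):

  pgfortran -Mcuda -o testgetrfBatched testgetrfBatched.cuf -lcublas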

While most of this code is straightforward, the code that assembles the device array of device pointers, devPtrA_d, merits some discussion.

  do k = 1, nbatch
    devPtrA(k) = c_devloc(a_d(1,1,k))
  end do
  devPtrA_d = devPtrA

The host function c_devloc() determines the device address of each of the nbatch matrices stored in the three-dimensional array a_d. In general the matrices can be distributed across multiple variables, but we use a single three-dimensional array for convenience. The results from c_devloc() are first stored in devPtrA, a host array of device pointers, and then transferred to the device array devPtrA_d, which is used a few lines later in the call to cublasSgetrfBatched():

  istat = cublasSgetrfBatched(h1, n, devPtrA_d, lda, ipvt_d, info_d, nbatch)
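
The example above does not inspect the output arrays, but a real application would. For cublasSgetrfBatched(), info(k) is zero when the factorization of matrix k succeeds and, per the cuBLAS documentation, a positive value reports a zero pivot; the pivot indices are packed into ipvt with n entries per matrix. A minimal sketch of such a check, reusing the host arrays ipvt and info already declared in the program above, might look like this:

  ! copy the status and pivot arrays back to the host and inspect them
  info = info_d
  ipvt = ipvt_d
  do k = 1, nbatch
     if (info(k) /= 0) then
        ! nonzero info reports a zero pivot, i.e. a singular matrix
        write(*,*) 'Factorization of matrix ', k, ' failed, info = ', info(k)
     else
        ! pivot indices for matrix k occupy ipvt((k-1)*n+1 : k*n)
        write(*,*) 'Matrix ', k, ' pivots: ', ipvt((k-1)*n+1:k*n)
     end if
  end do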

Other Batched Routines

We can use a similar approach for the other batched cuBLAS routines: cublas*getriBatched(), cublas*gemmBatched(), and cublas*trsmBatched(). Note that in cublas*gemmBatched() and cublas*trsmBatched(), the parameters alpha and beta are scalar values passed by reference, which can reside in either host or device memory depending on the cuBLAS pointer mode. In such cases it is helpful to write a generic interface such as the following.

  interface cublasSgemmBatched
     integer(c_int) function &
       cublasSgemmBatched_hpm(h, transa, transb, &
         m, n, k, alpha, Aarray, lda, Barray, ldb, &
         beta, Carray, ldc, batchCount) &
         bind(c,name='cublasSgemmBatched') 
       use iso_c_binding 
       use cublas 
       type(cublasHandle), value :: h 
       integer(c_int), value :: transa 
       integer(c_int), value :: transb
       integer(c_int), value :: m, n, k
       real :: alpha 
       type(c_devptr), device :: Aarray(*) 
       integer(c_int), value :: lda
       type(c_devptr), device :: Barray(*) 
       integer(c_int), value :: ldb
       real :: beta
       type(c_devptr), device :: Carray(*)
       integer(c_int), value :: ldc
       integer(c_int), value :: batchCount 
     end function cublasSgemmBatched_hpm

     integer(c_int) function &
       cublasSgemmBatched_dpm(h, transa, transb, &
         m, n, k, alpha, Aarray, lda, Barray, ldb, &
         beta, Carray, ldc, batchCount) &
         bind(c,name='cublasSgemmBatched') 
       use iso_c_binding 
       use cublas 
       type(cublasHandle), value :: h 
       integer(c_int), value :: transa 
       integer(c_int), value :: transb
       integer(c_int), value :: m, n, k
       real, device :: alpha 
       type(c_devptr), device :: Aarray(*) 
       integer(c_int), value :: lda
       type(c_devptr), device :: Barray(*) 
       integer(c_int), value :: ldb
       real, device :: beta
       type(c_devptr), device :: Carray(*)
       integer(c_int), value :: ldc
       integer(c_int), value :: batchCount 
     end function cublasSgemmBatched_dpm
  end interface cublasSgemmBatched

In this interface the *_hpm and *_dpm routines are for the host and device pointer modes, respectively; their interfaces differ only by the absence or presence of the device variable attribute on the alpha and beta arguments. You could further write a wrapper function that calls cublasSetPointerMode() appropriately before executing the batched cuBLAS routine.
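
As a rough sketch, and assuming the cublas module provides cublasGetPointerMode(), cublasSetPointerMode(), and the CUBLAS_POINTER_MODE_HOST constant (as recent CUDA Fortran releases do), such a wrapper for the host pointer mode might look like the following; the name sgemmBatchedHost is hypothetical.

  ! force host pointer mode for the batched call, then restore the previous mode
  integer function sgemmBatchedHost(h, transa, transb, m, n, k, alpha, &
       Aarray, lda, Barray, ldb, beta, Carray, ldc, batchCount) result(istat)
    use iso_c_binding
    use cublas
    ! assumes the cublasSgemmBatched generic interface above is accessible here,
    ! for example because both live in the same module
    type(cublasHandle) :: h
    integer(c_int) :: transa, transb, m, n, k, lda, ldb, ldc, batchCount
    real :: alpha, beta                        ! host-resident scalars
    type(c_devptr), device :: Aarray(*), Barray(*), Carray(*)
    integer :: oldMode, tmp

    tmp   = cublasGetPointerMode(h, oldMode)   ! remember the current mode
    tmp   = cublasSetPointerMode(h, CUBLAS_POINTER_MODE_HOST)
    istat = cublasSgemmBatched(h, transa, transb, m, n, k, alpha, &
         Aarray, lda, Barray, ldb, beta, Carray, ldc, batchCount)
    tmp   = cublasSetPointerMode(h, oldMode)   ! restore the previous mode
  end function sgemmBatchedHost

Restoring the previous pointer mode keeps the wrapper from surprising other cuBLAS calls that share the same handle.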

Check out more Parallel Forall posts about CUDA Fortran, and more CUDA Pro Tips.

∥∀

About Greg Ruetsch

Greg Ruetsch is a Senior Applied Engineer at NVIDIA, where he works on CUDA Fortran and performance optimization of HPC codes. He holds a Bachelor’s degree in mechanical and aerospace engineering from Rutgers University and a Ph.D. in applied mathematics from Brown University. Prior to joining NVIDIA he held research positions at Stanford University’s Center for Turbulence Research and Sun Microsystems Laboratories.
  • Austin H

    Great article! Just one question; is this possible without PGI’s CUDA Fortran? I’ve been trying to implement these batched routines from Fortran, but I don’t have access to the PGI compiler. I’ve been able to make some progress with Fortran interfaces to CUDA/cuBLAS library routines, but I can’t get the batched routines to work.

    • Greg Ruetsch

      To use batched CUBLAS routines from regular Fortran, you’d need to write CUDA C code to manage the array of device pointers and then write some C stubs callable from Fortran.