When dealing with small arrays and matrices, one method of exposing parallelism on the GPU is to execute the same cuBLAS call on multiple independent systems simultaneously. While you can do this manually by calling multiple cuBLAS kernels across multiple CUDA streams, batched cuBLAS routines enable such parallelism automatically for certain operations (GEMM, GETRF, GETRI, and TRSM). In this post I’ll show you how to leverage these batched routines from CUDA Fortran.
The batched cuBLAS functions in the C interface take an array of pointers as one of their arguments, where each pointer in the array points to an independent matrix. This poses a problem for Fortran, which does not allow arrays of pointers. To accommodate this argument, we can make use of the data types declared in the ISO_C_BINDING module, in particular the c_devptr type. Let's illustrate this with a code that calls the batched SGETRF cuBLAS routine.
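As a sketch of the idea, the c_devloc() function returns a type(c_devptr) value holding the device address of its argument, so a batch of matrices stored in a 3D device array can be exposed as an array of device pointers. The array names and sizes below are illustrative assumptions, not taken from the post:

```fortran
program build_devptr_array
  use iso_c_binding
  use cudafor
  implicit none
  integer, parameter :: n = 32, batchSize = 100   ! illustrative sizes
  real, device :: A_d(n, n, batchSize)            ! batch of n x n matrices on the device
  type(c_devptr) :: Aarray(batchSize)             ! pointer array built on the host ...
  type(c_devptr), device :: Aarray_d(batchSize)   ! ... then copied to the device
  integer :: i

  ! Each element holds the device address of one matrix in the batch
  do i = 1, batchSize
     Aarray(i) = c_devloc(A_d(1, 1, i))
  end do
  Aarray_d = Aarray
end program build_devptr_array
```

Building the pointer array on the host and assigning it to a device array is one convenient pattern; the device-resident copy is what gets passed to the batched routine.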
Writing Interfaces to Batched cuBLAS Routines
At the time of writing this post, the batched cuBLAS routines are not in the CUDA Fortran cublas module, so we first need to define an interface to the routine we want to call:
interface
   integer(c_int) function &
        cublasSgetrfBatched(h, n, Aarray, lda, ipvt, info, batchSize) &
        bind(c, name='cublasSgetrfBatched')
     use iso_c_binding
     use cublas
     type(cublasHandle), value :: h
     integer(c_int), value :: n
     type(c_devptr), device :: Aarray(*)
     integer(c_int), value :: lda
     integer(c_int), device :: ipvt(*)
     integer(c_int), device :: info(*)
     integer(c_int), value :: batchSize
   end function cublasSgetrfBatched
end interface
The arguments of cublasSgetrfBatched() are:

- h, the cuBLAS library handle
- n, the order of each matrix in the batch
- Aarray, a device array of c_devptr values, each pointing to an independent matrix on the device
- lda, the leading dimension of each matrix
- ipvt, a device array that receives the pivot indices (n entries per matrix)
- info, a device array that receives the factorization status for each matrix
- batchSize, the number of matrices in the batch
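Putting the pieces together, a minimal driver might look like the following. This is a hedged sketch, not the post's own example: the matrix order, batch size, and random initialization are illustrative assumptions, while the interface block is the one defined above and cublasCreate/cublasDestroy come from the cublas module:

```fortran
program sgetrf_batched_example
  use iso_c_binding
  use cudafor
  use cublas
  implicit none

  ! Interface to the batched routine, as defined earlier in the post
  interface
     integer(c_int) function &
          cublasSgetrfBatched(h, n, Aarray, lda, ipvt, info, batchSize) &
          bind(c, name='cublasSgetrfBatched')
       use iso_c_binding
       use cublas
       type(cublasHandle), value :: h
       integer(c_int), value :: n
       type(c_devptr), device :: Aarray(*)
       integer(c_int), value :: lda
       integer(c_int), device :: ipvt(*)
       integer(c_int), device :: info(*)
       integer(c_int), value :: batchSize
     end function cublasSgetrfBatched
  end interface

  integer, parameter :: n = 32, batchSize = 100   ! illustrative sizes
  real :: A(n, n, batchSize)
  real, device :: A_d(n, n, batchSize)            ! batch of matrices on the device
  type(c_devptr) :: Aarray(batchSize)
  type(c_devptr), device :: Aarray_d(batchSize)
  integer(c_int), device :: ipvt_d(n*batchSize), info_d(batchSize)
  integer :: info(batchSize), i, istat
  type(cublasHandle) :: h

  istat = cublasCreate(h)

  call random_number(A)   ! fill the batch with random matrices
  A_d = A

  ! Build the array of device pointers, one per matrix in the batch
  do i = 1, batchSize
     Aarray(i) = c_devloc(A_d(1, 1, i))
  end do
  Aarray_d = Aarray

  ! A single call factors all batchSize matrices
  istat = cublasSgetrfBatched(h, n, Aarray_d, n, ipvt_d, info_d, batchSize)

  info = info_d           ! per-matrix status: 0 means success
  if (any(info /= 0)) print *, 'some factorizations failed'

  istat = cublasDestroy(h)
end program sgetrf_batched_example
```

Note that the pivot array holds n entries per matrix, so it is sized n*batchSize, and the info array reports success or failure separately for each matrix in the batch.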