Peer-to-Peer Multi-GPU Transpose in CUDA Fortran (Book Excerpt)

CUDA Fortran for Scientists and EngineersThis post is an excerpt from Chapter 4 of the book CUDA Fortran for Scientists and Engineers, by Gregory Ruetsch and Massimiliano Fatica. In this excerpt we extend the matrix transpose example from a previous post to operate on a matrix that is distributed across multiple GPUs. The data layout is shown in Figure 1 for an nx × ny = 1024 × 768 element matrix that is distributed amongst four devices. Each device contains a horizontal slice of the input matrix shown in the figure, as well as a horizontal slice of the output matrix. These input matrix slices of 1024 × 192 elements are divided into four tiles containing 256 × 192 elements each, which are referred to as p2pTile in the code. As the name indicates, the p2pTiles are used for peer-to-peer transfers. After a p2pTile has been transferred to the appropriate device if necessary (tiles on the block diagonal do not need to be transferred as the input and output tiles are on the same device), a CUDA transpose kernel launch transposes the elements within the p2pTile using thread blocks that process smaller tiles of 32 × 32 elements.

Device data layout for peer-to-peer transpose with a nx x ny = 1024 x 768 matrix on four devices. Each device holds a 1024 x 192 horizontal slice of input matrix (as well as a 768 x 256 horizontal slice of the output matrix). Each slice of the input matrix is broken into four tiles of 256 x 192 elements, which are used for  peer-to-peer transfers. The CUDA kernel transposes this tile using 48 thread blocks, each of which processes a 32 x 32 tile.
Device data layout for peer-to-peer transpose with a nx x ny = 1024 x 768 matrix on four devices. Each device holds a 1024 x 192 horizontal slice of input matrix (as well as a 768 x 256 horizontal slice of the output matrix). Each slice of the input matrix is broken into four tiles of 256 x 192 elements, which are used for peer-to-peer transfers. The CUDA kernel transposes this tile using 48 thread blocks, each of which processes a 32 x 32 tile.

The full code is available on the website for the CUDA Fortran for Scientists and Engineers textbook [line numbers below refer to the file CUDAFortranCode/chapter4/P2P/transposeP2P.cuf in the source code archive]. In this post we pull in only the relevant parts for our discussion. We start the discussion of the code with the transpose kernel:

  attributes(global) subroutine cudaTranspose( &
       odata, ldo, idata, ldi)
    implicit none
    real, intent(out) :: odata(ldo,*)
    real, intent(in) :: idata(ldi,*)
    integer, value, intent(in) :: ldo, ldi
    real, shared :: tile(cudaTileDim+1, cudaTileDim)
    integer :: x, y, j

    x = (blockIdx%x-1) * cudaTileDim + threadIdx%x
    y = (blockIdx%y-1) * cudaTileDim + threadIdx%y

    do j = 0, cudaTileDim-1, blockRows
       tile(threadIdx%x, threadIdx%y+j) = idata(x,y+j)
    end do

    call syncthreads()

    x = (blockIdx%y-1) * cudaTileDim + threadIdx%x
    y = (blockIdx%x-1) * cudaTileDim + threadIdx%y

    do j = 0, cudaTileDim-1, blockRows
       odata(x,y+j) = tile(threadIdx%y+j, threadIdx%x)          
    end do
  end subroutine cudaTranspose

This transpose is basically the same kernel we developed in our previous post for the single-GPU transpose, with the exception that two additional parameters are passed to the kernel, ldi and ldo, the leading dimensions of the input and output matrices. These parameters are needed because each kernel call transposes a submatrix of each device’s slice of the matrix. One could do without modifying the kernel at all by copying data to and from a temporary array, but such intermediate data transfers would greatly affect performance. Note that the leading dimension parameters are only used in the declaration of the input and output matrices on lines 17 and 18—the rest of the code is identical to the single-GPU code. Most of the host code performs mundane tasks such as getting the number and types of devices (lines 85-94), checking that all devices are peer-to-peer capable and enabling peer-to-peer communication (lines 96-119), verifying that the matrix divides evenly into the various tile sizes (121-140), printing out the various sizes (lines 146-165), and initializing host data and transposing on the host (lines 169-170). Because we want to overlap the execution of the transpose kernel with the data transfer between GPUs, we want to avoid using the default stream for peer-to-peer communication as well as kernel execution. We want each device to have nDevicesnDevices devices, each requiring nDevices streams, we use a two dimensional variable to hold the stream IDs:

allocate(streamID(0:nDevices-1,0:nDevices-1))
  do p = 0, nDevices-1
     istat = cudaSetDevice(p)
     do stage = 0, nDevices-1
        istat = cudaStreamCreate(streamID(p,stage))
     enddo
  enddo

where the first index to streamID corresponds to the particular device the stream is associated with, and the second index refers to the stages of the calculation. The stages of the transpose, enumerated zero to nDevices-1, are organized as follows. In the zeroth stage, each device transposes the submatrix that lies along the block diagonal of the global matrix, which is depicted in the top diagram of Figure 2. This is done first as no peer-to-peer communication is involved, and the kernel execution can overlap data transfers in the first stage.

Stages of the matrix transpose. In stage zero each device transposes the block along the global matrix diagonal which requires no peer-to-peer communication. In stage one blocks from the first sub-diagonal of the input matrix are transferred to the device holding the respective block super-diagonal, after which the transpose is performed on the receiving device. Subsequent stages (such as stage 2) do the same for subsequent sub- and super-diagonals. The wrapping of the diagonals becomes more pronounced for subsequent stages, culminating in the last stage's communication pattern being the reverse of the first stage.
Stages of the matrix transpose. In stage zero each device transposes the block along the global matrix diagonal which requires no peer-to-peer communication. In stage one blocks from the first sub-diagonal of the input matrix are transferred to the device holding the respective block super-diagonal, after which the transpose is performed on the receiving device. Subsequent stages (such as stage 2) do the same for subsequent sub- and super-diagonals. The wrapping of the diagonals becomes more pronounced for subsequent stages, culminating in the last stage’s communication pattern being the reverse of the first stage.

In stage one, data from what is primarily the first block-subdiagonal of the input matrix is sent to the devices which hold the corresponding first block-superdiagonal, as depicted in Figure 2. After the transfer completes, the receiving device performs the transpose. Note that one of the blocks transferred during stage one is not on the subdiagonal, as we wrap the pattern so that all devices both send and receive data during each stage. The following stages do similar operations on additional block sub- and super-diagonals until all of the blocks have been transposed. The wrapping during these stages becomes more pronounced, so that in the final stage the communication pattern is the reverse of the first stage. In using this arrangement, during each stage other than the zeroth, each device sends and receives a block of data, and both of these transfers can overlap if transferred asynchronously since the devices have separate send and receive copy engines. The distributed global matrices are stored using the derived type deviceArray:

! distributed arrays
  type deviceArray
     real, device, allocatable :: v(:,:)
  end type deviceArray

  type (deviceArray), allocatable :: &
       d_idata(:), d_tdata(:), d_rdata(:)  ! (0:nDevices-1)

This same technique was used in the p2pBandwidth code in Chapter 4 of CUDA Fortran for Scientists and Engineers. Instances of this derived type will be host data, but the member v is device data. There are three allocatable array declarations of this derived type on line 74: d_idata for the input data, d_rdata which is a receive buffer used in the transfers, and d_tdata which holds the final transposed data. These variables are allocated by:

  allocate(d_idata(0:nDevices-1),&
       d_tdata(0:nDevices-1), d_rdata(0:nDevices-1))

which represents decomposition of the global array into the horizontal slices depicted in Figure 2. The members of the derived type hold the horizontal slices, which are allocated and initialized by:

  do p = 0, nDevices-1
     istat = cudaSetDevice(p)
     allocate(d_idata(p)%v(nx,p2pTileDimY), &
          d_rdata(p)%v(nx,p2pTileDimY), &
          d_tdata(p)%v(ny,p2pTileDimX))

     yOffset = p*p2pTileDimY
     d_idata(p)%v(:,:) = h_idata(:, &
          yOffset+1:yOffset+p2pTileDimY)
     d_rdata(p)%v(:,:) = -1.0
     d_tdata(p)%v(:,:) = -1.0
  enddo

where nx and ny are the global matrix sizes, and p2pTileDimY and p2pTileDimX are the sizes of the horizontal slices of the input and transposed matrices, respectively. Note that the device is set on line 194 to the appropriate device before each member v is allocated. Also, since the matrix on the host is stored in h_idata(nx,ny), the offset yOffset is used when initializing d_idata on lines 200-201. The code which performs the various transpose stages is:

  ! Stage 0:
  ! transpose diagonal blocks (local data) before kicking off 
  ! transfers and transposes of other blocks

  do p = 0, nDevices-1
     istat = cudaSetDevice(p)
     if (asyncVersion) then
        call cudaTranspose &
             <<<dimgrid, dimblock,="" 0,="" streamid(p,0)="">>> &
             (d_tdata(p)%v(p*p2pTileDimY+1,1), ny, &
             d_idata(p)%v(p*p2pTileDimX+1,1), nx)
     else
        call cudaTranspose<<<dimgrid, dimblock="">>> &
             (d_tdata(p)%v(p*p2pTileDimY+1,1), ny, &
             d_idata(p)%v(p*p2pTileDimX+1,1), nx)
     endif
  enddo

  ! now send data to blocks to the right of diagonal 
  ! (using mod for wrapping) and transpose

  do stage = 1, nDevices-1    ! stages = offset diagonals
     do rDev = 0, nDevices-1  ! device that receives
        sDev = mod(stage+rDev, nDevices)  ! dev that sends

        if (asyncVersion) then
           istat = cudaSetDevice(rDev)
           istat = cudaMemcpy2DAsync( &
                d_rdata(rDev)%v(sDev*p2pTileDimX+1,1), nx, &
                d_idata(sDev)%v(rDev*p2pTileDimX+1,1), nx, &
                p2pTileDimX, p2pTileDimY, &
                stream=streamID(rDev,stage))
        else
           istat = cudaMemcpy2D( &
                d_rdata(rDev)%v(sDev*p2pTileDimX+1,1), nx, &
                d_idata(sDev)%v(rDev*p2pTileDimX+1,1), nx, &
                p2pTileDimX, p2pTileDimY)
        end if

        istat = cudaSetDevice(rDev)
        if (asyncVersion) then
           call cudaTranspose &
                <<<dimgrid, dimblock,="" 0,="" &="" streamid(rdev,stage)="">>>  &
                (d_tdata(rDev)%v(sDev*p2pTileDimY+1,1), ny, &
                d_rdata(rDev)%v(sDev*p2pTileDimX+1,1), nx)
        else
           call cudaTranspose<<<dimgrid, dimblock="">>> &
                (d_tdata(rDev)%v(sDev*p2pTileDimY+1,1), ny, &
                d_rdata(rDev)%v(sDev*p2pTileDimX+1,1), nx)
        endif
     enddo
  enddo

Stage 0 occurs in the loop on lines 220-232. After the device is set on line 221, the transpose of the diagonal block is performed either using the default, blocking stream on line 228, or in a non-default stream on line 223. The parameter asyncVersion is used to toggle between asynchronous and synchronous execution. The execution configuration used in the kernel launches is determined by:

  dimGrid = dim3(p2pTileDimX/cudaTileDim, &
       p2pTileDimY/cudaTileDim, 1)
  dimBlock = dim3(cudaTileDim, blockRows, 1)

where the thread block is the same as in the single-GPU case, and each kernel launch operates on a submatrix of size p2pTileDimX × p2pTileDimY. The other stages are performed in the loop from line 237-268. After the sending and receiving devices are determined on lines 238 and 239, the peer-to-peer transfer is performed using either cudaMemcpy2DAsync() or cudaMemcpy2D(), depending on asyncVersion. If the asynchronous version is used, then the device is set to the receiving device on line 242, and accordingly the non-default stream used for the transfer is the stream associated with the receiving device. We use the stream associated with the device receiving the data rather than the device sending the data because we want to block the launch of the transpose kernel on the receiving device until the transfer is complete. This is accomplished by default when the same stream is used for the transfer and transpose. For the synchronous data transfer, the device does not need to be specified via cudaSetDevice(). Note that the array receiving the data is in d_rdata. The out-of-place transpose from d_rdata to d_tdata is then performed by the kernel launch on line 257 or 263. Regardless of whether the default stream is used or not, the device must be set as done on line 255. The remainder of the code transfers the data back to the host, checks for correctness, and reports the effective bandwidth. Timing in this case is done using a wall-clock timer. This code uses the C function gettimeofday():

#include 
#include <sys/types.h>
#include 
#include 

double wallclock()
{
  struct timeval tv;
  struct timezone tz;
  double t;

  gettimeofday(&tv, &tz);

  t = (double)tv.tv_sec;
  t += ((double)tv.tv_usec)/1000000.0;

  return t;
}

which is accessed in the Fortran code using the {\tt timing} module:

module timing
  interface wallclock
     function wallclock() result(res) bind(C, name='wallclock')
       use iso_c_binding
       real (c_double) :: res
     end function wallclock
  end interface wallclock
end module timing

Whenever this routine is called, we explicitly check to make sure there is no pending or executing operations on the device:

  ! wait for execution to complete and get wallclock
  do p = 0, nDevices-1
     istat = cudaSetDevice(p)
     istat = cudaDeviceSynchronize()
  enddo
  timeStop = wallclock()

Note that most of this multi-GPU code is overhead associated with declaring and initializing arrays and enabling peer-to-peer communication. The actual data transfers and kernel launches are contained in approximately 50 lines of code, which contains branches for synchronous and asynchronous execution. The transpose kernel itself is only slightly modified from the single-GPU transpose to allow for arbitrary leading dimensions of the arrays. We use a compute node with two devices for running this transpose code. To compare to the single-GPU transpose results, which used 1024 × 1024 matrices, we choose an overall matrix size of 2048 × 2048. In this case each transpose kernel processes a 1024 × 1024 submatrix, the same as in the single-GPU case. When using blocking transfers we obtain the results:

Number of CUDA-capable devices: 2

  Device 0: Tesla M2050
  Device 1: Tesla M2050

Array size: 2048x2048

CUDA block size: 32x8,  CUDA tile size: 32x32
dimGrid: 32x32x1,   dimBlock: 32x8x1

nDevices: 2,  Local input array size: 2048x1024
p2pTileDim: 1024x1024

async mode:  F

Bandwidth (GB/s):   16.43

and when using asynchronous transfers we have:

Number of CUDA-capable devices: 2

  Device 0: Tesla M2050
  Device 1: Tesla M2050

Array size: 2048x2048

CUDA block size: 32x8,  CUDA tile size: 32x32
dimGrid: 32x32x1,   dimBlock: 32x8x1

nDevices: 2,  Local input array size: 2048x1024
p2pTileDim: 1024x1024

async mode:  T

Bandwidth (GB/s):   29.73

While both these numbers fall short of the effective bandwidth achieved in the single-GPU case, one must take into account that half of the data is being transferred over the PCIe bus, which is over an order of magnitude slower than the global memory bandwidth within a GPU. In light of this the use of asynchronous transfers which overlap kernel execution is very advantageous as can be seen from the results. In addition, typically the transpose is used as a means to some other operation that can be done in parallel, in which case cost of the PCI transfer is further amortized. If you enjoyed this post, please check out the book CUDA Fortran for Scientists and Engineers, as well as the related posts linked below.

∥∀

About Greg Ruetsch

Greg Ruetsch is a Senior Applied Engineer at NVIDIA, where he works on CUDA Fortran and performance optimization of HPC codes. He holds a Bachelor’s degree in mechanical and aerospace engineering from Rutgers University and a Ph.D. in applied mathematics from Brown University. Prior to joining NVIDIA he has held research positions at Stanford University’s Center for Turbulence Research and Sun Microsystems Laboratories.
  • Manaure Francisquez

    Hi Greg,
    I just thought I would let you know that I found your posts on CUDA Fortran extremely useful and easy to follow. A great introduction that is informative and enjoyable.
    Maybe I’ll take a peek at CUDA Fortran for Scientists and Engineers next.

    Thank you for your great work.
    Mana

    • Greg Ruetsch

      Hi Mana,
      Thanks for your feedback, I’m glad you enjoy the posts. There are several new CUDA Fortran features slated for 2014 that I’ll be writing about in upcoming posts, I hope you’ll find those useful as well.