Using Shared Memory in CUDA Fortran

In the previous post, I looked at how global memory accesses by a group of threads can be coalesced into a single transaction, and how alignment and stride affect coalescing for various generations of CUDA hardware. For recent versions of CUDA hardware, misaligned data accesses are not a big issue. However, striding through global memory is problematic regardless of the generation of the CUDA hardware, and would seem to be unavoidable in many cases, such as when accessing elements in a multidimensional array along the second and higher dimensions. However, it is possible to coalesce memory access in such cases if we use shared memory. Before I show you how to avoid striding through global memory in the next post, first I need to describe shared memory in some detail.

Shared Memory

Because it is on-chip, shared memory is much faster than local and global memory. In fact, shared memory latency is roughly 100x lower than uncached global memory latency (provided that there are no bank conflicts between the threads, which we will examine later in this post). Shared memory is allocated per thread block, so all threads in the block have access to the same shared memory. Threads can access data in shared memory loaded from global memory by other threads within the same thread block. This capability (combined with thread synchronization) has a number of uses, such as user-managed data caches, high-performance cooperative parallel algorithms (such as parallel reductions), and to facilitate global memory coalescing in cases where it would otherwise not be possible.

Thread Synchronization

When sharing data between threads, we need to be careful to avoid race conditions, because while threads in a block run logically in parallel, not all threads can execute physically at the same time. Let’s say that two threads A and B each load a data element from global memory and store it to shared memory. Then, thread A wants to read B’s element from shared memory, and vice versa. Let’s assume that A and B are threads in two different warps. If B has not finished writing its element before A tries to read it, we have a race condition, which can lead to undefined behavior and incorrect results.

To ensure correct results when parallel threads cooperate, we must synchronize the threads. CUDA provides a simple barrier synchronization primitive, syncthreads(). A thread’s execution can only proceed past a syncthreads() after all threads in its block have executed the syncthreads(). Thus, we can avoid the race condition described above by calling syncthreads() after the store to shared memory and before any threads load from shared memory.

Shared Memory Example

Declare shared memory in CUDA Fortran using the shared variable qualifier in device code. There are multiple ways to declare shared memory inside a kernel, depending on whether the amount of memory is known at compile time or at run time. The following complete code illustrates various methods of using shared memory.

module reverse_m
  implicit none
  integer, device :: n_d
contains
  attributes(global) subroutine staticReverse(d)
    real :: d(:) 
    integer :: t, tr
    real, shared :: s(64) 
    t = threadIdx%x
    tr = size(d)-t+1
    s(t) = d(t)
    call syncthreads() 
    d(t) = s(tr)
  end subroutine staticReverse

  attributes(global) subroutine dynamicReverse1(d)
    real :: d(:) 
    integer :: t, tr
    real, shared :: s(*)
    t = threadIdx%x
    tr = size(d)-t+1
    s(t) = d(t)
    call syncthreads() 
    d(t) = s(tr)
  end subroutine dynamicReverse1

  attributes(global) subroutine dynamicReverse2(d, nSize)
    real :: d(nSize) 
    integer, value :: nSize
    integer :: t, tr
    real, shared :: s(nSize)
    t = threadIdx%x
    tr = nSize-t+1
    s(t) = d(t)
    call syncthreads() 
    d(t) = s(tr)
  end subroutine dynamicReverse2

  attributes(global) subroutine dynamicReverse3(d)
    real :: d(n_d) 
    real, shared :: s(n_d)
    integer :: t, tr
    t = threadIdx%x
    tr = n_d-t+1
    s(t) = d(t)
    call syncthreads() 
    d(t) = s(tr)
  end subroutine dynamicReverse3
end module reverse_m

program sharedExample
  use cudafor
  use reverse_m
  implicit none
  integer, parameter :: n = 64
  real :: a(n), r(n), d(n)
  real, device :: d_d(n)
  type(dim3) :: grid,threadblock
  integer :: i 

  threadBlock = dim3(n,1,1)
  grid = dim3(1,1,1)
  do i = 1, n
    a(i) = i
    r(i) = n-i+1
  enddo

  ! run version with static shared memory
  d_d = a
  call staticReverse<<<grid,threadblock>>>(d_d)
  d = d_d
  write(*,*) 'Static case max error:', maxval(abs(r-d))

  ! run dynamic shared memory version 1
  d_d = a
  call dynamicReverse1<<<grid,threadblock,4*threadblock%x>>>(d_d)
  d = d_d
  write(*,*) 'Dynamic case 1 max error:', maxval(abs(r-d))

  ! run dynamic shared memory version 2
  d_d = a
  call dynamicReverse2<<<grid,threadblock,4*threadblock%x>>>(d_d,n)
  d = d_d
  write(*,*) 'Dynamic case 2 max error:', maxval(abs(r-d))

  ! run dynamic shared memory version 3
  n_d = n ! n_d declared in reverse_m
  d_d = a
  call dynamicReverse3<<<grid,threadblock,4*threadblock%x>>>(d_d)
  d = d_d
  write(*,*) 'Dynamic case 3 max error:', maxval(abs(r-d)) 
end program sharedExample

This code reverses the data in a 64-element array using shared memory. All of the kernel codes are very similar, the main difference is how the shared memory arrays are declared, and how the kernels are invoked.

Static Shared Memory

If the shared memory array size is known at compile time, as in the staticReverse kernel, then we can explicitly declare an array of that size, using either an integer parameter or a literal as with the array s.

 attributes(global) subroutine staticReverse(d)
    real :: d(:) 
    integer :: t, tr
    real, shared :: s(64) 
    t = threadIdx%x
    tr = size(d)-t+1
    s(t) = d(t)
    call syncthreads() 
    d(t) = s(tr)
  end subroutine staticReverse

In this kernel, t and tr are the two indices representing the original and reverse order. Threads copy the data from global memory to shared memory with the statement s(t) = d(t), and the reversal is done two lines later with the statement d(t) = s(tr). But before executing this final line in which each thread accesses data in shared memory that was written by another thread, remember that we need to make sure all threads have completed the loads to shared memory, by calling syncthreads().

The reason shared memory is used in this example is to facilitate global memory coalescing on older CUDA devices (Compute Capability 1.1 or earlier). Optimal global memory coalescing is achieved for both reads and writes because global memory is always accessed through the linear, aligned index t. The reversed index tr is only used to access shared memory, which does not have the access restrictions global memory has for optimal performance. The only performance issue with shared memory is bank conflicts, which is discussed later. (Note that on devices of Compute Capability 1.2 or later, the memory system can fully coalesce even the reversed index stores to global memory. But this technique is still useful for other access patterns, as I’ll show in the next post.)

Dynamic Shared Memory

The other three kernels in this example use dynamic shared memory, where the amount of shared memory is not known at compile time and must be specified (in bytes) when the kernel is invoked in the optional third execution configuration parameter, as in the following excerpt.

  call dynamicReverse1<<<grid,<span class="hiddenSpellError">threadBlock,4*threadBlock%x>>>(d_d)
  ...
  call dynamicReverse2<<<grid,threadblock,4*threadblock%x>>>(d_d,n)
  ...
  call dynamicReverse3<<<grid,threadblock,4*threadblock%x>>>(d_d)</span>

The first dynamic shared memory kernel, dynamicReverse1(), declares the shared memory array using an assumed-size array syntax, s(*). The size is implicitly determined from the third execution configuration parameter when the kernel is launched. The remainder of the kernel code is identical to the staticReverse() kernel.

As of version 11.6 of the PGI compiler, you can also can use dynamic shared memory via automatic arrays, as shown in dynamicReverse2() and dynamicReverse3(). In these cases, the dimension of the dynamic shared memory array is specified by an integer that is in scope. In dynamicReverse2 the subroutine argument nSize is used to declare the shared memory array size, and in dynamicReverse3() the device variable n_d declared in the beginning of the module is used to declare the shared memory array size.

  attributes(global) subroutine dynamicReverse2(d, nSize)
    real :: d(nSize) 
    integer, value :: nSize
    integer :: t, tr
    real, shared :: s(nSize)
    t = threadIdx%x
    tr = nSize-t+1
    s(t) = d(t)
    call syncthreads() 
    d(t) = s(tr)
  end subroutine dynamicReverse2

attributes(global) subroutine dynamicReverse3(d)
    real :: d(n_d) 
    real, shared :: s(n_d)
    integer :: t, tr
    t = threadIdx%x
    tr = n_d-t+1
    s(t) = d(t)
    call syncthreads() 
    d(t) = s(tr)
  end subroutine dynamicReverse3

Note that in both these cases the amount of dynamic memory must still be specified in the third parameter of the execution configuration when the kernel is invoked.

Given these options for declaring dynamic shared memory, which one should you use? If you want to use multiple dynamic shared memory arrays, especially if they are of different types, then you need to use the automatic arrays as in dynamicReverse2() and dynamicReverse3(), because the compiler cannot determine how to divide the total amount of dynamic shared memory amongst the assumed-size array declarations. Aside from that case, the choice is yours, and there is no performance difference between these methods of declaration.

Shared memory bank conflicts

To achieve high memory bandwidth for concurrent accesses, shared memory is divided into equally sized memory modules (banks) that can be accessed simultaneously. Therefore, any memory load or store of n addresses that spans b distinct memory banks can be serviced simultaneously, yielding an effective bandwidth that is b times as high as the bandwidth of a single bank.

However, if multiple threads’ requested addresses map to the same memory bank, the accesses are serialized. The hardware splits a conflicting memory request into as many separate conflict-free requests as necessary, decreasing the effective bandwidth by a factor equal to the number of colliding memory requests. An exception is the case where all threads in warp address the same shared memory address, resulting in a broadcast. Devices of compute capability 2.0 and higher have the additional ability to multicast shared memory accesses, meaning that multiple accesses to the same location by any number of threads within a warp are served simultaneously.

To minimize bank conflicts, it is important to understand how memory addresses map to memory banks. Shared memory banks are organized such that successive 32-bit words are assigned to successive banks and the bandwidth is 32 bits per bank per clock cycle. For devices of compute capability 1.x, the warp size is 32 threads and the number of banks is 16. A shared memory request for a warp is split into one request for the first half of the warp and one request for the second half of the warp. Note that no bank conflict occurs if only one memory location per bank is accessed by a half warp of threads.

For devices of compute capability 2.0, the warp size is 32 threads and the number of banks is also 32. A shared memory request for a warp is not split as with devices of compute capability 1.x, meaning that bank conflicts can occur between threads in the first half of a warp and threads in the second half of the same warp.

Devices of compute capability 3.x have configurable bank size, which can be set using cudaDeviceSetSharedMemConfig() to either four bytes (cudaSharedMemBankSizeFourByte, the default) or eight bytes (cudaSharedMemBankSizeEightByte). Setting the bank size to eight bytes can help avoid shared memory bank conflicts when accessing double precision data.

Configuring the amount of shared memory

On devices of compute capability 2.x and 3.x, each multiprocessor has 64KB of on-chip memory that can be partitioned between L1 cache and shared memory. For devices of compute capability 2.x, there are two settings, 48KB shared memory / 16KB L1 cache, and 16KB shared memory / 48KB L1 cache. By default the 48KB shared memory setting is used. This can be configured during runtime from the host for all kernels using cudaDeviceSetCacheConfig() or on a per-kernel basis using cudaFuncSetCacheConfig(). These accept one of three options: cudaFuncCachePreferNone, cudaFuncCachePreferShared, and cudaFuncCachePreferL1. The driver will honor the specified preference except when a kernel requires more shared memory per thread block than available in the specified configuration. Devices of compute capability 3.x allow a third setting of 32KB shared memory / 32KB L1 cache which can be obtained using the option cudaFuncCachePreferEqual.

Summary

Shared memory is a powerful feature for writing well optimized CUDA code. Access to shared memory is much faster than global memory access because it is located on chip, and since shared memory is shared amongst threads in a thread block, it provides a mechanism for threads to cooperate. One way to use shared memory that leverages such thread cooperation is to enable global memory coalescing, as demonstrated by the array reversal in this post. By reversing the array using shared memory we are able to have all global memory reads and writes performed with unit stride, achieving full coalescing on any CUDA GPU. In the next post I will continue our discussion of shared memory by using it to optimize a matrix transpose.

∥∀

About Greg Ruetsch

Greg Ruetsch is a Senior Applied Engineer at NVIDIA, where he works on CUDA Fortran and performance optimization of HPC codes. He holds a Bachelor’s degree in mechanical and aerospace engineering from Rutgers University and a Ph.D. in applied mathematics from Brown University. Prior to joining NVIDIA he has held research positions at Stanford University’s Center for Turbulence Research and Sun Microsystems Laboratories.