CUDA Pro Tip: Occupancy API Simplifies Launch Configuration

CUDA programmers often need to decide on a block size to use for a kernel launch. For key kernels, its important to understand the constraints of the kernel and the GPU it is running on to choose a block size that will result in good performance. One common heuristic used to choose a good block size is to aim for high occupancy, which is the ratio of the number of active warps per multiprocessor to the maximum number of warps that can be active on the multiprocessor at once. Higher occupancy does not always mean higher performance, but it is a useful metric for gauging the latency hiding ability of a kernel.

Before CUDA 6.5, calculating occupancy was tricky. It required implementing a complex computation that took account of the present GPU and its capabilities (including register file and shared memory size), and the properties of the kernel (shared memory usage, registers per thread, threads per block). Implementating the occupancy calculation is difficult, so very few programmers take this approach, instead using the occupancy calculator spreadsheet included with the CUDA Toolkit to find good block sizes for each supported GPU architecture.

CUDA 6.5 includes several new runtime functions to aid in occupancy calculations and launch configuration. The core occupancy calculator API, cudaOccupancyMaxActiveBlocksPerMultiprocessor produces an occupancy prediction based on the block size and shared memory usage of a kernel. This function reports occupancy in terms of the number of concurrent thread blocks per multiprocessor. Note that this value can be converted to other metrics. Multiplying by the number of warps per block yields the number of concurrent warps per multiprocessor; further dividing concurrent warps by max warps per multiprocessor gives the occupancy as a percentage.

CUDA 6.5 also introduces occupancy-based launch configurator APIs, cudaOccupancyMaxPotentialBlockSize and cudaOccupancyMaxPotentialBlockSizeVariableSMem, which heuristically calculate a block size that achieves the maximum multiprocessor-level occupancy. You can use the VariableSmem version for kernels where the amount of shared memory allocated depends on the number of threads per block. Note that there are also CUDA driver API equivalents of these functions. The following example demonstrates the use of these APIs. It first chooses a reasonable block size by calling cudaOccupancyMaxPotentialBlockSize, and then calculates the theoretical maximum occupancy the kernel will achieve on the present device by calling cudaGetDeviceProperties and cudaOccupancyMaxActiveBlocksPerMultiprocessor.

#include "stdio.h"

__global__ void MyKernel(int *array, int arrayCount) 
{ 
  int idx = threadIdx.x + blockIdx.x * blockDim.x; 
  if (idx < arrayCount) 
  { 
    array[idx] *= array[idx]; 
  } 
} 

void launchMyKernel(int *array, int arrayCount) 
{ 
  int blockSize;   // The launch configurator returned block size 
  int minGridSize; // The minimum grid size needed to achieve the 
                   // maximum occupancy for a full device launch 
  int gridSize;    // The actual grid size needed, based on input size 

  cudaOccupancyMaxPotentialBlockSize( &minGridSize, &blockSize, 
                                      MyKernel, 0, 0); 
  // Round up according to array size 
  gridSize = (arrayCount + blockSize - 1) / blockSize; 

  MyKernel<<< gridSize, blockSize >>>(array, arrayCount); 

  cudaDeviceSynchronize(); 

  // calculate theoretical occupancy
  int maxActiveBlocks;
  cudaOccupancyMaxActiveBlocksPerMultiprocessor( &maxActiveBlocks, 
                                                 MyKernel, blockSize, 
                                                 0);

  int device;
  cudaDeviceProp props;
  cudaGetDevice(&device);
  cudaGetDeviceProperties(&props, device);

  float occupancy = (maxActiveBlocks * blockSize / props.warpSize) / 
                    (float)(props.maxThreadsPerMultiProcessor / 
                            props.warpSize);

  printf("Launched blocks of size %d. Theoretical occupancy: %f\n", 
         blockSize, occupancy);
}

cudaOccupancyMaxPotentialBlockSize makes it possible to compute a reasonably efficient execution configuration for a kernel without having to directly query the kernel’s attributes or the device properties, regardless of what device is present or any compilation details. This can greatly simplify the task of frameworks (such as Thrust), that must launch user-defined kernels. This is also handy for kernels that are not primary performance bottlenecks, where the programmer just wants a simple way to run the kernel with correct results, rather than hand-tuning the execution configuration.

The CUDA Toolkit version 6.5 also provides a self-documenting, standalone occupancy calculator and launch configurator implementation in <CUDA_Toolkit_Path>/include/cuda_occupancy.h for any use cases that cannot depend on the CUDA software stack. A spreadsheet version of the occupancy calculator is also included (and has been for many CUDA releases). The spreadsheet version is particularly useful as a learning tool that visualizes the impact of changes to the parameters that affect occupancy (block size, registers per thread, and shared memory per thread). You can find more information in the CUDA C Programming Guide and CUDA Runtime API Reference.

CUDA Pro Tip: Occupancy API Simplifies Launch Configuration

Related resources

Tags

About the Authors

CUDA Pro Tip: Occupancy API Simplifies Launch Configuration

Related resources

Tags

About the Authors

Comments

Related posts

Optimizing GPU Utilization with Nsight Compute 2021.3

CUDA 7.5: Pinpoint Performance Problems with Instruction-Level Profiling

CUDA Pro Tip: Minimize the Tail Effect

CUDA Pro Tip: Increase Performance with Vectorized Memory Access

How to Access Global Memory Efficiently in CUDA Fortran Kernels

Related posts

CUDA Pro Tip: The Fast Way to Query Device Properties

Pro Tip: Improved GLSL Syntax for Vulkan DescriptorSet Indexing

Pro Tip: Pinpointing Runtime Errors in CUDA Fortran

Pro Tip: Linking OpenGL for Server-Side Rendering

Pro Tip: cuBLAS Strided Batched Matrix Multiply