# CUDA Pro Tip: Occupancy API Simplifies Launch Configuration

CUDA programmers often need to decide on a block size to use for a kernel launch. For key kernels, its important to understand the constraints of the kernel and the GPU it is running on to choose a block size that will result in good performance. One common heuristic used to choose a good block size is to aim for high occupancy, which is the ratio of the number of active warps per multiprocessor to the maximum number of warps that can be active on the multiprocessor at once. Higher occupancy does not always mean higher performance, but it is a useful metric for gauging the latency hiding ability of a kernel.

Before CUDA 6.5, calculating occupancy was tricky. It required implementing a complex computation that took account of the present GPU and its capabilities (including register file and shared memory size), and the properties of the kernel (shared memory usage, registers per thread, threads per block). Implementating the occupancy calculation is difficult, so very few programmers take this approach, instead using the occupancy calculator spreadsheet included with the CUDA Toolkit to find good block sizes for each supported GPU architecture.

CUDA 6.5 includes several new runtime functions to aid in occupancy calculations and launch configuration. The core occupancy calculator API, cudaOccupancyMaxActiveBlocksPerMultiprocessor produces an occupancy prediction based on the block size and shared memory usage of a kernel. This function reports occupancy in terms of the number of concurrent thread blocks per multiprocessor. Note that this value can be converted to other metrics. Multiplying by the number of warps per block yields the number of concurrent warps per multiprocessor; further dividing concurrent warps by max warps per multiprocessor gives the occupancy as a percentage. Continue reading

# CUDA Pro Tip: Profiling MPI Applications

When I profile MPI+CUDA applications, sometimes performance issues only occur for certain MPI ranks. To fix these, it’s necessary to identify the MPI rank where the performance issue occurs. Before CUDA 6.5 it was hard to do this because the CUDA profiler only shows the PID of the processes and leaves the developer to figure out the mapping from PIDs to MPI ranks. Although the mapping can be done manually, for example for OpenMPI via the command-line option --display-map, it’s tedious and error prone. A solution which solves this for the command-line output of nvprof is described here http://www.parallel-computing.pro/index.php/9-cuda/5-sorting-cuda-profiler-output-of-the-mpi-cuda-program . In this post I will describe how the new output file naming of nvprof to be introduced with CUDA 6.5 can be used to conveniently analyze the performance of a MPI+CUDA application with nvprof and the NVIDIA Visual Profiler (nvvp).

## Profiling MPI applications with nvprof and nvvp

### Collecting data with nvprof

nvprof supports dumping the profile to a file which can be later imported into nvvp. To generate a profile for a MPI+CUDA application I simply start nvprof with the MPI launcher and up to CUDA 6 I used the string “%p” in the output file name. nvprof automatically replaces that string with the PID and generates a separate file for each MPI rank. With CUDA 6.5, the string “%q{ENV}” can be used to name the output file of nvprof. This allows us to include the MPI rank in the output file name by utilizing environment variables automatically set by the MPI launcher (mpirun or mpiexec). E.g. for OpenMPI OMPI_COMM_WORLD_RANK is set to the MPI rank for each launched process.

\$ mpirun -np 2 nvprof -o simpleMPI.%q{OMPI_COMM_WORLD_RANK}.nvprof ./simpleMPI
Running on 2 nodes
==18811== NVPROF is profiling process 18811, command: ./simpleMPI
==18813== NVPROF is profiling process 18813, command: ./simpleMPI
Average of square roots is: 0.667279
PASSED
==18813== Generated result file: simpleMPI.1.nvprof
==18811== Generated result file: simpleMPI.0.nvprof

# CUDA Pro Tip: Minimize the Tail Effect

When I work on the optimization of CUDA kernels, I sometimes see a discrepancy between Achieved and Theoretical Occupancies. The Theoretical Occupancy is the ratio between the number of threads which may run on each multiprocessor (SM) and the maximum number of executable threads per SM (2048 on the Kepler architecture). This value is estimated from the size of the blocks and the amount of resources (registers and shared memory) used by those blocks for a particular GPU and is computed without running the kernel on the GPU. The Achieved Occupancy, on the other hand, is measured from the execution of the kernel (as the number of active warps divided by the number of active cycles compared to the maximum number of executable warps).

Recently, while working on a kernel for a finance benchmark, I could see an Achieved Occupancy of 41.52% whereas the Theoretical Occupancy was 50%. In NVIDIA Nsight Visual Studio Edition, the Instruction per Clock (IPC) showed a lot of load imbalance between the different SMs with respect to the number of executed instructions by the kernel (see the left graph in the figure below).

# CUDA Pro Tip: Improve NVIDIA Visual Profiler Loading of Large Profiles

Some applications launch many tiny kernels, making them prone to very large (100s of megabytes or larger) nvprof timeline dumps, even for application runs of only a handful of seconds.

Such nvprof files may fail to even load when you try to import them into the NVIDIA Visual Profiler (NVVP). One symptom of this problem is that when you click “Finish” on the import screen, NVVP “thinks” for a minute or so, but then just goes right back to the import screen asking you to click Finish again. In other cases, attempting to load a large file can result in NVVP “thinking” about it for many hours.

It turns out that this problem is because of the Java max heap size setting specified in the libnvvp/nvvp.ini file of the CUDA Toolkit installation: the profiler configures the Java VM to cap the heap size at 1GB in order to work even on systems with minimal physical memory.  While this 1GB value is already an improvement over the 512MB setting used in earlier CUDA versions, it is still not enough for some applications, considering that the memory footprint of the profiler can be at least four to five times larger than the input file size.

# CUDA Pro Tip: Fast and Robust Computation of Givens Rotations

A Givens rotation [1] represents a rotation in a plane represented by a matrix of the form

$G(i, j, \theta) = \begin{bmatrix} 1 & \cdots & 0 & \cdots & 0 & \cdots & 0 \\ \vdots & \ddots & \vdots & & \vdots & & \vdots \\ 0 & \cdots & c & \cdots & -s & \cdots & 0 \\ \vdots & & \vdots & \ddots & \vdots & & \vdots \\ 0 & \cdots & s & \cdots & c & \cdots & 0 \\ \vdots & & \vdots & & \vdots & \ddots & \vdots \\ 0 & \cdots & 0 & \cdots & 0 & \cdots & 1 \end{bmatrix}$,

where the intersections of the $i$th and $j$th columns contain the values $c=cos \theta$ and $s=sin \theta$. Multiplying a vector $x$ by a Givens rotation matrix $G(i, j, \theta)$ represents a rotation of the vector $x$ in the $(i, j)$ plane by $\theta$ radians.

According to Wikipedia, the main use of Givens rotations in numerical linear algebra is to introduce zeros in vectors or matrices. Importantly, that means Givens rotations can be used to compute the QR decomposition of a matrix. An important advantage over Householder transformations is that Givens rotations are easy to parallelize. Continue reading

# CUDA Pro Tip: Increase Application Performance with NVIDIA GPU Boost

NVIDIA GPU Boost™ is a feature available on NVIDIA® GeForce® products and NVIDIA® Tesla® products. It makes use of any power headroom to boost application performance. In the case of Tesla, the NVIDIA GPU Boost feature is customized for compute intensive workloads running on clusters. This application note is useful for anyone who wants to take advantage of the power headroom on the Tesla K40 in a server or within a workstation. Note that GPU Boost is a system setting, which means that this Pro Tip applies to any user of a CUDA-accelerated application, not just developers.

The Tesla K40 board targets a specific power budget (235W) when running a highly optimized compute workload, but HPC workloads vary in power consumption and profile, as the graph in Figure 1 shows. This shows that for many applications there is power headroom.  NVIDIA GPU Boost for Tesla allows customers to use available power headroom to select higher graphics clocks using NVML or nvidia-smi.

A great post by Saad Rahim on the Acceleware Blog
covers everything you need to know to use GPU Boost. In the post, Saad benchmarks two applications with varying clocks on K40: Reverse Time Migration (RTM), a depth migration algorithm used to image complex geologies; and a Finite-difference time-domain (FDTD) electromagnetic solver. Continue reading

# CUDA Pro Tip: How to Call Batched cuBLAS routines from CUDA Fortran

When dealing with small arrays and matrices, one method of exposing parallelism on the GPU is to execute the same cuBLAS call on multiple independent systems simultaneously. While you can do this manually by calling multiple cuBLAS kernels across multiple CUDA streams, batched cuBLAS routines enable such parallelism automatically for certain operations (GEMM, GETRF, GETRI, and TRSM).  In this post I’ll show you how to leverage these batched routines from CUDA Fortran.

The C interface batched cuBLAS functions use an array of pointers as one of their arguments, where each pointer in the array points to an independent matrix. This poses a problem for Fortran, which does not allow arrays of pointers. To accommodate this argument, we can make use of the data types declared in the ISO_C_BINDING module, in particular the c_devptr type.  Let’s illustrate this with a code that calls the batched SGETRF cuBLAS routine.

## Writing Interfaces to Batched cuBLAS Routines

At the time of writing this post, the batched cuBLAS routines are not in the CUDA Fortran cublas module, so we first need to define the interface to the cublasSgetrfBatched() call:

interface
integer(c_int) function &
cublasSgetrfBatched(h,n,Aarray,lda,ipvt,info,batchSize) &
bind(c,name='cublasSgetrfBatched')
use iso_c_binding
use cublas
type(cublasHandle), value :: h
integer(c_int), value :: n
type(c_devptr), device :: Aarray(*)
integer(c_int), value :: lda
integer(c_int), device :: ipvt(*)
integer(c_int), device :: info(*)
integer(c_int), value :: batchSize
end function cublasSgetrfBatched
end interface

The arguments of cublasSgetrfBatched() are: Continue reading

# CUDA Pro Tip: Do The Kepler Shuffle

When writing parallel programs, you will often need to communicate values between parallel threads. The typical way to do this in CUDA programming is to use shared memory. But the NVIDIA Kepler GPU architecture introduced a way to directly share data between threads that are part of the same warp. On Kepler, threads of a warp can read each others’ registers by using a new instruction called SHFL, or “shuffle”.

In upcoming posts here on Parallel Forall we will demonstrate uses of shuffle. To prepare, I highly recommend watching the following recording of a GTC 2013 talk by Julien Demouth entitled “Kepler’s SHUFFLE (SHFL): Tips and Tricks”. In the talk, Julien covers many uses for shuffle, including reductions, scans, transpose, and sorting, demonstrating that shuffle is always faster than safe uses of shared memory, and never slower than unsafe uses of shared memory.

Earlier this year on the Acceleware blog, Kelly Goss wrote a detailed post about shuffle, including a detailed example. Like Julien, Kelly provided several reasons to use shuffle. Continue reading

# CUDA Pro Tip: Control GPU Visibility with CUDA_VISIBLE_DEVICES

As a CUDA developer, you will often need to control which devices your application uses. In a short-but-sweet post on the Acceleware blog, Chris Mason writes:

Does your CUDA application need to target a specific GPU? If you are writing GPU enabled code, you would typically use a device query to select the desired GPUs. However, a quick and easy solution for testing is to use the environment variable CUDA_VISIBLE_DEVICES to restrict the devices that your CUDA application sees. This can be useful if you are attempting to share resources on a node or you want your GPU enabled executable to target a specific GPU

As Chris points out, robust applications should use the CUDA API to enumerate and select devices with appropriate capabilities at run time. To learn how, read the section on Device Enumeration in the CUDA Programming Guide. But the CUDA_VISIBLE_DEVICES environment variable is handy for restricting execution to a specific device or set of devices for debugging and testing.  You can also use it to control execution of applications for which you don’t have source code, or to launch multiple instances of a program on a single machine, each with its own environment and set of visible devices. Continue reading

# CUDA Pro Tip: Increase Performance with Vectorized Memory Access

Many CUDA kernels are bandwidth bound, and the increasing ratio of flops to bandwidth in new hardware results in more bandwidth bound kernels. This makes it very important to take steps to mitigate bandwidth bottlenecks in your code. In this post I will show you how to use vector loads and stores in CUDA C/C++ to help increase bandwidth utilization while decreasing the number of executed instructions.

Let’s begin by looking at the following simple memory copy kernel.

__global__ void device_copy_scalar_kernel(int* d_in, int* d_out, int N) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
for (int i = idx; i < N; i += blockDim.x * gridDim.x) {
d_out[i] = d_in[i];
}
}

void device_copy_scalar(int* d_in, int* d_out, int N)
{
}