
Voting and Shuffling to Optimize Atomic Operations

Some years ago I started work on my first CUDA implementation of the Multiparticle Collision Dynamics (MPC) algorithm, a particle-in-cell code used to simulate hydrodynamic interactions between solvents and solutes. As part of this algorithm, a number of particle parameters are summed to calculate certain cell parameters. This was in the days of the Tesla GPU architecture (such as GT200 GPUs, Compute Capability 1.x), which had poor atomic operation performance. A linked list approach I developed worked well on Tesla and Fermi as an alternative to atomic adds but performed poorly on Kepler GPUs. However, atomic operations are much faster on the Kepler and Maxwell architectures, so it makes sense to use atomic adds.

These types of summations are not limited to MPC or particle-in-cell codes; to some extent, they occur whenever data elements are aggregated by key. For data elements sorted and combined by key with a large number of possible values, pre-combining elements with the same key at warp level can lead to a significant speed-up. In this post, I will describe algorithms for speeding up your summations (or similar aggregations) for problems with a large number of keys where there is a reasonable correlation between the thread index and the key. This is usually the case for elements that are at least partially sorted. Unfortunately, this argument works in both directions: these algorithms are not for you if your number of keys is small or your distribution of keys is random. To clarify: by a “large” number of keys I mean more than could be handled if all bins were put into shared memory.

Note that this technique is related to a previously posted technique called warp-aggregated atomics by Andrey Adinetz, and also to the post Fast Histograms Using Shared Atomics on Maxwell by Nikolay Sakharnykh. The main difference here is that we are aggregating many groups, each designated by a key (to compute a histogram, for example). So you could consider this technique “warp-aggregated atomic reduction by key”.
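
To make the idea concrete, here is a minimal sketch of a warp-aggregated atomic add by key. It is not the code from the post (which targets Kepler and Maxwell and predates these APIs); it assumes compute capability 7.0 or later and CUDA 11 or later, and the function and parameter names are illustrative.

#include <cooperative_groups.h>
#include <cooperative_groups/reduce.h>
namespace cg = cooperative_groups;

// One atomicAdd per distinct key per warp instead of one per thread.
__device__ void atomicAddByKey(float *bins, int key, float val)
{
    // group the currently active lanes by key
    cg::coalesced_group active = cg::coalesced_threads();
    cg::coalesced_group peers  = cg::labeled_partition(active, key);
    // pre-combine the values of all lanes that share this key
    float sum = cg::reduce(peers, val, cg::plus<float>());
    // one elected lane issues the single atomic for the whole group
    if (peers.thread_rank() == 0)
        atomicAdd(&bins[key], sum);
}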


GPU Pro Tip: Fast Great-Circle Distance Calculation in CUDA C++

This post demonstrates the practical utility of CUDA’s sinpi() and cospi() functions in the context of distance calculations on Earth. With the advent of location-aware and geospatial applications and geographical information systems (GIS), these distance computations have become commonplace.

A great circle divides a sphere into two hemispheres. Image: Jhbdel at en.wikipedia [CC BY-SA 3.0], via Wikimedia Commons
Wikipedia defines a great circle as

A great circle, also known as an orthodrome or Riemannian circle, of a sphere is the intersection of the sphere and a plane which passes through the center point of the sphere.

For almost any pair of points on the surface of a sphere, the shortest (surface) distance between these points is the path along the great circle between them. If you have ever flown from Europe to the west coast of North America and wondered why you passed over Greenland, your flight most likely followed a great circle path in order to conserve fuel.
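
The excerpt stops before the code, but the following sketch shows the kind of use the post describes: a haversine great-circle distance where sinpi() and cospi() take angles in units of pi, so coordinates in degrees only need to be divided by 180 (or by 360 for half-angles). The function name and the spherical Earth radius of 6371 km are assumptions, not the post's exact implementation.

// Great-circle distance in km between two points given in degrees,
// using the haversine formulation on a sphere of radius R.
__device__ double greatCircleKm(double lat1, double lon1,
                                double lat2, double lon2)
{
    const double R = 6371.0;                      // assumed mean Earth radius
    double sdlat = sinpi((lat2 - lat1) / 360.0);  // sin(delta_lat / 2)
    double sdlon = sinpi((lon2 - lon1) / 360.0);  // sin(delta_lon / 2)
    double a = sdlat * sdlat +
               cospi(lat1 / 180.0) * cospi(lat2 / 180.0) * sdlon * sdlon;
    return 2.0 * R * asin(fmin(1.0, sqrt(a)));
}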


GPU Pro Tip: Lerp Faster in C++

Linear interpolation is a simple and fundamental numerical calculation prevalent in many fields. It’s so common in computer graphics that programmers often use the verb “lerp” to refer to linear interpolation, a function that’s built into all modern graphics hardware (often in multiple hardware units).

Linear Interpolation (from Wikipedia)

You can enable linear interpolation (also known as linear filtering) on texture fetches in CUDA kernels. This hardware filtering uses a low-precision interpolant, so for this and other reasons it’s common to lerp in software.

The standard way to lerp is:

(1-t)*v0 + t*v1

Here’s a generic host/device function that performs a lerp:

template <typename T>
__host__ __device__
inline T lerp(T v0, T v1, T t) {
    return (1-t)*v0 + t*v1;
}

But we can do better.
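
The excerpt ends here, but one likely direction, shown as a sketch rather than the post's exact code, is to rewrite the expression as two fused multiply-adds, which saves an instruction and improves accuracy; this assumes T is float or double so that the fma() overloads apply.

#include <cmath>   // host fma(); device code uses CUDA's built-in overloads

template <typename T>
__host__ __device__
inline T lerp(T v0, T v1, T t) {
    // t*v1 + (v0 - t*v0) == (1-t)*v0 + t*v1, computed with two fused multiply-adds
    return fma(t, v1, fma(-t, v0, v0));
}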


GPU Pro Tip: Fast Histograms Using Shared Atomics on Maxwell

Histograms are an important data representation with many applications in computer vision, data analytics and medical imaging. A histogram is a graphical representation of the data distribution across predefined bins. The input data set and the number of bins can vary greatly depending on the domain, so let’s focus on one of the most common use cases: an image histogram using 256 bins for each color channel. Even though we’ll use a specific problem setup, the same algorithms can benefit other computational domains as well.

A basic serial image histogram computation is relatively simple. For each pixel of the image and for each RGB color channel we find a corresponding integer bin from 0 to 255 and increment its value. Atomic operations are a natural way of implementing histograms on parallel architectures. Depending on the input distribution, some bins will be used much more than others, so it is necessary to support efficient accumulation of the values across the full memory hierarchy. This is similar to reduction and scan operations, but the main challenge with histograms is that the output location for each element is not known prior to reading its value. Therefore, it is impossible to create a generic parallel accumulation scheme that completely avoids collisions. Histograms are now much easier to handle on GPU architectures thanks to the improved atomics performance in Kepler and native support of shared memory atomics in Maxwell.
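
For reference, the serial version for a single 8-bit channel might look like the following (names assumed); the parallel versions discussed below compute the same counts.

// Serial histogram of one 8-bit channel: find each pixel's bin and increment it.
void histogramCPU(const unsigned char *channel, int numPixels,
                  unsigned int bins[256])
{
    for (int b = 0; b < 256; b++) bins[b] = 0;
    for (int i = 0; i < numPixels; i++)
        bins[channel[i]]++;
}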

Figure 1: The two-phase parallel histogram algorithm.

Our histogram implementation has two phases and two corresponding CUDA C++ kernels, as Figure 1 shows. In the first phase each CUDA thread block processes a region of the image and accumulates a corresponding local histogram, storing the local histogram in global memory at the end of the phase. The second kernel accumulates all per-block histograms into the final histogram stored in global memory. The work separation between blocks in the first phase reduces contention when accumulating values into the same bin.
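
The excerpt cuts off before the kernels; as a hedged sketch (not the post's exact code), phase 1 for a single 8-bit channel might look like the following, with each block accumulating into a shared-memory histogram via fast shared atomics and then writing its partial result to global memory for the phase 2 reduction.

__global__ void histogram_per_block(const unsigned char *data, int n,
                                    unsigned int *partial)  // gridDim.x * 256 entries
{
    __shared__ unsigned int smem[256];
    for (int i = threadIdx.x; i < 256; i += blockDim.x)
        smem[i] = 0;
    __syncthreads();

    // grid-stride loop over this block's share of the input
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x)
        atomicAdd(&smem[data[i]], 1);
    __syncthreads();

    // store the per-block histogram for the second kernel to accumulate
    for (int i = threadIdx.x; i < 256; i += blockDim.x)
        partial[blockIdx.x * 256 + i] = smem[i];
}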


GPU Pro Tip: Fast Dynamic Indexing of Private Arrays in CUDA

Sometimes you need to use small per-thread arrays in your GPU kernels. The performance of accessing elements in these arrays can vary depending on a number of factors. In this post I’ll cover several common scenarios ranging from fast static indexing to more complex and challenging use cases.

Static indexing

Before discussing dynamic indexing let’s briefly look at static indexing. For small arrays where all indices are known constants at compile time, as in the following sample code, the compiler places all accessed elements of the array into registers.

__global__ void kernel1(float * buf)
{
    float a[2];
    ...
    float sum = a[0] + a[1];
    ...
}

This way array elements are accessed in the fastest way possible: math instructions use the data directly without loads and stores.

A slightly more complex (and probably more useful) case is an unrolled loop over the indices of the array. In the following code the compiler is also capable of assigning the accessed array elements to registers.

__global__ void kernel2(float * buf)
{
    float a[5];
    ...
    float sum = 0.0f;
    #pragma unroll
    for(int i = 0; i < 5; ++i)
        sum += a[i];
    ...
}

Here we tell the compiler to unroll the loop with the directive #pragma unroll, effectively replacing the loop with all the iterations listed explicitly, as in the following snippet.

sum += a[0];
sum += a[1];
sum += a[2];
sum += a[3];
sum += a[4];

All the indices are now constants, so the compiler puts the whole array into registers.
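
For contrast, here is a hypothetical kernel (names and data layout assumed) where the index is only known at run time; in such cases the compiler can no longer keep the array in registers and typically places it in local memory, which is the scenario the rest of the post addresses.

__global__ void kernel3(float *buf, const int *indices)
{
    float a[5];
    #pragma unroll
    for (int i = 0; i < 5; ++i)
        a[i] = buf[5 * threadIdx.x + i];

    // idx is not a compile-time constant, so a[] cannot live purely in registers
    int idx = indices[threadIdx.x];
    buf[threadIdx.x] = a[idx];
}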


GPU Pro Tip: CUDA 7 Streams Simplify Concurrency

Heterogeneous computing is about efficiently using all processors in the system, including CPUs and GPUs. To do this, applications must execute functions concurrently on multiple processors. CUDA applications manage concurrency by executing asynchronous commands in streams: sequences of commands that execute in order. Different streams may execute their commands concurrently or out of order with respect to each other. (See the post How to Overlap Data Transfers in CUDA C/C++ for an example.)

When you execute asynchronous CUDA commands without specifying a stream, the runtime uses the default stream. Before CUDA 7, the default stream was a special stream that implicitly synchronized with all other streams on the device.

CUDA 7 introduces a ton of powerful new functionality, including a new option to use an independent default stream for every host thread, which avoids the serialization of the legacy default stream. In this post I’ll show you how this can simplify achieving concurrency between kernels and data copies in CUDA programs.
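
As a preview of what this enables, the following sketch (hypothetical kernel and helper names) launches work from several OpenMP host threads; compiled with nvcc --default-stream per-thread, each host thread's launch goes to its own default stream and the kernels can overlap instead of serializing on the legacy default stream.

#include <omp.h>

__global__ void busyKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = sqrtf(data[i]) + 1.0f;
}

// buffers[t] is a device allocation private to host thread t
void launchFromHostThreads(float **buffers, int n, int numThreads)
{
    #pragma omp parallel num_threads(numThreads)
    {
        int t = omp_get_thread_num();
        // with per-thread default streams, these launches may run concurrently
        busyKernel<<<(n + 255) / 256, 256>>>(buffers[t], n);
    }
    cudaDeviceSynchronize();
}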


CUDA Pro Tip: Optimized Filtering with Warp-Aggregated Atomics

In this post, I’ll introduce warp-aggregated atomics, a useful technique to improve performance when many threads atomically add to a single counter. In warp aggregation, the threads of a warp first compute a total increment among themselves, and then elect a single thread to atomically add the increment to a global counter. This aggregation reduces the number of atomics performed by up to the number of threads in a warp (up to 32x on current GPUs), and can dramatically improve performance. Moreover, in many typical cases, you can implement warp aggregation as a drop-in replacement for standard atomic operations, so it is useful as a simple way to improve performance of complex applications.
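
Here is a minimal sketch of the idea using the current *_sync warp intrinsics (so not necessarily the post's original code): the active lanes elect a leader, the leader performs the single atomic for the whole group, and each lane computes its own offset within the group from the active mask.

__device__ int atomicAggInc(int *counter)
{
    unsigned int active = __activemask();           // lanes currently executing
    int lane   = threadIdx.x & 31;
    int leader = __ffs(active) - 1;                 // lowest active lane leads
    int base;
    if (lane == leader)
        base = atomicAdd(counter, __popc(active));  // one atomic for the warp
    base = __shfl_sync(active, base, leader);       // broadcast the base offset
    // each lane's offset is its rank among the active lanes below it
    return base + __popc(active & ((1u << lane) - 1));
}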

Problem: Filtering by a Predicate

Let’s consider the following problem, which we call filtering: we have a source array, src, containing n elements, and a predicate, and we need to copy all elements of src satisfying the predicate into the destination array, dst. For the sake of simplicity, assume that dst has a length of at least n and that the order of elements in the dst array does not matter. For our example, we assume that the array elements are integers, and the predicate is true if and only if the element is positive. Here is a sample CPU implementation of filtering.

int filter(int *dst, const int *src, int n) {
  int nres = 0;
  for (int i = 0; i < n; i++)
    if (src[i] > 0)
      dst[nres++] = src[i];
  // return the number of elements copied
  return nres;
}

Filtering, also known as stream compaction, is a common operation, and it is a part of the standard libraries of many programming languages, where it goes under a variety of names, including grep, copy_if, select, FindAll and so on. It is also very often implemented simply as a loop, as it may be very tightly integrated with the surrounding code.
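
On the GPU, the filtering loop maps naturally onto a kernel in which every accepted element reserves a slot in dst by incrementing a global counter. A hedged sketch (hypothetical kernel name) using the atomicAggInc() sketched earlier as a drop-in for a plain atomicAdd(nres, 1):

__global__ void filter_k(int *dst, int *nres, const int *src, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && src[i] > 0)
        dst[atomicAggInc(nres)] = src[i];   // aggregated: one atomic per warp
}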


CUDA Pro Tip: Use cuFFT Callbacks for Custom Data Processing

Digital signal processing (DSP) applications commonly transform input data before performing an FFT, or transform output data afterwards. For example, if the input data is supplied as low-resolution samples from an 8-bit analog-to-digital (A/D) converter, the samples may first have to be expanded into 32-bit floating point numbers before the FFT and the rest of the processing pipeline can start.

The cuFFT library included with CUDA 6.5 introduces device callbacks to improve the performance of this sort of transform. Callback routines are user-supplied device functions that cuFFT calls when loading or storing data. You can use callbacks to implement many pre- or post-processing operations that required launching separate CUDA kernels before CUDA 6.5.
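
As a rough sketch of how a load callback is wired up (assumed names, and not the post's exact code; callbacks also require building with relocatable device code and linking the static cuFFT library), an 8-bit-to-float conversion might look like this:

#include <cufft.h>
#include <cufftXt.h>

// Convert one 8-bit fixed-point sample to float as cuFFT loads it
// (signature follows the cufftCallbackLoadR type).
__device__ cufftReal load8BitSample(void *dataIn, size_t offset,
                                    void *callerInfo, void *sharedPtr)
{
    signed char v = ((signed char *)dataIn)[offset];
    return (cufftReal)v / 127.0f;
}

// Device-side pointer to the callback; copied to the host for registration.
__device__ cufftCallbackLoadR d_loadCallbackPtr = load8BitSample;

void attachLoadCallback(cufftHandle plan)
{
    cufftCallbackLoadR h_loadCallbackPtr;
    cudaMemcpyFromSymbol(&h_loadCallbackPtr, d_loadCallbackPtr,
                         sizeof(h_loadCallbackPtr));
    cufftXtSetCallback(plan, (void **)&h_loadCallbackPtr,
                       CUFFT_CB_LD_REAL, NULL);
}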

Example DSP Pipeline

In this blog post we will implement the first stages of a typical DSP pipeline as depicted in Figure 1. We will first discuss a solution without callbacks using multiple custom kernels, which we then use as a stepping stone toward a solution based on cuFFT device callbacks. The source code for both versions is available on GitHub.

Figure 1: The processing pipeline for our example before and with CUDA 6.5 callbacks.

Batches of 8-bit fixed-point samples are input to the DSP pipeline from an A/D converter. Each sample consists of 1024 data points. For more efficient processing, we group samples into batches of 1000 samples each. Therefore, you can think of this input as a 1000×1024 matrix of 8-bit fixed-point values.


CUDA Pro Tip: Always Set the Current Device to Avoid Multithreading Bugs

We often say that to reach high performance on GPUs you should expose as much parallelism in your code as possible, and we don’t mean just parallelism within one GPU, but also across multiple GPUs and CPUs. It’s common for high-performance software to parallelize across multiple GPUs by assigning one or more CPU threads to each GPU. In this post I’ll cover a common but subtle bug and a simple rule that will help you avoid it within your own software (spoiler alert: it’s in the title!).

Let’s review how to select which GPU to execute CUDA calls on. The CUDA runtime API is state-based, and threads execute cudaSetDevice() to set the current GPU.

cudaError_t cudaSetDevice(int device)

After this call all CUDA API commands go to the currently set device until cudaSetDevice() is called again with a different device ID. The CUDA runtime API is thread-safe, which means it maintains per-thread state about the current device. This is very important as it allows threads to concurrently submit work to different devices, but forgetting to set the current device in each thread can lead to subtle and hard-to-find bugs like the following example.

cudaSetDevice(1);
cudaMalloc(&a,bytes);

#pragma omp parallel
{
  kernel<<<blocks,threads>>>(a);
}

While at first glance this code may seem bug-free, it is incorrect.
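
The explanation is behind the cut, but the bug is consistent with the title: the OpenMP region spawns new host threads whose current device defaults to device 0, so the kernel launches target a different device than the one where a was allocated. A minimal sketch of the fix, following the rule in the title (same hypothetical a, blocks and threads as above):

cudaSetDevice(1);
cudaMalloc(&a, bytes);

#pragma omp parallel
{
  // every host thread starts on device 0, so set the intended device
  // before issuing any CUDA calls from this thread
  cudaSetDevice(1);
  kernel<<<blocks,threads>>>(a);
}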


CUDA Pro Tip: Optimize for Pointer Aliasing

Often cited as the main reason that naïve C/C++ code cannot match FORTRAN performance, pointer aliasing is an important topic to understand when considering optimizations for your C/C++ code. In this tip I will describe what pointer aliasing is and a simple way to alter your code so that it does not harm your application performance.

What is pointer aliasing?

Two pointers alias if the memory to which they point overlaps. When a compiler can’t determine whether pointers alias, it has to assume that they do. The following simple function shows why this is potentially harmful to performance:

void example1(float *a, float *b, float *c, int i) {
  a[i] = a[i] + c[i];
  b[i] = b[i] + c[i];
}

At first glance it might seem that this function needs to perform three load operations from memory: one for a[i], one for b[i] and one for c[i]. This is incorrect because it assumes that c[i] can be reused once it is loaded. Consider the case where a and c point to the same address. In this case the first line modifies the value c[i] when writing to a[i]. Therefore the compiler must generate code to reload c[i] on the second line, in case it has been modified.

Because the compiler must conservatively assume the pointers alias, it will compile the above code inefficiently, even if the programmer knows that the pointers never alias.

What can I do about aliasing?

Fortunately, almost all C/C++ compilers offer a way for the programmer to give the compiler information about pointer aliasing.
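
The rest of the post is behind the cut, but the standard mechanism is the __restrict__ qualifier (the C++ spelling of C99's restrict, accepted by nvcc, GCC, Clang and MSVC), with which the programmer promises that the pointers never alias, so the compiler may keep c[i] in a register:

void example1(float * __restrict__ a,
              float * __restrict__ b,
              const float * __restrict__ c, int i) {
  a[i] = a[i] + c[i];   // writing a[i] cannot change c[i], since they don't alias
  b[i] = b[i] + c[i];   // c[i] can be reused without reloading
}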