Voting and Shuffling to Optimize Atomic Operations

2iSome years ago I started work on my first CUDA implementation of the Multiparticle Collision Dynamics (MPC) algorithm, a particle-in-cell code used to simulate hydrodynamic interactions between solvents and solutes. As part of this algorithm, a number of particle parameters are summed to calculate certain cell parameters. This was in the days of the Tesla GPU architecture (such as GT200 GPUs, Compute Capability 1.x), which had poor atomic operation performance. A linked list approach I developed worked well on Tesla and Fermi as an alternative to atomic adds but performed poorly on Kepler GPUs. However, atomic operations are much faster on the Kepler and Maxwell architectures, so it makes sense to use atomic adds.

These types of summations are not limited to MPC or particle-in-cell codes, but, to some extent, occur whenever data elements are aggregated by key. For data elements sorted and combined by key with a large number of possible values, pre-combining elements with the same key at warp level can lead to a significant speed-up. In this post, I will describe algorithms for speeding up your summations (or similar aggregations) for problems with a large number of keys where there is a reasonable correlation between the thread index and the key. This is usually the case for elements that are at least partially sorted. Unfortunately, this argument works in both directions: these algorithms are not for you if your number of keys is small or your distribution of keys is random.  To clarify: by a “large” number of keys I mean more than could be handled if all bins were put into shared memory.

Note that this technique is related to a previously posted technique called warp-aggregated atomics by Andrey Adinetz, and also to the post Fast Histograms Using Shared Atomics on Maxwell by Nikolay Sakharnykh. The main difference here is that we are aggregating many groups, each designated by a key (to compute a histogram, for example). So you could consider this technique “warp-aggregated atomic reduction by key”. Continue reading


GPU Pro Tip: Fast Histograms Using Shared Atomics on Maxwell

Histograms are an important data representation with many applications in computer vision, data analytics and medical imaging. A histogram is a graphical representation of the data distribution across predefined bins. The input data set and the number of bins can vary greatly depending on the domain, so let’s focus on one of the most common use cases: an image histogram using 256 bins for each color channel. Even though we’ll use a specific problem setup the same algorithms can benefit other computational domains as well.

A basic serial image histogram computation is relatively simple. For each pixel of the image and for each RGB color channel we find a corresponding integer bin from 0 to 255 and increment its value. Atomic operations are a natural way of implementing histograms on parallel architectures. Depending on the input distribution, some bins will be used much more than others, so it is necessary to support efficient accumulation of the values across the full memory hierarchy. This is similar to reduction and scan operations, but the main challenge with histograms is that the output location for each element is not known prior to reading its value. Therefore, it is impossible to create a generic parallel accumulation scheme that completely avoids collisions. Histograms are now much easier to handle on GPU architectures thanks to the improved atomics performance in Kepler and native support of shared memory atomics in Maxwell.

histogram algorithm
Figure 1: The two-phase parallel histogram algorithm.

Our histogram implementation has two phases and two corresponding CUDA C++ kernels, as Figure 1 shows. In the first phase each CUDA thread block processes a region of the image and accumulates a corresponding local histogram, storing the local histogram in global memory at the end of the phase. The second kernel accumulates all per-block histograms into the final histogram stored in global memory. The work separation between blocks in the first phase reduces contention when accumulating values into the same bin. Continue reading


CUDA Pro Tip: Optimized Filtering with Warp-Aggregated Atomics

In this post, I’ll introduce warp-aggregated atomics, a useful technique to improve performance when many threads atomically add to a single counter. In warp aggregation, the threads of a warp first compute a total increment among themselves, and then elect a single thread to atomically add the increment to a global counter. This aggregation reduces the number of atomics performed by up to the number of threads in a warp (up to 32x on current GPUs), and can dramatically improve performance. Moreover, in many typical cases, you can implement warp aggregation as a drop-in replacement for standard atomic operations, so it is useful as a simple way to improve performance of complex applications.

Problem: Filtering by a Predicate

Let’s consider the following problem, which we call filtering: we have a source array, src, containing n elements, and a predicate, and we need to copy all elements of src satisfying the predicate into the destination array, dst. For the sake of simplicity, assume that dst has length of at least n and that the order of elements in the dst array does not matter. For our example, we assume that the array elements are integers, and the predicate is true if and only if the element is positive. Here is a sample CPU implementation of filtering.

int filter(int *dst, const int *src, int n) {
  int nres = 0;
  for (int i = 0; i < n; i++)
    if (src[i] > 0)
      dst[nres++] = src[i];
  // return the number of elements copied
  return nres;

Filtering, also known as stream compaction, is a common operation, and it is a part of the standard libraries of many programming languages, where it goes under a variety of names, including grep, copy_if, select, FindAll and so on. It is also very often implemented simply as a loop, as it may be very tightly integrated with the surrounding code. Continue reading


Faster Parallel Reductions on Kepler

Parallel reduction is a common building block for many parallel algorithms. A presentation from 2007 by Mark Harris provided a detailed strategy for implementing parallel reductions on GPUs, but this 6-year old document bears updating. In this post I will show you some features of the Kepler GPU architecture which make reductions even faster: the shuffle (SHFL) instruction and fast device memory atomic operations.

The source code for this post is available on Github.

Shuffle On Down

Efficient parallel reductions exchange data between threads within the same thread block. On earlier hardware this meant using shared memory, which involves writing data to shared memory, synchronizing, and then reading the data back from shared memory. Kepler’s shuffle instruction (SHFL) enables a thread to directly read a register from another thread in the same warp (32 threads). This allows threads in a warp to collectively exchange or broadcast data. As described in the post “Do the Kepler Shuffle”, there are four shuffle intrinsics: __shlf(), __shfl_down(), __shfl_up(), and __shfl_xor(), but in this post we only use __shfl_down(), defined as follows: (You can find a complete description of the other shuffle functions in the CUDA C Programming Guide.)

int __shfl_down(int var, unsigned int delta, int width=warpSize);

__shfl_down() calculates a source lane ID by adding delta to the caller’s lane ID (the lane ID is a thread’s index within its warp, from 0 to 31). The value of var held by the resulting lane ID is returned: this has the effect of shifting var down the warp by delta lanes. If the source lane ID is out of range or the source thread has exited, the calling thread’s own var is returned. The ID number of the source lane will not wrap around the value of width and so the upper delta lanes will remain unchanged. Note that width must be one of (2, 4, 8, 16, 32). For brevity, the diagrams that follow show only 8 threads in a warp even though the warp size of all current CUDA GPUs is 32.

As an example, Figure 1 shows the effect of the following two lines of code, where we can see that values are shifted down by 2 threads.

int i = threadIdx.x % 32;
int j = __shfl_down(i, 2, 8);
Figure 1: The shuffle down instruction.
Figure 1: The shuffle down instruction.

There are three main advantages to using shuffle instead of shared memory: Continue reading

Thinking Parallel, Part III: Tree Construction on the GPU

In part II of this series, we looked at hierarchical tree traversal as a means of quickly identifying pairs of potentially colliding 3D objects and we demonstrated how optimizing for low divergence can result in substantial performance gains on massively parallel processors. Having a fast traversal algorithm is not very useful, though, unless we also have a tree to go with it. In this part, we will close the circle by looking at tree building; specifically, parallel bounding volume hierarchy (BVH) construction. We will also see an example of an algorithmic optimization that would be completely pointless on a single-core processor, but leads to substantial gains in a parallel setting.

There are many use cases for BVHs, and also many ways of constructing them. In our case, construction speed is of the essence. In a physics simulation, objects keep moving from one time step to the next, so we will need a different BVH for each step. Furthermore, we know that we are going to spend only about 0.25 milliseconds in traversing the BVH, so it makes little sense to spend much more on constructing it. One well-known approach for handling dynamic scenes is to essentially recycle the same BVH over and over. The basic idea is to only recalculate the bounding boxes of the nodes according to the new object locations while keeping the hierarchical structure of nodes the same. It is also possible to make small incremental modifications to improve the node structure around objects that have moved the most. However, the main problem plaguing these algorithms is that the tree can deteriorate in unpredictable ways over time, which can result in arbitrarily bad traversal performance in the worst case. To ensure predictable worst-case behavior, we instead choose to build a new tree from scratch every time step. Let’s look at how.

Exploiting the Z-Order Curve

The most promising current parallel BVH construction approach is to use a so-called linear BVH (LBVH). The idea is to simplify the problem by first choosing the order in which the leaf nodes (each corresponding to one object) appear in the tree, and then generating the internal nodes in a way that respects this order. We generally want objects that located close to each other in 3D space to also reside nearby in the hierarchy, so a reasonable choice is to sort them along a space-filling curve. We will use the Z-order curve for simplicity. Continue reading

Thinking Parallel, Part II: Tree Traversal on the GPU

In the first part of this series, we looked at collision detection on the GPU and discussed two commonly used algorithms that find potentially colliding pairs in a set of 3D objects using their axis-aligned bounding boxes (AABBs). Each of the two algorithms has its weaknesses: sort and sweep suffers from high execution divergence, while uniform grid relies on too many simplifying assumptions that limit its applicability in practice.

In this part we will turn our attention to a more sophisticated approach, hierarchical tree traversal, that avoids these issues to a large extent. In the process, we will further explore the role of divergence in parallel programming, and show a couple of practical examples of how to improve it.

Bounding Volume Hierarchy

We will build our approach around a bounding volume hierarchy (BVH), which is a commonly used acceleration structure in ray tracing (for example). A bounding volume hierarchy is essentially a hierarchical grouping of 3D objects, where each group is associated with a conservative bounding box.

Continue reading

Thinking Parallel, Part I: Collision Detection on the GPU

This series of posts aims to highlight some of the main differences between conventional programming and parallel programming on the algorithmic level, using broad-phase collision detection as an example. The first part will give some background, discuss two commonly used approaches, and introduce the concept of divergence. The second part will switch gears to hierarchical tree traversal in order to show how a good single-core algorithm can turn out to be a poor choice in a parallel setting, and vice versa. The third and final part will discuss parallel tree construction, introduce the concept of occupancy, and present a recently published algorithm that has specifically been designed with massive parallelism in mind.

Why Go Parallel?

The computing world is changing. In the past, Moore’s law meant that the performance of integrated circuits would roughly double every two years, and that you could expect any program to automatically run faster on newer processors. However, ever since processor architectures hit the Power Wall around 2002, opportunities for improving the raw performance of individual processor cores have become very limited. Today, Moore’s law no longer means you get faster cores—it means you get more of them. As a result, programs will not get any faster unless they can effectively utilize the ever-increasing number of cores.

Out of the current consumer-level processors, GPUs represent one extreme of this development. NVIDIA GeForce GTX 480, for example, can execute 23,040 threads in parallel, and in practice requires at least 15,000 threads to reach full performance. The benefit of this design point is that individual threads are very lightweight, but together they can achieve extremely high instruction throughput.

One might argue that GPUs are somewhat esoteric processors that are only interesting to scientists and performance enthusiasts working on specialized applications. While this may be true to some extent, the general direction towards more and more parallelism seems inevitable. Learning to write efficient GPU programs not only helps you get a substantial performance boost, but it also highlights some of the fundamental algorithmic considerations that I believe will eventually become relevant for all types of computing. Continue reading

Expressive Algorithmic Programming with Thrust

Thrust is a parallel algorithms library which resembles the C++ Standard Template Library (STL). Thrust’s High-Level interface greatly enhances programmer Productivity while enabling performance portability between GPUs and multicore CPUs. Interoperability with established technologies (such as CUDA, TBB, and OpenMP) facilitates integration with existing software. Develop High-Performance applications rapidly with Thrust!

This excerpt from the Thrust home page perfectly summarizes the benefits of the Thrust library. Thrust enables expressive algorithmic programming via a vocabulary of parallel building blocks that let you rapidly develop fast, portable parallel algorithms. If you are a C++ programmer, and especially if you use template libraries like the STL and Boost C++ libraries, then you will find Thrust familiar. Like the STL, Thrust helps you focus on algorithms, rather than on platform-specific implementation details. At the same time, Thrust’s modular design allows low-level customization and interoperation with custom platform-specific code such as CUDA kernels and libraries.

Thrust is High-Level

As described in the article “Thrust, a Productivity-Oriented Library for CUDA”, Thrust aims to solve two types of problems: problems that can be “implemented efficiently without a detailed mapping to the target architecture”, and problems that don’t merit or won’t receive (for whatever reason) significant optimization attention from the programmer. High-level primitives make it easier to capture programmer intent; developers describe what to compute, without dictating how to compute it. This allows the library to make informed decisions about how to implement the intended computation.

Thrust provides an STL-style vector container (with host_vector and device_vector implementations), and a suite of high-level algorithms including searchingsortingcopyingmergingtransformingreordering,reducingprefix sums, and set operations. Here is an oft-repeated complete example program from the Thrust home page, which generates random numbers serially and then transfers them to the GPU where they are sorted.

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <algorithm>
#include <cstdlib>

int main(void)
  // generate 32M random numbers serially
  thrust::host_vector h_vec(32 << 20);
  std::generate(h_vec.begin(), h_vec.end(), rand);

  // transfer data to the device
  thrust::device_vector d_vec = h_vec;

  // sort data on the device
  thrust::sort(d_vec.begin(), d_vec.end());

  // transfer data back to host
  thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());

  return 0;

Continue reading