Parallel reduction is a common building block for many parallel algorithms. A presentation from 2007 by Mark Harris provided a detailed strategy for implementing parallel reductions on GPUs, but this 6-year old document bears updating. In this post I will show you some features of the Kepler GPU architecture which make reductions even faster: the shuffle (SHFL) instruction and fast device memory atomic operations.
Shuffle On Down
Efficient parallel reductions exchange data between threads within the same thread block. On earlier hardware this meant using shared memory, which involves writing data to shared memory, synchronizing, and then reading the data back from shared memory. Kepler’s shuffle instruction (SHFL) enables a thread to directly read a register from another thread in the same warp (32 threads). This allows threads in a warp to collectively exchange or broadcast data. As described in the post “Do the Kepler Shuffle”, there are four shuffle intrinsics:
__shfl_xor(), but in this post we only use
__shfl_down(), defined as follows: (You can find a complete description of the other shuffle functions in the CUDA C Programming Guide.)
int __shfl_down(int var, unsigned int delta, int width=warpSize);
__shfl_down() calculates a source lane ID by adding delta to the caller’s lane ID (the lane ID is a thread’s index within its warp, from 0 to 31). The value of var held by the resulting lane ID is returned: this has the effect of shifting var down the warp by delta lanes. If the source lane ID is out of range or the source thread has exited, the calling thread’s own
var is returned. The ID number of the source lane will not wrap around the value of width and so the upper delta lanes will remain unchanged. Note that
width must be one of (2, 4, 8, 16, 32). For brevity, the diagrams that follow show only 8 threads in a warp even though the warp size of all current CUDA GPUs is 32.
As an example, Figure 1 shows the effect of the following two lines of code, where we can see that values are shifted down by 2 threads.
int i = threadIdx.x % 32; int j = __shfl_down(i, 2, 8);
There are three main advantages to using shuffle instead of shared memory: Continue reading