This post concludes an introductory series on CUDA Dynamic Parallelism. In my first post, I introduced Dynamic Parallelism by using it to compute images of the Mandelbrot set using recursive subdivision, resulting in large increases in performance and efficiency. The second post is an in-depth tutorial on the ins and outs of programming with Dynamic Parallelism, including synchronization, streams, memory consistency, and limits. In this post, I finish the series with a case study on an online track reconstruction algorithm for the high-energy physics PANDA experimentpart of the (Facility for Antiproton and Ion Research in Europe (FAIR)). The PANDA work was carried out in the scope of the NVIDIA Application Lab at Jülich.
Inside the PANDA experiment, accelerated antiprotons will collide with protons, forming intermediate and unstable particles (mesons, baryons etc.), which will decay in cascades into stable particles, like electrons and photons. The unstable particles are of particular interest for PANDA, as they give insight into the processes governing this physics regime (QCD). Reconstructing all involved constituent particles of an event lets the physicists form a picture of the process, eventually confirming established physics theories, probing new ones and potentially finding exciting and unexpected results. Continue reading →
This post is the second in a series on CUDA Dynamic Parallelism. In my first post, I introduced Dynamic Parallelism by using it to compute images of the Mandelbrot set using recursive subdivision, resulting in large increases in performance and efficiency. This post is an in-depth tutorial on the ins and outs of programming with Dynamic Parallelism, including synchronization, streams, memory consistency, and limits. My next post will finish the series with a case study on an online track reconstruction algorithm for the high-energy physics PANDAexperiment (Facility for Antiproton and Ion Research in Europe (FAIR)).
Grid Nesting and Synchronization
In the CUDA programming model, a group of blocks of threads that are running a kernel is called a grid. In CUDA Dynamic Parallelism, a parent grid launches kernels called child grids. A child grid inherits from the parent grid certain attributes and limits, such as the L1 cache / shared memory configuration and stack size. Note that every thread that encounters a kernel launch executes it. Therefore, if the parent grid has 128 blocks with 64 threads each, and there is no control flow around a child kernel launch, then the grid will perform a total of 8192 kernel launches. If you want a kernel to only launch one child grid per thread block, you should launch the kernel from a single thread of each block as in the following code.
Early CUDA programs had to conform to a flat, bulk parallel programming model. Programs had to perform a sequence of kernel launches, and for best performance each kernel had to expose enough parallelism to efficiently use the GPU. For applications consisting of “parallel for” loops the bulk parallel model is not too limiting, but some parallel patterns—such as nested parallelism—cannot be expressed so easily. Nested parallelism arises naturally in many applications, such as those using adaptive grids, which are often used in real-world applications to reduce computational complexity while capturing the relevant level of detail. Flat, bulk parallel applications have to use either a fine grid, and do unwanted computations, or use a coarse grid and lose finer details.
CUDA 5.0 introduced Dynamic Parallelism, which makes it possible to launch kernels from threads running on the device; threads can launch more threads. An application can launch a coarse-grained kernel which in turn launches finer-grained kernels to do work where needed. This avoids unwanted computations while capturing all interesting details, as Figure 1 shows.
Dynamic parallelism is generally useful for problems where nested parallelism cannot be avoided. This includes, but is not limited to, the following classes of algorithms:
algorithms using hierarchical data structures, such as adaptive grids;
algorithms using recursion, where each level of recursion has parallelism, such as quicksort;
algorithms where work is naturally split into independent batches, where each batch involves complex parallel processing but cannot fully use a single GPU.
This post introduces Dynamic Parallelism by example using a fast hierarchical algorithm for computing images of the Mandelbrot set. This is the first of a three part series on CUDA Dynamic Parallelism:
Adaptive Parallel Computation – Dynamic Parallelism overview and example (this post);
API and Principles – Advanced topics in Dynamic Parallelism, including device-side streams and synchronization;