
CUDA Dynamic Parallelism API and Principles

This post is the second in a series on CUDA Dynamic Parallelism. In my first post, I introduced Dynamic Parallelism by using it to compute images of the Mandelbrot set using recursive subdivision, resulting in large increases in performance and efficiency. This post is an in-depth tutorial on the ins and outs of programming with Dynamic Parallelism, including synchronization, streams, memory consistency, and limits. My next post will finish the series with a case study on an online track reconstruction algorithm for the high-energy physics PANDA experiment at the Facility for Antiproton and Ion Research in Europe (FAIR).

Grid Nesting and Synchronization

In the CUDA programming model, a group of blocks of threads that are running a kernel is called a grid. In CUDA Dynamic Parallelism, a parent grid launches kernels called child grids. A child grid inherits from the parent grid certain attributes and limits, such as the L1 cache / shared memory configuration and stack size. Note that every thread that encounters a kernel launch executes it. Therefore, if the parent grid has 128 blocks with 64 threads each, and there is no control flow around a child kernel launch, then the grid will perform a total of 8192 kernel launches. If you want a kernel to only launch one child grid per thread block, you should launch the kernel from a single thread of each block as in the following code.

// n is the problem size and bs the child block size; only thread 0
// of each block launches a child grid
if(threadIdx.x == 0) {
  child_k <<< (n + bs - 1) / bs, bs >>> ();
}


Grids launched with dynamic parallelism are fully nested. This means that child grids always complete before the parent grids that launch them, even if there is no explicit synchronization, as shown in Figure 1.

If the parent kernel needs results computed by the child kernel to do its own work, it must ensure that the child grid has finished execution before continuing by explicitly synchronizing using cudaDeviceSynchronize(). This function waits for completion of all grids previously launched by the thread block from which it has been called. Because of nesting, it also ensures that any descendants of grids launched by the thread block have completed. cudaDeviceSynchronize() returns any error code that has occurred in any of those kernels.

Note that when a thread calls cudaDeviceSynchronize(), it is not aware of which kernel launches have already been executed by other threads in the block. Therefore, if true block-level synchronization is desired, you should first ensure that all child grids have been queued by calling __syncthreads() before calling cudaDeviceSynchronize(). Similarly, __syncthreads() should be called afterwards, so that other threads continue execution only after synchronization on the child grids has been performed. The following code illustrates this.

__device__ void threadBlockDeviceSynchronize(void) {
  __syncthreads();           // make sure all threads have queued their child grids
  if(threadIdx.x == 0)
    cudaDeviceSynchronize(); // one thread waits for all grids launched by the block
  __syncthreads();           // no thread proceeds until the wait has completed
}

In general, calling cudaDeviceSynchronize() is expensive because it can cause the currently running block to be paused and swapped to device memory. So call it only when necessary; in particular, cudaDeviceSynchronize() should not be called at exit from a parent kernel, as implicit synchronization is performed anyway.
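For example, here is a minimal sketch (with hypothetical kernels) of a parent that does not consume its children's results, so it can simply rely on the implicit synchronization at grid exit instead of calling cudaDeviceSynchronize().

__global__ void child_k(int *data) {
  data[threadIdx.x] *= 2;  // transform the parent's data (a global-memory array of >= 256 elements)
}

__global__ void parent_k(int *data) {
  // ... produce data in global memory ...
  if(threadIdx.x == 0)
    child_k <<< 1, 256 >>> (data);
  // No cudaDeviceSynchronize() at exit: the parent never reads the child's
  // results, and nesting guarantees the child finishes before the parent
  // grid is considered complete.
}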

Memory Consistency

A parent grid often relies on a child grid reading from and writing to global memory. To make this possible, the CUDA Device Runtime guarantees that parent and child grids have a fully consistent view of global memory (and zero-copy host memory) when the child starts and ends, as shown in Figure 2.

This means that if the parent writes to a location, and then launches a child grid, the child is guaranteed to see the value actually written by the parent. Similarly, if the child writes a memory location, and the parent performs synchronization, the parent is guaranteed to see the value written by the child. This also means that if several child grids are executed sequentially (for example in the same stream), then any writes performed by earlier child grids are seen by child grids started later, even if no synchronization has occurred between them.
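As a minimal sketch of the second guarantee (the kernel names and the single result value are illustrative), a parent can safely read a value written by its child once it has synchronized.

__device__ int result;

__global__ void compute_child(void) {
  result = 42;                            // child writes to global memory
}

__global__ void consumer_parent(void) {
  if(threadIdx.x == 0) {
    compute_child <<< 1, 1 >>> ();
    cudaDeviceSynchronize();              // wait for the child grid to finish
    printf("child wrote %d\n", result);   // guaranteed to print 42
  }
}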

Note that the view of global memory is not consistent at the moment the kernel launch construct is executed. That means that in the following code example, it is not defined whether the child kernel reads and prints the value 1 or 2. To avoid race conditions, memory that the child can read should not be written by the parent between the kernel launch and explicit synchronization.

__device__ int v = 0;

__global__ void child_k(void) {
  printf("v = %d\n", v);
}

__global__ void parent_k(void) {
  v = 1;
  child_k <<< 1, 1 >>> ();
  v = 2; // RACE CONDITION
  cudaDeviceSynchronize();
}

Passing Pointers to Child Grids

The following summarizes the limitations on the kinds of pointers that can be passed to child kernels.

Can be passed:
  • global memory (incl. __device__ variables and malloc’ed memory)
  • zero-copy host memory
  • constant memory (inherited and not writeable)

Cannot be passed:
  • shared memory (__shared__ variables)
  • local memory (incl. stack variables)

The results of dereferencing a pointer in a child grid that cannot be legally passed to it are undefined. In the following example, the first version of parent_k (which passes a pointer to a local variable) is illegal, while the second (which passes a pointer to a __device__ variable) is OK.

// common to both examples below
__global__ void child_k(int *p) {
  // ...
  *p = res;   // res is computed in the elided code above
  // ...
}

// ILLEGAL: v lives in local memory; its address must not be passed to a child
__global__ void parent_k(void) {
  // ...
  int v = 0;
  child_k <<< 1, 256 >>> (&v);
  // ...
}

// OK: v is a __device__ variable in global memory
__device__ int v;
__global__ void parent_k(void) {
  // ...
  child_k <<< 1, 256 >>> (&v);
  // ...
}

In fact, the illegal version won’t even compile, because the compiler detects that a pointer to local memory is stored in the parameter buffer and reports an error. This can be worked around by wrapping the pointer in a structure; however, the code is then unlikely to work as expected. Because a pointer into the parent’s local memory may happen to be a valid address in the child grid, dereferencing it there can silently corrupt data, leading to subtle, hard-to-detect errors. As of CUDA 5.5, errors of this kind are not caught by memory error checkers such as cuda-memcheck.

Note that passing a pointer to a local variable is a fairly common pattern in CPU programs, but it is not allowed with dynamic parallelism. This technique is a common way to return a value from a function on the CPU, but on the GPU we need another approach. Passing a pointer to a global variable is allowed, but it is hardly useful: there are most likely many child grids, and only one global variable. Other options include:

  • using the device malloc() function: while this works, it may be slow, as the current device allocator is not scalable (however, this can be a good option to start with);
  • using pre-allocated data structures: for some applications, the number of child grids is known in advance, so it is possible to use pre-allocated data structures, one per child grid (see the sketch after this list);
  • writing a custom allocator: this is hard and may lead to poor scalability; however, if the number of child grids is not known in advance, it may be the only solution;
  • using third-party memory allocators: while some are available, e.g. halloc, they haven’t reached production quality so far, and should be used with caution. [Full disclosure: the author wrote halloc.]
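To make the pre-allocated option concrete, here is a minimal sketch (the names child_k, parent_k, and results are illustrative) in which each child grid receives the address of its own slot in a global-memory array allocated on the host with cudaMalloc.

__global__ void child_k(int *result) {
  // ... compute something ...
  *result = 42;   // write into this child's dedicated slot
}

__global__ void parent_k(int *results) {
  // results has one element per child grid launched here; each parent
  // thread hands its child the address of a distinct slot
  int child = blockIdx.x * blockDim.x + threadIdx.x;
  child_k <<< 1, 1 >>> (&results[child]);
}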

Device Streams and Events

By default, grids launched within a thread block are executed sequentially: the next grid starts executing only after the previous one has finished. This happens even if grids are launched by different threads within the block. Often, however, more concurrency is desired; as with host-side kernel launches, we can use CUDA streams to achieve this.

All streams created on the device are non-blocking; that is, they do not support implicit synchronization with the default NULL stream. Therefore, what follows is the only way to create a stream in device code.

cudaStream_t s;
cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);

Kernels launched in different streams can execute concurrently. Note that streams created in different thread blocks, including the default NULL streams, are considered different. However, concurrent execution is never guaranteed: it is primarily intended for better utilization of GPU resources. Any program that depends on specific grids or thread blocks executing concurrently is ill-formed.
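For instance, in the following sketch (with a hypothetical child_k), each thread launches its child grid into its own non-blocking stream, so the children are at least eligible to run concurrently; without the streams, all children launched within the block would execute sequentially.

__global__ void parent_k(void) {
  // each thread gets its own stream, so the child grids below may
  // (but are not guaranteed to) execute concurrently
  cudaStream_t s;
  cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
  child_k <<< 1, 64, 0, s >>> ();
  cudaStreamDestroy(s);  // returns immediately; resources are freed once the child finishes
}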

Once created, a device stream can be used by any thread within the same thread block. However, it cannot be used after the thread block finishes executing, in other thread blocks, or on the host. Similarly, streams created on the host cannot be used on the device.

There is currently no per-stream synchronization available in the CUDA Device Runtime; the only way to synchronize today is to call cudaDeviceSynchronize(), which waits for all work previously launched by the thread block.

As on the host, a stream is destroyed by calling the cudaStreamDestroy(stream) function. If work is still being executed in the stream, the function returns immediately, and the resources associated with the stream are released automatically once all work in it has finished. Therefore, it is not necessary to synchronize explicitly just to destroy a stream.

While on-GPU events are also supported, their only supported use is ordering between different streams with cudaStreamWaitEvent(). Using events for timing or synchronizing on an event is currently not supported.
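Here is a minimal sketch of that one supported use (producer_k and consumer_k are hypothetical kernels): an event recorded in one device stream makes another stream wait for it. Note that device-side events must be created with the cudaEventDisableTiming flag, since timing is not supported.

__global__ void producer_k(void) { /* ... produce data in global memory ... */ }
__global__ void consumer_k(void) { /* ... consume the producer's data ... */ }

__global__ void order_k(void) {
  if(threadIdx.x == 0) {
    cudaStream_t s1, s2;
    cudaStreamCreateWithFlags(&s1, cudaStreamNonBlocking);
    cudaStreamCreateWithFlags(&s2, cudaStreamNonBlocking);

    cudaEvent_t e;
    cudaEventCreateWithFlags(&e, cudaEventDisableTiming);

    producer_k <<< 1, 64, 0, s1 >>> ();
    cudaEventRecord(e, s1);              // e completes when producer_k has finished
    cudaStreamWaitEvent(s2, e, 0);       // s2 waits for e ...
    consumer_k <<< 1, 64, 0, s2 >>> ();  // ... so consumer_k runs after producer_k

    cudaEventDestroy(e);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
  }
}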

Recursion Depth and Device Limits

Recursive programs usually have an associated recursion depth. For dynamic parallelism, there are two concepts of depth involved:

  • nesting depth, which is the deepest nesting level of recursive grid launches, with kernels launched from the host having depth 0;
  • synchronization depth, which is the deepest nesting level at which cudaDeviceSynchronize() is called.

Usually, synchronization depth is one less than nesting depth, but if not every level synchronizes explicitly, it can be significantly lower. Consider the following example code.

__global__ void recursive_k(int depth) {
  // up to depth 6
  if(depth == 6)
    return;
  // launch 1 kernel (2 if depth == 3)
  if(threadIdx.x == 0) {
    recursive_k <<< 1, 1 >>> (depth + 1);
    if(depth == 3) {
      cudaDeviceSynchronize();
      recursive_k <<< 1, 1 >>> (depth + 1);
    }
  }
}

// launch from host
recursive_k <<< 1, 1 >>> (0);

Here each kernel is launched with its nesting level passed in the parameter depth. Although the nesting depth is 6, the maximum synchronization depth is only 3, because that is the deepest level that synchronizes explicitly.

For each level up to the maximum synchronization depth, it is possible that the parent block is swapped out and replaced with a child grid, which requires storing the parent grid’s context data. On GK110-class GPUs (GeForce GTX 780 Ti, Tesla K20, etc.), up to 150 MiB of memory may be reserved per nesting level, depending on the maximum number of threads available on the device (the amount of memory is a function of chip size; smaller chips use less memory). Note that this extra memory is reserved even if it is not actually used.

The current maximum synchronization depth, and thus the amount of memory reserved, is controlled by the cudaLimitDevRuntimeSyncDepth device limit. If cudaDeviceSynchronize() is called deeper than the current maximum synchronization depth, it returns an error and no synchronization happens. Note that in this case, child grids that have already been queued will still be launched, though the end results may be incorrect because of the missed synchronization.

By default, memory is reserved for two levels of synchronization. This means that to make the example above work, the maximum synchronization depth needs to be increased. Note that in the CUDA runtime, the cudaLimitDevRuntimeSyncDepth limit is actually the number of levels for which storage should be reserved, including the level of kernels launched from the host. Therefore, it should be set to the maximum synchronization depth plus 1 by calling cudaDeviceSetLimit(cudaLimitDevRuntimeSyncDepth, 4).
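Applied to the recursive example above, the host would set the limit before the first launch, roughly as follows (a sketch; error checking omitted).

// host code: recursive_k synchronizes at depth 3 at the deepest,
// so reserve storage for 3 + 1 = 4 levels
cudaDeviceSetLimit(cudaLimitDevRuntimeSyncDepth, 4);
recursive_k <<< 1, 1 >>> (0);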

There is also a hardware limit on maximum nesting depth, and thus synchronization depth; as of Compute Capability 3.5, the hardware limit on depth is 24 levels.

Another important limit is the number of pending child grids, that is, grids that are running, suspended, or waiting in the launch queue. The pending launch buffer is the data structure used to maintain the launch queue and to track currently running kernels. If a kernel launch is executed when this buffer is full, the behavior depends on the CUDA version. With CUDA 5, the grid is simply discarded and never launched; a subsequent call to cudaGetLastError() returns cudaErrorLaunchPendingCountExceeded. By default, space is reserved for 2048 pending child grids; this can be extended by setting the appropriate device limit, as in the following code.

cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 32768);
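To detect the CUDA 5 failure mode described above from device code, a sketch like the following (with a hypothetical child_k) checks the error state right after the launch construct.

__global__ void child_k(void) { /* ... */ }

__global__ void parent_k(void) {
  child_k <<< 1, 64 >>> ();
  // with CUDA 5, a full pending launch buffer causes the grid to be
  // silently discarded; check for that here
  cudaError_t err = cudaGetLastError();
  if(err == cudaErrorLaunchPendingCountExceeded) {
    // handle the failed launch, e.g. fall back to doing the work inline
  }
}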

With CUDA 6 and later, a variable-size virtualized pool was added alongside the fixed-size pool for the pending launch buffer; the cudaLimitDevRuntimePendingLaunchCount limit now refers to the size of the fixed-size pool. The runtime first tries to add a newly launched grid to the fixed-size pool and, if that is full, falls back to the virtualized pool. Grids are therefore queued successfully, but the cost of using the virtualized pool is higher than that of the fixed-size pool. For the code below, Figure 3 demonstrates the performance penalty associated with using the virtualized pool. The results were obtained on a Tesla K20X with CUDA 6.0 and NVIDIA driver version 331.49. A total of about 300,000 kernels are launched during the experiment, and a wrongly sized fixed pool leads to up to 20x slower execution.

// kernel code
__global__ void pending_k(int depth) {
  if(depth == 6)
    return;
  cudaStream_t s;
  cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
  pending_k <<< 1, 8, 0, s >>> (depth + 1);
  cudaStreamDestroy(s);
}  // pending_k

// ...
// launch from host
pending_k <<< 1, 8 >>> (0);

The example above may seem a bit contrived (the size of the fixed-size pool was artificially reduced to 64). Nevertheless, it demonstrates that it’s always a good idea to be aware of how many grids are pending. Due to the massively parallel nature of GPUs and the absence of any throttling mechanism, it is easy to launch a tremendous number of kernels even with a very low recursion depth. Such a “fork bomb” will consume lots of memory, and may impact the performance and even the correctness of your application.

That wraps up my in-depth overview of CUDA Dynamic Parallelism principles. Stay tuned for my next post, in which I’ll share how Dynamic Parallelism made it easier and more efficient to implement Triplet Finder, an online track reconstruction algorithm for the high-energy physics PANDA experiment which is part of the Facility for Antiproton and Ion Research in Europe (FAIR).
