
Separate Compilation and Linking of CUDA C++ Device Code

This post was co-written by Tony Scudiero and Mike Murphy.

Managing complexity in large programs requires breaking them down into components that are responsible for small, well-defined portions of the overall program. Separate compilation is an integral part of the C and C++ programming languages: it allows portions of a program to be compiled into separate object files and then linked together to form an executable or library. Developing large and complex GPU programs is no different, and starting with CUDA 5.0, separate compilation and linking of device code are important tools in the repertoire of CUDA C/C++ programmers.

In this post, we explore separate compilation and linking of device code and highlight situations where it is helpful. In the process, we’ll walk through a simple example to show how device code linking can let you move existing code to the GPU with minimal changes to your class hierarchy and build infrastructure.

One of the key limitations that device code linking lifts is the need to have all the code for a GPU kernel present when compiling the kernel, including all the device functions that the kernel calls. As C++ programmers, we are used to calling externally defined functions simply by declaring the functions’ prototypes (or including a header that declares them).
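For example, with device code linking enabled, a kernel can call a __device__ function that is only declared in its translation unit. Here is a minimal sketch; the file and function names are hypothetical:

    // kernel.cu: transform() is only declared here; its definition can live
    // in a separate file (say, transform.cu) and is resolved by the device linker.
    extern __device__ float transform(float x);

    __global__ void applyTransform(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] = transform(data[i]);  // call resolved at device link time
    }

Both files must be compiled with relocatable device code enabled (nvcc -dc) so that the device link step can connect the call to its definition.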

Managing Complexity with Separate Compilation

The common approach to organizing and building C++ applications is to define all member functions of a single class in one or more .cpp source files, and compile each .cpp source file into a separate .o object file. Other classes and functions may call these member functions from anywhere in the program by including the class header file; the function implementation is not needed to compile another function that calls it. After compiling all code, the linker connects calls to functions implemented in other files as part of the process of generating the executable.
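As a point of reference, a conventional CPU-only build of such a project might look like the following (file names are purely illustrative): each source file is compiled independently, and the host linker resolves calls between files when producing the executable.

    g++ -c v3.cpp -o v3.o
    g++ -c particle.cpp -o particle.o
    g++ -c main.cpp -o main.o
    g++ v3.o particle.o main.o -o app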

Let’s imagine a very simple example application with two classes: a particle class and a three-dimensional vector class, v3, that it uses. Our main task is moving the particle objects along randomized trajectories. Particle filters and Monte Carlo simulations frequently involve operations of this sort. We’ll use a CUDA C++ kernel in which each thread calls particle::advance() on a particle.
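A sketch of that kernel might look like the following (the exact interfaces of particle and v3 are assumptions for illustration): each thread advances one particle, and particle::advance() can be defined in its own .cu file because the call is resolved at device link time.

    // advanceParticles.cu -- minimal sketch; particle::advance() is assumed to be
    // declared as a __device__ (or __host__ __device__) member in particle.h
    // and defined separately in particle.cu.
    #include "particle.h"

    __global__ void advanceParticles(float dt, particle *pArray, int nParticles)
    {
        int idx = threadIdx.x + blockIdx.x * blockDim.x;
        if (idx < nParticles)
            pArray[idx].advance(dt);  // device code compiled in another file
    }

The build then mirrors the familiar CPU workflow: compile each file with relocatable device code (nvcc -dc main.cu particle.cu v3.cu) and let nvcc perform the device and host link steps (nvcc main.o particle.o v3.o -o app).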


CUDA Pro Tip: Understand Fat Binaries and JIT Caching

As NVIDIA GPUs evolve to support new features, the instruction set architecture naturally changes. Because applications must run on multiple generations of GPUs, the NVIDIA compiler tool chain supports compiling for multiple architectures in the same application executable or library. CUDA also relies on the PTX virtual GPU ISA to provide forward compatibility, so that already deployed applications can run on future GPU architectures. In this post I will give you a basic understanding of CUDA “fat binaries” and compilation for multiple GPU architectures, as well as just-in-time PTX compilation for forward compatibility.

nvcc, the CUDA compiler driver, uses a two-stage compilation model. The first stage compiles source device code to PTX virtual assembly, and the second stage compiles the PTX to binary code for the target architecture. The CUDA driver can execute the second stage at run time, compiling the PTX virtual assembly “just in time” to run it. This JIT compilation can cause a delay at application start-up time (or more accurately, at CUDA context creation time). CUDA uses two approaches to mitigate the start-up overhead of JIT compilation: fat binaries and JIT caching.
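To make the two stages concrete, here are two illustrative nvcc invocations (the architecture numbers are chosen arbitrarily). The first embeds only PTX, so the driver must JIT-compile it for whatever GPU is present; the second also embeds binary code for a specific real architecture, so no JIT is needed on that GPU.

    # PTX only: the driver JIT-compiles for the GPU found at run time.
    nvcc -arch=compute_50 -code=compute_50 -o app main.cu

    # PTX plus SM 5.0 binary code: no JIT is needed on an SM 5.0 GPU.
    nvcc -arch=compute_50 -code=compute_50,sm_50 -o app main.cu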

Fat Binaries

The first approach is to completely avoid the JIT cost by including binary code for one or more architectures in the application binary along with the PTX code. The CUDA runtime looks for code for the present GPU architecture in the binary, and runs it if found. If binary code is not found but PTX is available, the driver JIT-compiles the PTX code. In this way deployed CUDA applications can support new GPUs when they come out.
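An illustrative command line that builds such a fat binary is shown below (the set of architectures is arbitrary); the final -gencode entry embeds PTX so that the application can still run, via JIT compilation, on GPU architectures newer than any listed.

    nvcc -o app main.cu \
         -gencode arch=compute_50,code=sm_50 \
         -gencode arch=compute_60,code=sm_60 \
         -gencode arch=compute_70,code=sm_70 \
         -gencode arch=compute_70,code=compute_70   # PTX for forward compatibility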