This Post was Co-Written by Tony Scudiero and Mike Murphy.
Managing complexity in large programs requires breaking them down into components that are responsible for small, well-defined portions of the overall program. Separate compilation is an integral part of the C and C++ programming languages which allows portions of a program to be compiled into separate objects and then linked together to form an executable or library. Developing large and complex GPU programs is no different, and starting with CUDA 5.0, separate compilation and linking are now important tools in the repertoire of CUDA C/C++ programmers.
In this post, we explore separate compilation and linking of device code and highlight situations where it is helpful. In the process, we’ll walk through a simple example to show how device code linking can let you move existing code to the GPU with minimal changes to your class hierarchy and build infrastructure.
One of the key limitations that device code linking lifts is the need to have all the code for a GPU kernel present when compiling the kernel, including all the device functions that the kernel calls. As C++ programmers, we are used to calling externally defined functions simply by declaring the functions’ prototypes (or including a header that declares them).
Managing Complexity with Separate Compilation
The common approach to organizing and building C++ applications is to define all member functions of a single class in one or more
.cpp source files, and compile each
.cpp source file into a separate
.o object file. Other classes and functions may call these member functions from anywhere in the program by including the class header file; the function implementation is not needed to compile another function that calls it. After compiling all code, the linker connects calls to functions implemented in other files as part of the process of generating the executable.
Let’s imagine a very simple example application which has two classes: a
particle class and a three-dimensional vector class,
v3, that it uses. Our main task is moving the particle objects along randomized trajectories. Particle filters and Monte Carlo simulations frequently involve operations of this sort. We’ll use a CUDA C++ kernel in which each thread calls
particle::advance() on a particle. Continue reading