As NVIDIA GPUs evolve to support new features, the instruction set architecture naturally changes. Because applications must run on multiple generations of GPUs, the NVIDIA compiler tool chain supports compiling for multiple architectures in the same application executable or library. CUDA also relies on the PTX virtual GPU ISA to provide forward compatibility, so that already deployed applications can run on future GPU architectures. In this post I will give you a basic understanding of CUDA “fat binaries” and compilation for multiple GPU architectures, as well as just-in-time PTX compilation for forward compatibility.
nvcc, the CUDA compiler driver, uses a two-stage compilation model. The first stage compiles source device code to PTX virtual assembly, and the second stage compiles the PTX to binary code for the target architecture. The CUDA driver can execute the second stage at run time, compiling the PTX virtual assembly “Just In Time” (JIT) to run it. This JIT compilation can cause a delay at application start-up time (or more accurately, at CUDA context creation time). CUDA uses two approaches to mitigate the start-up overhead of JIT compilation: fat binaries and JIT caching.
The first approach is to completely avoid the JIT cost by including binary code for one or more architectures in the application binary along with PTX code. The CUDA run time looks for code for the present GPU architecture in the binary, and runs it if found. If binary code is not found but PTX is available, then the driver compiles the PTX code. In this way deployed CUDA applications can support new GPUs when they come out.
nvcc enables compilation for multiple architectures using the -arch and -code command line options. For example, this command generates binary code for two Tesla architecture variants (sm_10 and sm_13), plus PTX code (compute_10) for use on next-generation GPUs.
nvcc x.cu -arch=compute_10 -code=compute_10,sm_10,sm_13
nvcc organizes device code into “fat binaries”, which are able to hold multiple translations of the same GPU source code. At run time, the CUDA driver selects the most appropriate translation when it launches the device function. For full details of using nvcc to generate code for multiple architectures and PTX versions, see the document “NVIDIA CUDA Compiler Driver NVCC”.
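The -arch/-code form shown above takes a single virtual architecture. nvcc also accepts repeated -gencode options, which let each binary target name its own virtual architecture. The following is a minimal sketch that only assembles and prints such a command line, so it does not need a CUDA toolkit to run. The file name x.cu comes from the example above; the helper name gencode_flags is hypothetical, and the architecture numbers (20 and 35) are the ones mentioned later in this post, so substitute values your toolkit supports.

```shell
# Hypothetical helper: emit one -gencode option per listed architecture
# (binary code), plus PTX for the last one so newer GPUs can JIT-compile it.
gencode_flags() {
    flags=""
    last=""
    for cc in "$@"; do
        # Binary code for this specific architecture.
        flags="$flags -gencode arch=compute_$cc,code=sm_$cc"
        last=$cc
    done
    # Embed PTX for the last (assumed newest) architecture in the list.
    flags="$flags -gencode arch=compute_$last,code=compute_$last"
    # Strip the leading space before printing.
    echo "${flags# }"
}

# Print the command instead of running it.
NVCC_CMD="nvcc x.cu $(gencode_flags 20 35)"
echo "$NVCC_CMD"
```

Listing the architectures in ascending order matters for this sketch, because the PTX is generated for whichever one comes last.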
Update (05/08/2014): Starting in CUDA 5.5, we can also JIT link separately compiled code from PTX stored in the fat binary.
The second approach to mitigate JIT overhead is to cache the binaries generated by JIT compilation. When the device driver just-in-time compiles PTX code for an application, it automatically caches a copy of the generated binary code to avoid repeating the compilation in later invocations of the application. The cache—referred to as the compute cache—is automatically invalidated when the device driver is upgraded, so that applications can benefit from improvements in the just-in-time compiler built into the device driver.
Environment variables are available to control just-in-time compilation.
Setting CUDA_CACHE_DISABLE to 1 disables caching (no binary code is added to or retrieved from the cache).
CUDA_CACHE_MAXSIZE specifies the size of the compute cache in bytes; the default size is 32 MB and the maximum size is 4 GB; binary codes whose size exceeds the cache size are not cached; older binary codes are evicted from the cache to make room for newer binary codes if needed.
CUDA_CACHE_PATH specifies the directory location of compute cache files; the default values are:
- on Windows,
- on MacOS,
- on Linux,
Setting CUDA_FORCE_PTX_JIT to 1 forces the device driver to ignore any binary code embedded in an application (see Application Compatibility) and to just-in-time compile embedded PTX code instead. If a kernel does not have embedded PTX code, it will fail to load. You can use this environment variable to confirm that an application binary contains PTX code and that just-in-time compilation works as expected to guarantee forward compatibility with future architectures.
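As a rough illustration of how these variables might be set from a shell, the sketch below raises the cache limit for the current session and defines a small wrapper that forces PTX JIT for a single command. The wrapper name run_with_forced_jit and the application name ./my_app are hypothetical placeholders, and the 1 GiB value is only an example that stays below the 4 GB maximum.

```shell
# Raise the compute cache limit to 1 GiB for this shell session.
# The value is given in bytes.
export CUDA_CACHE_MAXSIZE=$((1024 * 1024 * 1024))

# Hypothetical wrapper: run one command with embedded binary code ignored,
# so that its kernels must be JIT-compiled from embedded PTX.
run_with_forced_jit() {
    env CUDA_FORCE_PTX_JIT=1 "$@"
}

# Usage (./my_app stands in for your own CUDA application):
#   run_with_forced_jit ./my_app
```

Scoping CUDA_FORCE_PTX_JIT to one command with env keeps it from leaking into later runs in the same session, where it would add JIT cost without being needed.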
It is helpful to know the above options so you can recognize and avoid problems. Let’s look at two example situations: insufficient JIT cache size and cache stored on a slow network share.
Insufficient JIT Cache Size
Recently I was testing an application that uses the CUDA Data Parallel Primitives library (CUDPP), which is a large library with many CUDA kernels. I had compiled CUDPP using the default settings, which generated binary code for GPUs with SM versions 1.0, 1.3, and 2.0, as well as PTX. Because I was running on a Tesla K20c with SM version 3.5, all the kernels in the library were JIT compiled, taking about 75 seconds at application start-up. Moreover, the large number of kernels required well over the default JIT cache size of 32 MB, so they were not cached, and the application incurred the full JIT cost at every invocation. Because I had the source to the library, I was able to recompile it with support for sm_35, but I could also have increased the value of CUDA_CACHE_MAXSIZE to make sure the code fit in the cache.
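One simple way to spot this symptom is to time two consecutive runs of the application: if the second run starts no faster than the first, the JIT output was probably not served from the cache. The helper below is a minimal sketch with one-second resolution; its name elapsed_seconds and the application name ./my_app are hypothetical, and it assumes a date command that supports the +%s format.

```shell
# Hypothetical helper: print the wall-clock time, in whole seconds,
# that a command takes to run. The command's own output is discarded.
elapsed_seconds() {
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    end=$(date +%s)
    echo $((end - start))
}

# Usage (./my_app stands in for your own CUDA application):
#   elapsed_seconds ./my_app   # first run: may include JIT compilation
#   elapsed_seconds ./my_app   # second run: should be faster if cached
```

This measures the whole run rather than context creation alone, so it is most telling for applications whose start-up cost dominates, as in the 75-second case described above.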
Cache Stored on a Slow Network Share
On Linux, the default location of the CUDA JIT cache is in your home directory. On clusters, it is not uncommon for home directories to be mounted on the compute nodes over a relatively slow file system (for example, Lustre is used for scratch space, but only NFS for the home directory). We have seen cases where this relatively slow connection to the home directory (and thus the JIT cache) resulted in very long application start-up times when the application was not built with code for the right SM version. Even more confusing, start-up time can vary from node to node due to intricacies of the NFS setup.
In this situation, it is best to build the application to avoid JIT entirely or, alternatively, to set CUDA_CACHE_PATH to point to a location on a fast file system.
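Redirecting the cache could look like the sketch below, for example in a job script. It assumes that TMPDIR (or /tmp as a fallback) is on fast node-local storage on your cluster, which you should verify for your site; the directory name is an arbitrary choice.

```shell
# Assumption: TMPDIR (or /tmp) lives on fast, node-local storage.
# Use a per-user directory so users do not share cache files.
CACHE_DIR="${TMPDIR:-/tmp}/${USER:-$(id -un)}-cuda-compute-cache"
mkdir -p "$CACHE_DIR"

# Point the driver's compute cache at the new location.
export CUDA_CACHE_PATH="$CACHE_DIR"
```

With a node-local cache, each node pays the JIT cost once on its own, and the cache is lost if the scheduler cleans the directory between jobs. That is why building for the right SM version is still the preferred fix.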
For more information on the CUDA compilation flow, fat binaries, architecture and PTX versions, and JIT caching, see the CUDA programming guide section on “Compilation with NVCC” and the NVCC documentation.