Developing Portable CUDA C/C++ Code with Hemi

hemi-logoSoftware development is as much about writing code fast as it is about writing fast code, and central to rapid development is software reuse and portability. When building heterogeneous applications, developers must be able to share code between projects, platforms, compilers, and target architectures. Ideally, libraries of domain-specific code should be easily retargetable.

In this post I’ll talk about Hemi, a simple open-source C++ header library that simplifies writing portable CUDA C/C++ code. In the screenshot below, both columns show a simple Black-Scholes code written to be compilable with either NVCC or a standard C++ host compiler, and also runnable on either the CPU or a CUDA GPU. The right column is written using Hemi’s macros and smart heterogeneous Array container class, hemi::Array. Using Hemi, the length and complexity of this code is reduced by half.

Portable (between compilers and devices) CUDA code without (left) and with (right) Hemi.

Portable CUDA C++ code without Hemi (left) and with Hemi (right).

CUDA C++ and the NVIDIA NVCC compiler tool chain provide a number of features designed to make it easier to write portable code, including language-level integration of host and device code and data, declaration specifiers (e.g. __host__ and __device__) and preprocessor definitions (e.g. __CUDACC__). Together, these features enable developers to write code that can be compiled and run on either the host, the device, or both. But as the left column above shows, using them directly can result in complicated code. One cause of this is the code duplication that is required to support multiple target platforms, and another cause is the verbose memory management incurred by heterogeneous memory spaces. Hemi aims to tackle both problems.

Hemi is inspired by real-world CUDA software projects like PhysX and OptiX, which use custom libraries of preprocessor macros and container classes that enable the definition of portable application-specific libraries, classes, and kernels. PhysX, for example, has a comprehensive 3D vector math library that is portable across multiple platforms, including CUDA GPUs, Intel and other CPUs, and game consoles. To make CUDA memory management and transfers robust and simple to implement, PhysX uses a smart generic array class that automatically copies data between the device and host only when necessary. The result is much like the right-hand side of the screenshot above, with a minimum of memory management code and no explicit memory copies.

In this post I’ll describe Hemi in depth, but first I want to cover the CUDA C/C++ language and compiler features on which Hemi is built.

CUDA C++ Language Integration and Portability Features

Host / Device Functions

If you are already programming in CUDA C/C++ then you are familiar with __device__, the declaration specifier that indicates a function that is callable from other device functions and kernel (__global__) functions. CUDA also provides the __host__ declaration specifier for host (CPU) functions, which is the default in the absence of a specifier. Often we need to execute exactly the same code on the CPU and GPU, and in those cases we need to write functions that are callable from either the host or the device. In that case, __host__ and __device__ can be combined, as shown in the following inline function that averages two floats.

__host__ __device__ inline float avgf(float x, float y) { return (x+y)/2.0f; }

When NVCC sees this function, it generates two versions of the code, one for the host and one for the device. Any calls to the function from device code will execute the device version, and any calls from host code will execute the host version. This __host__ __device__ combination is very powerful because it enables large utility code bases to be used across heterogeneous applications, minimizing the work required to port applications. However, other compilers (obviously) don’t recognize these declaration specifiers, so to really write portable code, we need to use the C preprocessor.

CUDA Preprocessor Definitions

At compile time NVCC defines several macros that can be used to selectively enable and disable code based on whether it is being compiled by NVCC, whether it is device code or host code, and based on the architecture version (also called compute capability) it is being compiled for.

Can be used in C/C++/CUDA source files to test whether they are currently being compiled by nvcc.
Can be used in source files to test whether they are being treated as CUDA source files by nvcc.
This architecture identification macro is assigned a three-digit value string xy0 (ending in a literal 0) when compiling device code compute_xy. For example, when compiling device code for compute_20 (or sm_20), __CUDA_ARCH__ will be defined by nvcc to the value 200. This macro can be used in the implementation of device and kernel functions to determine the virtual architecture for which it is currently being compiled. Host code must not depend on this macro, but note that it is not defined when host code is being compiled, which means that it can be used to detect compilation of device code.

The following example combines declaration specifiers and preprocessor macros to write a portable routine for counting the number of bits that are set in a 32-bit word.

#ifdef __CUDACC__
__host__ __device__ 
int countLeadingZeros(unsigned int a)
#if defined(__CUDA_ARCH__)
  return __popc(a);
  // Source:
  a = a - ((a >> 1) & 0x55555555);                    
  a = (a & 0x33333333) + ((a >> 2) & 0x33333333);     
  return ((a + (a >> 4) & 0xF0F0F0F) * 0x1010101) >> 24; 

Here I have defined a function countSetBits that is callable from either host or device code and due to the check for __CUDACC__ wrapping “__host__ __device__”, it is compilable using NVCC or other C/C++ compilers. Whether or not it is compiled with NVCC, it uses arithmetic on the CPU to count the 1 bits. On the device, it uses CUDA’s built-in __popc() intrinsic. If you look in CUDA’s device_functions.h header file, you’ll see that the value of __CUDA_ARCH__ is used to further differentiate; on Fermi and later GPUs (sm20, __CUDA_ARCH__ == 200) __popc() generates a single hardware population count instruction, while on earlier architectures it uses code similar to the host code.

Hemi: Easier Portable Code

As you can see, CUDA makes writing portable code feasible and flexible, but doing so is not particularly simple. Hemi, available on Github, provides just two simple header files (and a few examples) that make the task much easier, with much clearer code. The hemi.h header provides simple macros that are useful for reusing code between CUDA C/C++ and C/C++ written for other platforms (e.g. CPUs). The macros are used to decorate function prototypes and variable declarations so that they can be compiled by either NVCC or a host compiler (for example gcc or cl.exe, the MS Visual Studio compiler). The macros can be used within .cu, .cuh, .cpp, .h, and .inl files to define code that can be compiled either for the host or the device.

Before diving into the features of Hemi, let me draw your attention to the Hemi examples.

  • blackscholes: This is a simple example that performs a Black-Scholes options pricing calculation using code that is entirely shared between host code compiled with any C/C++ compiler (including NVCC) and device code that is compiled with NVCC. When compiled with “nvcc -x cu” (to force CUDA compilation on the .cpp file), this runs on the GPU. When compiled with “nvcc” or “g++” it runs on the host.
  • blackscholes_nohemi: Just like the above, except it doesn’t use Hemi. This is just to demonstrate the complexity that Hemi eliminates.
  • blackscholes_hostdevice: This example demonstrates how to write portable code that can be compiled to run the same code on both the host and device, in a single compile & run. This increase in run-time flexibility has a slight complexity cost, but all of the core computational code is reused.
  • blackscholes_hemiarray: This example is the same as the “blackscholes” example, except that it uses hemi::Array to encapsulate CUDA-specific memory management code, and eliminate all explicit host-device memory copy code.
  • nbody_vec4: This example brings all of Hemi’s features together. It implements a simple all-pairs n-body gravitational force calculation using a 4D vector class called Vec4f, which uses Hemi macros to enable all of the code for the class to be shared between host code compiled by the host compiler and device or host code compiled with NVCC. nbody_vec4 also shares most of the all-pairs gravitational force calculation code between device and host, and demonstrates how optimized device implementations (e.g. using shared memory) can be substituted as needed. Finally, this sample also uses hemi::Array to simplify memory management and data transfers.

Hemi Portable Functions

A typical use for host-device code sharing is commonly used utility functions. For example, here is a portable version of our earlier example function that averages two floats.

HEMI_DEV_CALLABLE_INLINE float avgf(float x, float y) { return (x+y)/2.0f; }

This function can be called either from host code or device code, and can be compiled by either the host compiler or NVCC. The macro definition ensures that when compiled by NVCC, both a host and device version of the function are generated, and a normal inline function is generated when compiled by the host compiler. For another example use, see the CND() function defined in the “blackscholes” example included with Hemi, as well as several other functions used in the examples.

Hemi Portable Classes

The HEMI_DEV_CALLABLE_MEMBER and HEMI_DEV_CALLABLE_INLINE_MEMBER macros can be used to create classes that are reusable between host and device code, by decorating any member function prototype that will be used by both device and host code. Here is an example excerpt of a portable class (a 4D vector type used in the “nbody_vec4″ example).

struct HEMI_ALIGN(16) Vec4f
  float x, y, z, w;

  Vec4f() {};

  Vec4f(float xx, float yy, float zz, float ww) : x(xx), y(yy), z(zz), w(ww) {}

  Vec4f(const Vec4f& v) : x(v.x), y(v.y), z(v.z), w(v.w) {}

  Vec4f& operator=(const Vec4f& v) {
    x = v.x; y = v.y; z = v.z; w = v.w;
    return *this;

  Vec4f operator+(const Vec4f& v) const {
    return Vec4f(x+v.x, y+v.y, z+v.z, w+v.w);

The HEMI_ALIGN macro is used on types that will be passed in arrays or pointers as arguments to CUDA device kernel functions, to ensure proper alignment. HEMI_ALIGN generates correct alignment specifiers for host compilers, too. For details on alignment, see the NVIDIA CUDA C Programming Guide (Section 5.3 in v5.0).


Code in functions declared with HEMI_DEV_CALLABLE_* must be portable. In other words it must compile and run correctly for both the host and the device. If it does not, within the function you can use HEMI_DEV_CODE (which reduces to __CUDA_ARCH__) to define separate code for host and device, as in the following example.

float inverseLength(float softening = 0.0f) const {
  return rsqrtf(lengthSqr() + softening); // use fast GPU intrinsic
  return 1.0f / sqrtf(lengthSqr() + softening);

If you need to write a function only for the device, use the CUDA C __device__ specifier directly.

Note: Non-inline functions and methods

Take care when using the non-inline versions of the declaration specifier macros (HEMI_DEV_CALLABLE and HEMI_DEV_CALLABLE_MEMBER) to avoid multiple definition linker errors due to using these in headers that are included into multiple compilation units. The best way to use HEMI_DEV_CALLABLE is to declare functions using this macro in a header, and define their implementation in a .cu file, and compile it with NVCC. This will generate code for both host and device. The host code will be linked into your library or application and callable from other host code compilation units (.c and .cpp files). Likewise, for HEMI_DEV_CALLABLE_MEMBER, put the class and function declaration in a header, and the member function implementations in a .cu file, compiled by NVCC.

Hemi Portable Kernels

Use HEMI_KERNEL to declare functions that are launchable as CUDA kernels when compiled with NVCC, or callable as C/C++ (host) functions when compiled with the host compiler. HEMI_KERNEL_LAUNCH is a convenience macro that launches a kernel function on the device when compiled with NVCC, or calls the host function when compiled with the host compiler. For example, here is an excerpt from the “blackscholes” example, which is a single .cpp file that can be either compiled with NVCC to run on the GPU, or compiled with the host compiler to run on the CPU.

// Black-Scholes formula for both call and put
    (float *callResult, float *putResult, const float *stockPrice,
     const float *optionStrike, const float *optionYears, float Riskfree,
     float Volatility, int optN)

// ... in main() ...
HEMI_KERNEL_LAUNCH(BlackScholes, gridDim, blockDim, 0, 0,
                   d_callResult, d_putResult, d_stockPrice, d_optionStrike, 
                   d_optionYears, RISKFREE, VOLATILITY, OPT_N);

HEMI_KERNEL_LAUNCH requires grid and block dimensions to be passed to it, but these parameters are ignored when compiled for the host. When DEBUG is defined, HEMI_KERNEL_LAUNCH checks for CUDA launch and run-time errors. You can use HEMI_KERNEL_NAME to access the generated name of the kernel function, for example to pass a function pointer to CUDA API functions like cudaFuncGetAttributes().


For kernel functions with simple independent element-wise parallelism, Hemi provides two functions to enable iterating over elements sequentially in host code or in parallel in device code.

  • hemiGetElementOffset() returns the offset of the current thread within the 1D grid, or zero for host code. In device code, it resolves to blockDim.x * blockIdx.x + threadIdx.x.
  • hemiGetElementStride() returns the size of the 1D grid in threads, or one in host code. In device code, it resolves to gridDim.x * blockDim.x.

The “blackscholes” example demonstrates iteration in the following function, which can be compiled and run as a sequential function on the host or as a CUDA kernel on the device.

// Black-Scholes formula for both call and put
    (float *callResult, float *putResult, const float *stockPrice,
     const float *optionStrike, const float *optionYears, float Riskfree,
     float Volatility, int optN)
    int offset = hemiGetElementOffset();
    int stride = hemiGetElementStride();

    for(int opt = offset; opt < optN; opt += stride)
        // ... compute call and put value based on Black-Scholes formula

Note: the hemiGetElement*() functions are specialized to simple (but common) element-wise parallelism. As such, they may not be useful for arbitrary strides, data sharing, or other more complex parallelism arrangements; but they may serve as examples for creating your own.

Hemi Portable Constants

Global constant values can be defined using the HEMI_DEFINE_CONSTANT macro, which takes a name and an initial value. When compiled with NVCC as CUDA code, this declares two versions of the constant, one __constant__ variable for the device, and one normal host variable. When compiled with a host compiler, only the host variable is defined. For static or external linkage, use the HEMI_DEFINE_STATIC_CONSTANT and HEMI_DEFINE_EXTERN_CONSTANT versions of the macro, respectively. To access variables defined using HEMI_DEFINE_*_CONSTANT macros, use the HEMI_CONSTANT macro which automatically resolves to either the device or host constant depending on whether it is called from device or host code. This means that the proper variable will be chosen when the constant is accessed within functions declared with HEMI_DEV_CALLABLE_* and HEMI_KERNEL macros.

To explicitly access the device version of a constant, use HEMI_DEV_CONSTANT. This is useful when the constant is an argument to a CUDA API function such as cudaMemcpyToSymbol, as shown in the following code from the “nbody_vec4″ example.

                   &ss, sizeof(float), 0, cudaMemcpyHostToDevice)

Hemi Portable Data: hemi::Array

One of the biggest challenges in writing portable CUDA code is memory management. Hemi provides the hemi::Array C++ template class (defined in hemi/array.h), a simple data management container which allows arrays of arbitrary type to be created and used with both host and device code. hemi::Array maintains a host and a device pointer for each array. It lazily transfers data between the host and device as needed when the user requests a pointer to the host or device memory. Pointer requests specify read-only, read/write, or write-only options to keep the valid location of data up-to-date and only copy data when the requested pointer is invalid. hemi::Array supports pinned host memory for efficient PCI-express transfers, and handles CUDA error checking internally.

Here is an excerpt from the nbody_vec4 example.

hemi::Array<Vec4f> bodies(N, true);
hemi::Array<Vec4f> forceVectors(N, true);

randomizeBodies(bodies.writeOnlyHostPtr(), N);

// Call host function defined in a .cpp compilation unit
allPairsForcesHost(forceVectors.writeOnlyHostPtr(), bodies.readOnlyHostPtr(), N);

printf("CPU: Force vector 0: (%0.3f, %0.3f, %0.3f)n", 


// Call device function defined in a .cu compilation unit
// that uses host/device shared functions and class member functions
                   bodies.readOnlyDevicePtr(), N, false);

printf("GPU: Force vector 0: (%0.3f, %0.3f, %0.3f)n", 

Typical CUDA code requires explicit duplication of host allocations on the device, and explicit copy calls between them, along with error checking for all allocations and transfers. The “blackscholes_hemiarray” example demonstrates how much hemi::Array simplifies CUDA C code, doing with 136 lines of code what the “blackscholes” example does in 180 lines.

Hemi CUDA Error Checking

hemi.h provides two convenience functions for checking CUDA errors. checkCuda verifies that its single argument has the value cudaSuccess, and otherwise prints an error message and asserts if #DEBUG is defined. This function is typically wrapped around CUDA API calls, as in the following.

checkCuda( cudaMemcpy(d_stockPrice, stockPrice, OPT_SZ, cudaMemcpyHostToDevice) );

checkCudaErrors takes no arguments and checks the current state of the CUDA context for errors. This function synchronizes the CUDA device (cudaDeviceSynchronize()) to ensure asynchronous launch errors are caught. Both checkCuda and checkCudaErrors act as No-ops when DEBUG is not defined (release builds).

Summary: Mix and Match

I designed Hemi to provide a loosely-coupled set of utilities and examples for creating reusable, portable CUDA C/C++ code. Feel free to use the parts that you need and ignore others, or modify and replace portions to suit the needs of your projects. Or just use it as an example and develop your own utilities for writing flexible and portable CUDA code. If you make changes that you feel would be generally useful, please fork the project on github, commit your changes, and submit a pull request! If you would like to give feedback about Hemi, please leave a comment below or file an issue on Github.