
An Easy Introduction to CUDA C and C++

This post is the first in a series on CUDA C and C++, which is the C/C++ interface to the CUDA parallel computing platform. This series of posts assumes familiarity with programming in C. We will be running a parallel series of posts about CUDA Fortran targeted at Fortran programmers. These two series will cover the basic concepts of parallel computing on the CUDA platform. From here on, unless I state otherwise, I will use the term “CUDA C” as shorthand for “CUDA C and C++”. CUDA C is essentially C/C++ with a few extensions that allow one to execute functions on the GPU using many threads in parallel.

CUDA Programming Model Basics

Before we jump into CUDA C code, those new to CUDA will benefit from a basic description of the CUDA programming model and some of the terminology used.

The CUDA programming model is a heterogeneous model in which both the CPU and GPU are used. In CUDA, the host refers to the CPU and its memory, while the device refers to the GPU and its memory. Code run on the host can manage memory on both the host and device, and also launches kernels which are functions executed on the device. These kernels are executed by many GPU threads in parallel.

Given the heterogeneous nature of the CUDA programming model, a typical sequence of operations for a CUDA C program is:

  1. Declare and allocate host and device memory.
  2. Initialize host data.
  3. Transfer data from the host to the device.
  4. Execute one or more kernels.
  5. Transfer results from the device to the host.

Keeping this sequence of operations in mind, let’s look at a CUDA C example.

A First CUDA C Program

In a recent post, I illustrated Six Ways to SAXPY, which includes a CUDA C version. SAXPY stands for “Single-precision A*X Plus Y”, and is a good “hello world” example for parallel computation. In this post I will dissect a more complete version of the CUDA C SAXPY, explaining in detail what is done and why. The complete SAXPY code is:

#include <stdio.h>
#include <math.h>

__global__
void saxpy(int n, float a, float *x, float *y)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
}

int main(void)
{
  int N = 1<<20;
  float *x, *y, *d_x, *d_y;
  x = (float*)malloc(N*sizeof(float));
  y = (float*)malloc(N*sizeof(float));

  cudaMalloc(&d_x, N*sizeof(float)); 
  cudaMalloc(&d_y, N*sizeof(float));

  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);

  // Perform SAXPY on 1M elements
  saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);

  cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);

  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmaxf(maxError, fabsf(y[i]-4.0f));
  printf("Max error: %f\n", maxError);

  // Free device and host memory
  cudaFree(d_x);
  cudaFree(d_y);
  free(x);
  free(y);
}

The function saxpy is the kernel that runs in parallel on the GPU, and the main function is the host code. Let’s begin our discussion of this program with the host code.

Host Code

The main function declares two pairs of arrays.

  float *x, *y, *d_x, *d_y;
  x = (float*)malloc(N*sizeof(float));
  y = (float*)malloc(N*sizeof(float));

  cudaMalloc(&d_x, N*sizeof(float));
  cudaMalloc(&d_y, N*sizeof(float));

The pointers x and y point to the host arrays, allocated with malloc in the typical fashion, and d_x and d_y point to device arrays allocated with the cudaMalloc function from the CUDA runtime API. The host and device in CUDA have separate memory spaces, both of which can be managed from host code (CUDA C kernels can also allocate device memory on devices that support it).
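As an aside, here is a minimal sketch of that last point: on devices of compute capability 2.0 and higher, device code can call malloc and free on the device heap. The kernel name and buffer size below are illustrative only and are not part of the SAXPY example.

__global__ void deviceAllocExample()
{
  // Each thread allocates a small scratch buffer from the device heap,
  // uses it, and frees it before the kernel returns.
  float *scratch = (float*)malloc(4 * sizeof(float));
  if (scratch) {
    scratch[0] = 0.0f;
    free(scratch);
  }
}

The SAXPY example itself does all of its allocation from the host.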

The host code then initializes the host arrays.  Here we set x to an array of ones, and y to an array of twos.

  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

To initialize the device arrays, we simply copy the data from x and y to the corresponding device arrays d_x and d_y using cudaMemcpy, which works just like the standard C memcpy function, except that it takes a fourth argument which specifies the direction of the copy. In this case we use cudaMemcpyHostToDevice to specify that the first (destination) argument is a device pointer and the second (source) argument is a host pointer.

  cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);
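Like most CUDA runtime functions, cudaMemcpy also returns an error code of type cudaError_t. The example ignores it to keep the code short (error handling is the subject of a follow-up post), but a minimal check might look like this sketch:

  cudaError_t err = cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
  if (err != cudaSuccess)
    printf("cudaMemcpy failed: %s\n", cudaGetErrorString(err));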

After running the kernel, to get the results back to the host, we copy from the device array pointed to by d_y to the host array pointed to by y by using cudaMemcpy with cudaMemcpyDeviceToHost.

cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);

Launching a Kernel

The saxpy kernel is launched by the statement:

saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);

The information between the triple chevrons is the execution configuration, which dictates how many device threads execute the kernel in parallel. In CUDA there is a hierarchy of threads in software which mimics how thread processors are grouped on the GPU. In the CUDA programming model we speak of launching a kernel with a grid of thread blocks. The first argument in the execution configuration specifies the number of thread blocks in the grid, and the second specifies the number of threads in a thread block.

Thread blocks and grids can be made one-, two- or three-dimensional by passing dim3 (a simple struct defined by CUDA with x, y, and z members) values for these arguments, but for this simple example we only need one dimension so we pass integers instead. In this case we launch the kernel with thread blocks containing 256 threads, and use integer arithmetic to determine the number of thread blocks required to process all N elements of the arrays ((N+255)/256).
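For instance, a two-dimensional launch over a hypothetical width-by-height image might configure its execution like this (the kernel name, image pointer, and dimensions are illustrative and not part of the SAXPY example):

dim3 blockSize(16, 16);   // 16 x 16 = 256 threads per block
dim3 gridSize((width  + blockSize.x - 1) / blockSize.x,
              (height + blockSize.y - 1) / blockSize.y);
processImage<<<gridSize, blockSize>>>(d_image, width, height);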

For cases where the number of elements in the arrays is not evenly divisible by the thread block size, the kernel code must check for out-of-bounds memory accesses.

Device Code

We now move on to the kernel code.

__global__
void saxpy(int n, float a, float *x, float *y)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
}

In CUDA, we define kernels such as saxpy using the __global__ declaration specifier. Variables defined within device code do not need to be specified as device variables because they are assumed to reside on the device. In this case the n, a and i variables will be stored by each thread in a register, and the pointers x and y must be pointers to the device memory address space. This is indeed true because we passed d_x and d_y to the kernel when we launched it from the host code. The first two arguments, n and a, however, were not explicitly transferred to the device in host code. Because function arguments are passed by value by default in C/C++, the CUDA runtime can automatically handle the transfer of these values to the device. This feature of the CUDA Runtime API makes launching kernels on the GPU very natural and easy—it is almost the same as calling a C function.

There are only two lines in our saxpy kernel. As mentioned earlier, the kernel is executed by multiple threads in parallel. If we want each thread to process an element of the resultant array, then we need a means of distinguishing and identifying each thread. CUDA defines the variables blockDim, blockIdx, and threadIdx. These predefined variables are of type dim3, analogous to the execution configuration parameters in host code. The predefined variable blockDim contains the dimensions of each thread block as specified in the second execution configuration parameter for the kernel launch. The predefined variables threadIdx and blockIdx contain the index of the thread within its thread block and the thread block within the grid, respectively. The expression:

    int i = blockDim.x * blockIdx.x + threadIdx.x

generates a global index that is used to access elements of the arrays. We didn’t use it in this example, but there is also gridDim which contains the dimensions of the grid as specified in the first execution configuration parameter to the launch.
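As a side note, one place gridDim commonly does appear is in a “grid-stride” variant of the same kernel, in which each thread strides through the array by the total number of threads in the grid. The following is just an illustrative sketch, not part of the example above:

__global__
void saxpy_stride(int n, float a, float *x, float *y)
{
  // blockDim.x * gridDim.x is the total number of threads in the grid.
  for (int i = blockIdx.x*blockDim.x + threadIdx.x; i < n;
       i += blockDim.x * gridDim.x)
    y[i] = a*x[i] + y[i];
}

In the simple SAXPY kernel above, however, each thread handles at most one element, which is why the bounds check matters.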

Before this index is used to access array elements, its value is checked against the number of elements, n, to ensure there are no out-of-bounds memory accesses. This check is required for cases where the number of elements in an array is not evenly divisible by the thread block size, and as a result the number of threads launched by the kernel is larger than the array size. The second line of the kernel performs the element-wise work of the SAXPY, and other than the bounds check, it is identical to the inner loop of a host implementation of SAXPY.

if (i < n) y[i] = a*x[i] + y[i];
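For comparison, a serial host implementation of the same operation might look like the following plain C sketch (not part of the example code):

void saxpy_cpu(int n, float a, float *x, float *y)
{
  // The same element-wise update, executed by a single CPU thread.
  for (int i = 0; i < n; i++)
    y[i] = a*x[i] + y[i];
}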

Compiling and Running the Code

The CUDA C compiler, nvcc, is part of the NVIDIA CUDA Toolkit.  To compile our SAXPY example, we save the code in a file with a .cu extension, say saxpy.cu. We can then compile it with nvcc.

nvcc -o saxpy saxpy.cu

We can then run the code:

% ./saxpy
Max error: 0.000000

Summary and Conclusions

With this walkthrough of a simple CUDA C implementation of SAXPY, you now know the basics of programming CUDA C. There are only a few extensions to C required to “port” a C code to CUDA C: the __global__ declaration specifier for device kernel functions; the execution configuration used when launching a kernel; and the built-in device variables blockDim, blockIdx, and threadIdx used to identify and differentiate GPU threads that execute the kernel in parallel.

One advantage of the heterogeneous CUDA programming model is that porting an existing code from C to CUDA C can be done incrementally, one kernel at a time.

In the next post of this series, we will look at some performance measurements and metrics.

Note: this post is based on the post “An Easy Introduction to CUDA Fortran” by Gregory Ruetsch.


About Mark Harris

Mark is Chief Technologist for GPU Computing Software at NVIDIA. Mark has fifteen years of experience developing software for GPUs, ranging from graphics and games, to physically-based simulation, to parallel algorithms and high-performance computing. Mark has been using GPUs for general-purpose computing since before they even supported floating point arithmetic. While a Ph.D. student at UNC he recognized this nascent trend and coined a name for it: GPGPU (General-Purpose computing on Graphics Processing Units), and started GPGPU.org to provide a forum for those working in the field to share and discuss their work. Follow @harrism on Twitter
  • Nisarg Shah

    I think there’s an error in the line “In this case we use cudaMemcpyHostToDevice to specify that the first argument is a host pointer and the second argument is a device pointer.”. Shouldn’t it be “In this case we use cudaMemcpyHostToDevice to specify that the first argument is a device pointer and the second argument is a host pointer.”? I think you exchanged host pointer with device pointer.

    • http://www.markmark.net/ Mark Harris

      Thanks Nisarg, good catch. I’ve fixed this error.

  • Charles Turner

    Hi, thanks for this tutorial!

    I have a GeForce 210 card, and when I run this program, I get

    Max error: 2.000000

    Whereas you see a max error of zero. It seems like the y[i] array is not getting operated on with my computer setup, but I get no compiler errors with nvcc. When I print out the result of max(maxError, abs(y[i]-4.0f)), I see the value 2.000000 every time, indicating that nothing happened to the y array on the device.

    Do you have any advice on what might be going wrong here?

    Thanks,
    Charles.

    • http://www.markmark.net/ Mark Harris

      I suspect a CUDA error. Unfortunately the code example doesn’t check for errors, because that is the lesson of the follow up post. Can you add error checking as shown in the post http://devblogs.nvidia.com/parallelforall/how-query-device-properties-and-handle-errors-cuda-cc/ and see what errors you see?

      • Charles Turner

        Ah, I did have some errors. I was using Ubuntu 13.10 and tried some “workarounds” involving third-party PPAs since this isn’t an officially supported platform. I couldn’t get the workarounds to work it seems, but instead of tracking down the problem, I installed 12.04 LTS and followed the official instructions, now everything seems to be working and I get the “Max error: 0.00000” output now. Yay!

        Thank you for taking the time to help me, Mark.

        • http://www.markmark.net/ Mark Harris

          My pleasure, glad you got it working.

  • dylanholmes206

    There are typos in the first paragraphs of both the Device Code and Host Code sections, x_d and y_d should be d_x and d_y.

    • http://www.markmark.net/ Mark Harris

      I’ve fixed these. Thank you!

  • rogers

    I got the error: #include expects filename, what is the filename?

    • http://www.markmark.net/ Mark Harris

      Sorry about that, the code got mangled by the site. I’ve fixed it. (The filename is stdio.h.)