Finite Difference Methods in CUDA C/C++, Part 1

In the previous CUDA C/C++ post we investigated how to use shared memory to optimize a matrix transpose, achieving roughly an order of magnitude improvement in effective bandwidth by using shared memory to coalesce global memory access. In today’s post we show how to use shared memory to enhance data reuse in a finite difference code. In addition to shared memory, we will also discuss constant memory, a cached read-only memory optimized for uniform access across threads in a block (or warp).

Problem Statement: 3D Finite Difference

Our example uses a three-dimensional grid of size 64³. For simplicity we assume periodic boundary conditions and only consider first-order derivatives, although extending the code to calculate higher-order derivatives with other types of boundary conditions is straightforward.

The finite difference method essentially uses a weighted summation of function values at neighboring points to approximate the derivative at a particular point. For a (2N+1)-point stencil with uniform spacing ∆x in the x direction, the following equation gives a central finite difference scheme for the derivative in x. Equations for the other directions are similar.
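In symbols, such a scheme can be written as follows (a reconstruction consistent with the description above, with the coefficients Ci discussed next):

```latex
\frac{\partial f}{\partial x}(x, y, z) \approx
  \frac{1}{\Delta x} \sum_{i=-N}^{N} C_i \, f(x + i\,\Delta x,\, y,\, z)
```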

The coefficients Ci are typically generated from Taylor series expansions and can be chosen to obtain a scheme with desired characteristics such as accuracy, and in the context of partial differential equations, dispersion and dissipation. For explicit finite difference schemes such as the type above, larger stencils typically have a higher order of accuracy. For this post we use a nine-point stencil which has eighth-order accuracy. We also choose a symmetric stencil, shown in the following equation.
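Written out in the grid-index notation used below, the symmetric nine-point stencil takes the following form (reconstructed to match the coefficients and the kernel code later in the post):

```latex
\frac{\partial f}{\partial x}\bigg|_{i,j,k} \approx
    a_x \left( f_{i+1,j,k} - f_{i-1,j,k} \right)
  + b_x \left( f_{i+2,j,k} - f_{i-2,j,k} \right)
  + c_x \left( f_{i+3,j,k} - f_{i-3,j,k} \right)
  + d_x \left( f_{i+4,j,k} - f_{i-4,j,k} \right)
```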

Here we specify values of the function on the computational grid using the grid indices i, j, k rather than the physical coordinates x, y, z. The coefficients are ax = 4/(5∆x), bx = −1/(5∆x), cx = 4/(105∆x), and dx = −1/(280∆x), which is a typical eighth-order scheme. For derivatives in the y and z directions, the index offsets in the above equation are simply applied to the j and k indices, and the coefficients are the same except that ∆y and ∆z are used in place of ∆x.

Because we calculate an approximation to the derivative at each point on the 64³ periodic grid, the value of f at each point is used eight times: once for each right-hand-side term in the above expression. In designing a derivative kernel, our goal is to exploit this data reuse by storing blocks of data in shared memory to ensure we fetch the values of f from global memory as few times as possible.

Data Reuse and Shared Memory

We employ a tiling approach in which each thread block loads a tile of data from the multidimensional grid into shared memory, so that each thread in the block can access all elements of the shared memory tile as needed. How do we choose the best tile shape and size? Some experimentation is required, but characteristics of the finite difference stencil and grid size provide direction.

When choosing a tile shape for stencil calculations, the tiles typically overlap by half of the stencil size, as depicted on the left in the figure below.


Here, in order to calculate the derivative in a 16 × 16 tile (in yellow), the values of f not only from this tile but also from two additional 4 × 16 sections (in orange) must be loaded by each thread block. Overall, the f values in the orange sections get loaded twice, once by the thread block that calculates the derivative at that location, and once by each neighboring thread block. As a result, 8 × 16 values out of 16 × 16, or half of the values, get loaded from global memory twice. In addition, coalescing on devices with a compute capability of 2.0 and higher will be suboptimal with 16 × 16 tiles because perfect coalescing on such devices requires access to data within 32 contiguous elements in global memory per load.

A better choice of tile (and thread block) size for our problem and data size is a 64 × 4 tile, as depicted on the right in the figure above. This tile avoids overlap altogether when calculating the x-derivative for our one-dimensional stencil on a grid of 64³, since the tile contains all points in the direction of the derivative. A minimal tile would have just one pencil, i.e. a one-dimensional array of all points in a direction. This would correspond to thread blocks of only 64 threads, however, so from an occupancy standpoint it is beneficial to use multiple pencils per tile. In our finite difference code, available for download on GitHub, we parameterize the number of pencils to allow experimentation. In addition to loading each value of f only once, every warp of threads loads contiguous data from global memory using this tile, and therefore achieves perfectly coalesced accesses to global memory.

X-Derivative Implementation

The implementation for the x-derivative kernel follows.

__global__ void derivative_x(float *f, float *df)
{
  __shared__ float s_f[sPencils][mx+8]; // 4-wide halo

  int i  = threadIdx.x;
  int j  = blockIdx.x*blockDim.y + threadIdx.y;
  int k  = blockIdx.y;
  int si = i + 4;       // local i for shared memory access + halo offset
  int sj = threadIdx.y; // local j for shared memory access

  int globalIdx = k * mx * my + j * mx + i;

  s_f[sj][si] = f[globalIdx];

  __syncthreads();

  // fill in periodic images in shared memory array
  if (i < 4) {
    s_f[sj][si-4]  = s_f[sj][si+mx-5];
    s_f[sj][si+mx] = s_f[sj][si+1];
  }

  __syncthreads();

  df[globalIdx] =
    ( c_ax * ( s_f[sj][si+1] - s_f[sj][si-1] )
    + c_bx * ( s_f[sj][si+2] - s_f[sj][si-2] )
    + c_cx * ( s_f[sj][si+3] - s_f[sj][si-3] )
    + c_dx * ( s_f[sj][si+4] - s_f[sj][si-4] ) );
}

Here mx, my, and mz are the array dimensions, which are set to 64, and sPencils is 4, the number of pencils used to make the shared memory tile. The indices i, j, and k correspond to the coordinates in the 64³ mesh. The indices si and sj are the local (row, column) coordinates for the shared memory tile. This kernel is launched with blocks of 64 × sPencils threads, which calculate the derivatives on an x × y tile of 64 × sPencils elements, so each thread calculates the derivative at one point.

The shared memory tile is declared with a padding of 4 elements at each end of the x dimension to accommodate the periodic images needed to calculate the derivative at the endpoints.

__shared__ float s_f[sPencils][mx+8]; // 4-wide halo

We compute an index into the linear input array and then read from global memory into the shared memory tile.

int globalIdx = k * mx * my + j * mx + i;
s_f[sj][si] = f[globalIdx];

These reads from global memory are perfectly coalesced. Data are copied within shared memory to fill out the periodic images in the x-direction by the following code.

// fill in periodic images in shared memory array
if (i < 4) {
  s_f[sj][si-4]  = s_f[sj][si+mx-5];
  s_f[sj][si+mx] = s_f[sj][si+1];
}

Note that in this operation we are reading from shared memory, not from global memory. Each element is read from global memory only once by the previous statement. Since a thread reads data from shared memory that another thread wrote, we need a __syncthreads() call before the periodic images are written. Likewise, we need a __syncthreads() call after the periodic images are written since they are accessed by different threads. After the second __syncthreads() call, our finite difference approximation is calculated using the following code.

df[globalIdx] = 
  ( c_ax * ( s_f[sj][si+1] - s_f[sj][si-1] )
  + c_bx * ( s_f[sj][si+2] - s_f[sj][si-2] )
  + c_cx * ( s_f[sj][si+3] - s_f[sj][si-3] )
  + c_dx * ( s_f[sj][si+4] - s_f[sj][si-4] ) );

The coefficients c_ax, c_bx, c_cx, and c_dx are declared external to this kernel in constant memory.

Constant Memory

Constant memory is a special-purpose memory space that is read-only from device code, but can be read and written by the host. Constant memory resides in device DRAM, and is cached on chip. The constant memory cache has only one read port, but can broadcast data from this port across a warp. This means that constant memory access is effective when all threads in a warp read the same address, but when threads in a warp read different addresses the reads are serialized. Constant memory is perfect for coefficients and other data that are used uniformly across threads, as is the case with our coefficients c_ax, c_bx, etc.

In CUDA C/C++, constant data must be declared at global scope; it can be read (only) by device code, and read or written by host code. Constant memory is used in device code the same way any CUDA C variable or array/pointer is used, but it must be initialized from host code using cudaMemcpyToSymbol or one of its variants. In our finite difference code we have the following constant declarations, which use the __constant__ declaration specifier.

// stencil coefficients
__constant__ float c_ax, c_bx, c_cx, c_dx;
__constant__ float c_ay, c_by, c_cy, c_dy;
__constant__ float c_az, c_bz, c_cz, c_dz;

We initialize the x coefficients using the following code, and the y and z coefficients are initialized similarly.

  // stencil weights (for unit length problem)
  float dsinv = mx-1.f;

  float ax =  4.f / 5.f   * dsinv;
  float bx = -1.f / 5.f   * dsinv;
  float cx =  4.f / 105.f * dsinv;
  float dx = -1.f / 280.f * dsinv;
  cudaMemcpyToSymbol(c_ax, &ax, sizeof(float), 0, cudaMemcpyHostToDevice);
  cudaMemcpyToSymbol(c_bx, &bx, sizeof(float), 0, cudaMemcpyHostToDevice);
  cudaMemcpyToSymbol(c_cx, &cx, sizeof(float), 0, cudaMemcpyHostToDevice);
  cudaMemcpyToSymbol(c_dx, &dx, sizeof(float), 0, cudaMemcpyHostToDevice);

X Derivative Performance

The derivative_x kernel achieves the following performance on a Tesla K20c GPU.

Using shared memory tile of 64 x 4
   RMS error: 7.277675e-06
   MAX error: 2.861023e-05
   Average time (ms): 0.025912
   Average Bandwidth (GB/s): 80.933625

We will use this performance as a basis for comparison for derivatives in the other directions, which we will cover in the next CUDA C/C++ post.

  • jun peng

    My experiments show that if sPencils=1 the result will be better

    • Mark Ebersole

      Can you provide more details on the hardware you’re using?

      • jun peng

        My program ran on a Quadro K4000 card.

  • Dinesh Kumar

Would it be more or less efficient if we calculated all the derivatives in the same kernel with smaller (say 16 x 16) blocks? How does data reuse compare with coalesced memory access, which to be fair may not be perfect in practical cases?

  • Ernest Yeung

    Arbitrarily (large) sized grids – naively, I changed mx=my=mz for the grid size (originally 64^3) to 92^3 (i.e. mx=my=mz=92) and anything above 92, I obtain a Segmentation fault (core dumped). I was simply curious what was happening; is it a limitation on the GPU hardware? If so, which parameter? I’m on a NVIDIA GeForce GTX 980Ti.

    • The program won’t let you set mx/my/mz to 92 because it is not a whole multiple of 32. It will let you set it to 96. When I do this on a Tesla M40 (same compute capability as your 980Ti) I get no errors. When I try it on my older 750 M (in my laptop), I get a segfault and I’m having trouble finding the root cause. What version of CUDA are you using?

      • I think I figured it out. It looks like it was a stack overflow caused by the arrays f, sol, etc in the function runTest. I didn’t get any error, but when they were too large, it would crash as soon as the function is called. I checked in a fix — just changed to allocate these three arrays on the heap with new instead of making them stack arrays.

        Let me know if this lets you use larger sizes.

        • Ernest Yeung

          @Mark_Harris:disqus Thanks for looking into this.

          With your latest reply, I’m assuming it’s on the github repository? I’ll go there right now.

Just for completeness: regarding the second-to-last reply, once I was obtaining Segmentation faults for multiples of 32, e.g. 128^3, I made a copy of the code, commented out the “long pencils” lpencils, tested only the “small pencils” derivatives, and tried to find where it fails (small pencils, unless I’m wrong, only need multiples of 4 for the tile sizes), and “got up to” 92.

          Here’s the device name print out and compute capability; I can print out more info if it may help in the next reply (i.e. cudaGetDeviceProperties)

          Device name: GeForce GTX 980 Ti
          Compute Capability: 5.2

          Segmentation fault (core dumped)

          I’m on CUDA ToolKit 7.5

          P.S. I became very interested in your post because I wanted to implement 3-dim. Navier-Stokes Eq. with finite-difference methods on CUDA C/C++ and obviously derivatives are needed for the gradient and divergence. Has this already been done? (and is it on github?) I Google and github searched as much as I could already.

        • Ernest Yeung

I made a copy and changed the grid size to 352^3 – for any choice above 352, e.g. mx=384 with my=mz=352, or all three mx=my=mz=384, I obtain a compilation error from nvcc (only in the direction with too large a grid dimension):

          ptxas error : Entry function ‘_Z21derivative_x_lPencilsPfS_’ uses too much shared data (0xc400 bytes, 0xc000 max)

Is this because of an inherent limitation of the GTX 980 Ti itself? (I can run cudaGetDeviceProperties and print out the max threads per dimension.) Or is it something inherently limited about shared data per block, so this method can’t scale beyond the GPU’s shared data per block? For instance, I can do a naive implementation of derivatives on a total of 1920 threads in one dimension (1920x1920x32) in the __global__ memory of the device GPU, with 16x16x4 threads per block; any much “bigger” kernel launch I do results in a segmentation fault. Using the pencil method as an improvement on the “naive” shared memory method (i.e. tiling per block, with 2 “dark orange” region blocks, the 4×16 in the very first example in the article above), I believe you’d need a 1920×4 tile, which may not work if I can only compute the derivative on a 352x352x352 grid? (note 352^3 < 1920*1920*32)

          • You need to modify the code to decouple your total domain size from the tile size. The tile size is limited by available shared memory (see the post on shared memory). But there’s no reason you can’t do domains that are much larger (limited by device memory, or system memory if you use Unified Memory).

          • Ernest Yeung

I don’t understand what it means to “decouple” the total domain size from the tile size. Let’s take a concrete example. The dimensions of the desired grid are (mx,my,mz). The size of a tile, as above in this article, is mx*sPencils (for the case of the x-direction). This seems to be the heart of the “pencils” scheme: choosing the size of the tile in the direction of the derivative to be the entire mx (for the x-direction). This scheme is unlikely to scale at all, as you can see from the grid and block dimensions launched for each of the kernel calls derivative_x (and derivative_y, derivative_z):

            (my/spencils, mz,1) for the grid dimensions (number of blocks in each dimension) and
            (mx,spencils,1) for the block dimensions (number of threads in each (thread) block)

            Then you’d have mx*spencils*1 threads to launch on each block, which is severely limited by the maximum number of threads per block.

I do cudaGetDeviceProperties (or run the code off my github repository) and on the GTX 980 Ti I get 1024 max threads per block.

            However, the number of blocks that can be launched on the grid is much higher (Max grid dimensions: (2147483647, 65535, 65535) ) and so to possibly “save” the pencil method would be to try to launch more blocks.

Also note that we’re, in this code, demanding __shared__ float s_f[sPencils][mx+8], an array of floats of size sPencils*(mx+8), and if we want our grid to scale to larger values than 64, say mx = 1920, then we’d have severe demands on the shared memory. I’m not sure the domain size, mx in this case, can be decoupled from the tile size – which I think is, correct me if I’m wrong, this __shared__ 2-dim. array, which depends directly on mx – without throwing out this idea of pencils entirely.

I don’t know how to specifically modify the example here to scale beyond a grid of 354^3; I’m suspecting this is the maximum because of the max threads per block (1024) for the GTX 980 Ti. I’ve tried changing the sizes, but I also don’t understand how to decouple this domain size from the tile size without throwing out the entire idea of pencils.

I put a write-up of this post, because, at least to me, writing in LaTeX is much clearer, here:
pp. 30, the section “Note on finite-difference methods on the shared memory of the device GPU, in particular, the pencil method, that attempts to improve upon the double loading of boundary ‘halo’ cells (of the grid).”

Threads can execute loops, so you can loop and process multiple tiles of the entire domain using a single thread block. Don’t try to fit the entire domain in shared memory at once. That’s what I mean by decouple.

I’m sorry Ernest, but I have to leave this as an exercise for you; I don’t have time to solve it for you. I didn’t develop this approach, I only translated the Fortran posts that Greg Ruetsch wrote to C++.

  • Tom’s

    Thanks a lot for these clear tutorials!

    When I run the FDM code on my GTX970 with CUDA 8, there is almost no difference between 4×64 and 32×64 for Y and Z derivatives. Does it mean that coalescence is not so important any more ? :)

  • Chih-Che Chueh

    Hi Mark,

Thanks for this wonderful CUDA C post, from the series of great tutorials that you made. I have learned a lot from these posts over the past few months.

After going through this post carefully, I would like to raise a question: I think there might be a typo in the kernel “derivative_x”, where the “j” index, representing the index in the y direction, is equal to blockIdx.x*blockDim.y + threadIdx.y. I think that, for the grid[0][0] and block[0][0] case, “blockDim.y” should be replaced with “sPencils” (where sPencils = 4), so that j = blockIdx.x*sPencils + threadIdx.y, in order to avoid a mismatch. Am I right?

    I look forward to hearing from you. Thanks!