In the last CUDA C++ post we dove into 3D finite difference computations in CUDA C/C++, demonstrating how to implement the *x* derivative part of the computation. In this post, let’s continue by exploring how we can write efficient kernels for the *y* and *z* derivatives. As with the previous post, code for the examples in this post is available for download on GitHub.

# Y and Z Derivatives

We can easily modify the *x* derivative code to operate in the other directions. In the *x* derivative, each thread block calculates the derivatives in an *x*, *y* tile of 64 × `sPencils` elements. For the *y* derivative, we can have a thread block calculate the derivative on a tile of `sPencils` × 64 elements in *x*, *y*, as depicted on the left in the figure below.
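To make the `sPencils` × 64 tile layout concrete, here is a minimal host-side sketch in plain C++ that emulates the block and thread loops on the CPU, so the index mapping can be checked without a GPU. It is not the kernel from this post. Several things are assumptions made only for illustration: `sPencils = 4`, a 64³ grid, an x-fastest memory layout, a periodic second-order central difference, and the names `globalIndex` and `derivativeY`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative sizes. The 64-point extent comes from the text; the value of
// sPencils and the cubic grid are assumptions made for this sketch.
constexpr int mx = 64, my = 64, mz = 64;
constexpr int sPencils = 4;

// Linear index with x varying fastest, then y, then z (assumed layout).
inline int globalIndex(int i, int j, int k) { return (k * my + j) * mx + i; }

// CPU emulation of a y-derivative launch: each "block" owns an
// sPencils x my tile in (x, y) at one fixed z plane.
std::vector<float> derivativeY(const std::vector<float>& f, float dy)
{
    assert(mx % sPencils == 0);  // tiles must cover the x extent exactly
    assert(f.size() == static_cast<std::size_t>(mx) * my * mz);
    std::vector<float> df(f.size());

    for (int k = 0; k < mz; ++k)                     // would be blockIdx.y
    for (int bx = 0; bx < mx / sPencils; ++bx) {     // would be blockIdx.x
        float tile[my][sPencils];                    // stand-in for shared memory

        // Load phase: neighbouring si values read neighbouring x addresses.
        for (int j = 0; j < my; ++j)                 // would be threadIdx.y
            for (int si = 0; si < sPencils; ++si)    // would be threadIdx.x
                tile[j][si] = f[globalIndex(bx * sPencils + si, j, k)];

        // Compute phase: periodic second-order central difference along y
        // (a simplified stencil chosen for brevity).
        for (int j = 0; j < my; ++j)
            for (int si = 0; si < sPencils; ++si) {
                const int jp = (j + 1) % my;
                const int jm = (j + my - 1) % my;
                df[globalIndex(bx * sPencils + si, j, k)] =
                    (tile[jp][si] - tile[jm][si]) / (2.0f * dy);
            }
    }
    return df;
}
```

In this sketch the tile is `sPencils` wide in *x*, so each row of the tile covers `sPencils` consecutive addresses in memory, while the full 64-point *y* extent needed by the stencil still sits inside a single tile.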

Likewise, for the *z* derivative, a thread block can calculate the derivative in an *x*, *z* tile of `sPencils` × 64 elements. The kernel below shows the *y* derivative kernel using this approach.