Accelerating a C++ CFD code with OpenACC

Computational Fluid Dynamics (CFD) is a valuable tool to study the behavior of fluids. Today, many areas of engineering use CFD. For example, the automotive industry uses CFD to study airflow around cars, and to optimize the car body shapes to reduce drag and improve fuel efficiency. To get accurate results in fluid simulation it is necessary to capture complex phenomena such as turbulence, which requires very accurate models. These complex models result in very long computing times. In this post I describe how I used OpenACC to accelerate the ZFS C++ CFD solver with NVIDIA Tesla GPUs.

The ZFS flow solver

The C++ flow solver ZFS (Zonal Flow Solver) is developed at the Institute of Aerodynamics at RWTH Aachen, Germany. ZFS solves the unsteady Navier-Stokes equations for compressible flows on automatically generated hierarchical Cartesian grids with a fully-conservative second-order-accurate finite-volume method [1, 2, 3]. To integrate the flow equations in time ZFS uses a 5-step Runge-Kutta method with dual time stepping [2]. It imposes boundary conditions using a ghost-cell method [4] that can handle multiple ghost cells [5, 6]. ZFS supports complex moving boundaries which are sharply discretized using a cut-cell type immersed-boundary method [1, 2, 7].

Among other topics, scientists have used ZFS to study the flow within an internal combustion engine with moving pistons and valves, as Figure 1 shows. Figure 2 shows how the Lattice-Boltzmann solver in ZFS was used to better understand airflow within the human nasal cavity.

NVIDIA Nsight Eclipse Edition for Jetson TK1

NVIDIA® Nsight™ Eclipse Edition is a full-featured, integrated development environment that lets you easily develop CUDA® applications for either your local (x86) system or a remote (x86 or ARM) target. In this post, I will walk you through the process of remote-developing CUDA applications for the NVIDIA Jetson TK1, an ARM-based development kit.

Nsight supports two remote development modes: cross-compilation and “synchronize projects” mode. Cross-compiling for ARM on your x86 host system requires that all of the ARM libraries with which you will link your application be present on your host system. In synchronize-projects mode, on the other hand, your source code is synchronized between host and target systems and compiled and linked directly on the remote target, which has the advantage that all your libraries get resolved on the target system and need not be present on the host. Neither of these remote development modes requires an NVIDIA GPU to be present in your host system.

Note: CUDA cross-compilation tools for ARM are available only in the Ubuntu 12.04 DEB package of the CUDA 6 Toolkit.  If your host system is running a Linux distribution other than Ubuntu 12.04, I recommend the synchronize-projects remote development mode, which I will cover in detail in a later blog post.

CUDA toolkit setup

The first step involved in cross-compilation is installing the CUDA 6 Toolkit on your host system. To get started, let’s download the required Ubuntu 12.04 DEB package from the CUDA download page. Installation instructions can be found in the Getting Started Guide for Linux, but I will summarize them below for CUDA 6.

CUDA Spotlight: GPU-Accelerated Deep Learning

Our Spotlight is on Dr. Ren Wu, a distinguished scientist at Baidu’s Institute of Deep Learning (IDL).

He is known for his pioneering research in using GPUs to accelerate big data analytics and his contribution to large-scale clustering algorithms via the GPU. Ren was a speaker at GTC14 and was originally featured as a CUDA Spotlight in 2011 when he worked at HP Labs.

[Editor's note: On May 16, Baidu announced the hiring of Dr. Andrew Ng to lead Baidu's Silicon Valley Research Lab.]

The following is an excerpt from our interview (read the complete Spotlight here).
______________________________________________________________________________

NVIDIA: Ren, why is GPU computing important to your work?
Ren: A key factor in the progress we are making with deep learning is that we now have much greater computing resources in our hands.

Today one or two workstations with a few GPUs has the same computing power as the fastest supercomputer in the world 15 years ago, thanks to GPU computing and NVIDIA’s vision.

CUDA Dynamic Parallelism API and Principles

This post is the second in a series on CUDA Dynamic Parallelism. In my first post, I introduced Dynamic Parallelism by using it to compute images of the Mandelbrot set using recursive subdivision, resulting in large increases in performance and efficiency. This post is an in-depth tutorial on the ins and outs of programming with Dynamic Parallelism, including synchronization, streams, memory consistency, and limits. My next post will finish the series with a case study on an online track reconstruction algorithm for the high-energy physics PANDA experiment (Facility for Antiproton and Ion Research in Europe (FAIR)).

Grid Nesting and Synchronization

In the CUDA programming model, a group of blocks of threads that are running a kernel is called a grid. In CUDA Dynamic Parallelism, a parent grid launches kernels called child grids. A child grid inherits from the parent grid certain attributes and limits, such as the L1 cache / shared memory configuration and stack size. Note that every thread that encounters a kernel launch executes it. Therefore, if the parent grid has 128 blocks with 64 threads each, and there is no control flow around a child kernel launch, then the grid will perform a total of 8192 kernel launches. If you want a kernel to only launch one child grid per thread block, you should launch the kernel from a single thread of each block as in the following code.

if(threadIdx.x == 0) {
child_k <<< (n + bs - 1) / bs, bs >>> ();
}

AmgX V1.0: Enabling Reservoir Simulation with Classical AMG

Back in January I wrote a post about the public beta availability of AmgX, a linear solver library for large-scale industrial applications.  Since then, AmgX has grown up!  Now we can solve problems that were impossible for us before, due to the addition of “classical” Algebraic Multi-Grid (often called Ruge-Stueben AMG).  V1.0 comes complete with classical AMG multi-GPU support, greatly improved scalability, and we have some nice performance numbers to back it up.

Models of Flow

One specific class of problem has eluded us, until now.  In the oil and gas industry, reservoir simulation is used to predict the behavior of wells producing from large hydrocarbon deposits, and more recently from shale gas or shale oil fields.  These problems are models of flow through porous media, coupled with flow through networks of fractures, piping and processing equipment, but it is the media that makes all the difference.  Oil and gas deposits aren’t like big caves with lakes of oil, they are more like complex, many-layered sponges, each with different pore sizes, stiffness and hydrocarbon content.

Some applications launch many tiny kernels, making them prone to very large (100s of megabytes or larger) nvprof timeline dumps, even for application runs of only a handful of seconds.

Such nvprof files may fail to even load when you try to import them into the NVIDIA Visual Profiler (NVVP). One symptom of this problem is that when you click “Finish” on the import screen, NVVP “thinks” for a minute or so, but then just goes right back to the import screen asking you to click Finish again. In other cases, attempting to load a large file can result in NVVP “thinking” about it for many hours.

It turns out that this problem is because of the Java max heap size setting specified in the libnvvp/nvvp.ini file of the CUDA Toolkit installation: the profiler configures the Java VM to cap the heap size at 1GB in order to work even on systems with minimal physical memory.  While this 1GB value is already an improvement over the 512MB setting used in earlier CUDA versions, it is still not enough for some applications, considering that the memory footprint of the profiler can be at least four to five times larger than the input file size.

Adaptive Parallel Computation with CUDA Dynamic Parallelism

Early CUDA programs had to conform to a flat, bulk parallel programming model. Programs had to perform a sequence of kernel launches, and for best performance each kernel had to expose enough parallelism to efficiently use the GPU. For applications consisting of “parallel for” loops the bulk parallel model is not too limiting, but some parallel patterns—such as nested parallelism—cannot be expressed so easily. Nested parallelism arises naturally in many applications, such as those using adaptive grids, which are often used in real-world applications to reduce computational complexity while capturing the relevant level of detail. Flat, bulk parallel applications have to use either a fine grid, and do unwanted computations, or use a coarse grid and lose finer details.

CUDA 5.0 introduced Dynamic Parallelism, which makes it possible to launch kernels from threads running on the device; threads can launch more threads. An application can launch a coarse-grained kernel which in turn launches finer-grained kernels to do work where needed. This avoids unwanted computations while capturing all interesting details, as Figure 1 shows.

Dynamic parallelism is generally useful for problems where nested parallelism cannot be avoided. This includes, but is not limited to, the following classes of algorithms:

• algorithms using hierarchical data structures, such as adaptive grids;
• algorithms using recursion, where each level of recursion has parallelism, such as quicksort;
• algorithms where work is naturally split into independent batches, where each batch involves complex parallel processing but cannot fully use a single GPU.

Dynamic parallelism is available in CUDA 5.0 and later on devices of Compute Capability 3.5 or higher (sm_35). (See NVIDIA GPU Compute Capabilities.)

This post introduces Dynamic Parallelism by example using a fast hierarchical algorithm for computing images of the Mandelbrot set.  This is the first of a three part series on CUDA Dynamic Parallelism:

CUDACasts Episode 19: CUDA 6 Guided Performance Analysis with the Visual Profiler

One of the main reasons for accelerating code on an NVIDIA GPU is for an increase in application performance. This is why it’s important to use the best tools available to help you get the performance you’re looking for. CUDA 6 includes great improvements to the guided analysis tool in the NVIDIA Visual Profiler. Watch today’s CUDACast to see how to use guided analysis to locate potential optimizations for your GPU code.

You can find the code used in this video in the CUDACasts GitHub repository.

CUDA Spotlight: GPU-Accelerated Deep Neural Networks

This week’s Spotlight is on Dr. Dan Ciresan, a senior researcher at IDSIA in Switzerland and a pioneer in using CUDA for Deep Neural Networks (DNNs).

His methods have won international competitions on topics such as classifying traffic signs and recognizing handwritten Chinese characters. Dan presented a session on Deep Neural Networks for Visual Pattern Recognition at GTC in March 2014.

The following is an excerpt from our interview (read the complete Spotlight here).

$G(i, j, \theta) = \begin{bmatrix} 1 & \cdots & 0 & \cdots & 0 & \cdots & 0 \\ \vdots & \ddots & \vdots & & \vdots & & \vdots \\ 0 & \cdots & c & \cdots & -s & \cdots & 0 \\ \vdots & & \vdots & \ddots & \vdots & & \vdots \\ 0 & \cdots & s & \cdots & c & \cdots & 0 \\ \vdots & & \vdots & & \vdots & \ddots & \vdots \\ 0 & \cdots & 0 & \cdots & 0 & \cdots & 1 \end{bmatrix}$,
where the intersections of the $i$th and $j$th columns contain the values $c=cos \theta$ and $s=sin \theta$. Multiplying a vector $x$ by a Givens rotation matrix $G(i, j, \theta)$ represents a rotation of the vector $x$ in the $(i, j)$ plane by $\theta$ radians.