Accelerating a C++ CFD code with OpenACC

Computational Fluid Dynamics (CFD) is a valuable tool for studying the behavior of fluids, and many areas of engineering use it today. For example, the automotive industry uses CFD to study airflow around cars and to optimize car body shapes to reduce drag and improve fuel efficiency. Accurate fluid simulations must capture complex phenomena such as turbulence, which requires very accurate models. These complex models result in very long computing times. In this post I describe how I used OpenACC to accelerate the ZFS C++ CFD solver with NVIDIA Tesla GPUs.

The ZFS flow solver

Figure 1: Using ZFS to study fluid flow within an internal combustion engine with moving pistons and valves.

The C++ flow solver ZFS (Zonal Flow Solver) is developed at the Institute of Aerodynamics at RWTH Aachen, Germany. ZFS solves the unsteady Navier-Stokes equations for compressible flows on automatically generated hierarchical Cartesian grids with a fully-conservative second-order-accurate finite-volume method [1, 2, 3]. To integrate the flow equations in time ZFS uses a 5-step Runge-Kutta method with dual time stepping [2]. It imposes boundary conditions using a ghost-cell method [4] that can handle multiple ghost cells [5, 6]. ZFS supports complex moving boundaries which are sharply discretized using a cut-cell type immersed-boundary method [1, 2, 7].
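To give a feel for how a multi-stage explicit time integrator maps onto OpenACC, here is a minimal sketch. It is not ZFS code. It assumes a much simpler problem (1D linear advection on a periodic grid with first-order upwind fluxes), omits dual time stepping, ghost cells and cut cells, and uses a classic set of 5-stage low-storage Runge-Kutta coefficients chosen for illustration, since the excerpt does not list the ones ZFS uses. The function name `rk5_step` and all parameters are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Illustrative stage coefficients of a 5-stage low-storage Runge-Kutta scheme.
// Assumption: the source does not state which coefficients ZFS uses.
constexpr int kStages = 5;
constexpr double kAlpha[kStages] = {1.0 / 4.0, 1.0 / 6.0, 3.0 / 8.0, 1.0 / 2.0, 1.0};

// Advance the cell averages u by one time step for du/dt + a du/dx = 0 (a > 0)
// on a periodic 1D grid, using first-order upwind fluxes (hypothetical helper).
void rk5_step(std::vector<double>& u, double a, double dx, double dt) {
    assert(a > 0.0 && dx > 0.0 && dt > 0.0);
    const int n = static_cast<int>(u.size());
    const std::vector<double> u0 = u;  // solution at the start of the step
    std::vector<double> rhs(n);        // finite-volume residual per cell

    double* up = u.data();
    const double* u0p = u0.data();
    double* rp = rhs.data();

    // Keep all arrays on the device for the whole step so that the five
    // stages do not trigger host/device transfers. Without an OpenACC
    // compiler the pragmas are ignored and the loops run serially.
    #pragma acc data copy(up[0:n]) copyin(u0p[0:n]) create(rp[0:n])
    {
        for (int s = 0; s < kStages; ++s) {
            const double alpha = kAlpha[s];

            // Residual: net upwind flux through the two faces of each cell.
            #pragma acc parallel loop present(up[0:n], rp[0:n])
            for (int i = 0; i < n; ++i) {
                const int im1 = (i == 0) ? n - 1 : i - 1;  // periodic neighbor
                rp[i] = -(a * up[i] - a * up[im1]) / dx;
            }

            // Low-storage stage update: always restart from u0.
            #pragma acc parallel loop present(up[0:n], u0p[0:n], rp[0:n])
            for (int i = 0; i < n; ++i) {
                up[i] = u0p[i] + alpha * dt * rp[i];
            }
        }
    }
}
```

Calling `rk5_step(u, 1.0, 1.0, 0.5)` repeatedly transports an initial profile to the right. Because the fluxes telescope on a periodic grid, the sum of the cell values is conserved up to round-off, which makes a convenient sanity check when porting such a loop to the GPU.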

Among other topics, scientists have used ZFS to study the flow within an internal combustion engine with moving pistons and valves, as Figure 1 shows. Figure 2 shows how the Lattice-Boltzmann solver in ZFS was used to better understand airflow within the human nasal cavity.

CUDA for ARM Platforms is Now Available

In 2012 alone, over 8.7 billion ARM-based chips were shipped worldwide. Many developers of GPU-accelerated applications are planning to port their applications to ARM platforms, and some have already started. I recently chatted about this with John Stone, the lead developer of VMD, a high performance (and CUDA-accelerated) molecular visualization tool used by researchers all over the world. But first … some exciting news.

To help developers working with ARM-based computing platforms, we are excited to announce the public availability of the CUDA Toolkit version 5.5 Release Candidate (RC) with support for the ARM CPU architecture. This latest release of the CUDA Toolkit supports the following features and functionality on ARM-based platforms:

  • The CUDA C/C++ compiler (nvcc), debugging tools (cuda-gdb and cuda-memcheck), and the command-line profiler (nvprof). (Support for the NVIDIA Visual Profiler and Nsight Eclipse Edition is still to come; for now, I recommend capturing profiling data with nvprof and viewing it in the Visual Profiler.)
  • Native compilation on ARM CPUs, for fast and easy application porting.
  • Fast cross-compilation on x86 CPUs, which reduces development time for large applications by enabling developers to compile ARM code on faster x86 processors, and then deploy the compiled application on the target computer.
  • GPU-accelerated libraries including CUFFT (FFT), CUBLAS (linear algebra), CURAND (random number generation), CUSPARSE (sparse linear algebra), and NPP (NVIDIA Performance Primitives for signal and image processing).
  • Complete documentation, code samples, and more to help developers quickly learn how to take advantage of GPU-accelerated parallel computing on ARM-based systems.


Assess, Parallelize, Optimize, Deploy

When developing an application from scratch, it is feasible to design the code, data structures, and data movement to support accelerators. However, when facing an existing application it is often hard to know where to start, what to expect, and how best to make use of an accelerator like a GPU. Based on our experience helping various application developers accelerate their applications with NVIDIA GPUs, we have documented a process for adding improvements to the code incrementally. It’s not complex, and to some people it may be obvious, but even for experts, writing it down helps to structure the effort and leads to faster results.

The process consists of four stages: Assess, Parallelize, Optimize, Deploy, or APOD, executed in a cycle. Having identified where to start developing, our goal is to realize and deploy a benefit before returning to the first stage, Assess, and adding further improvements.
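One common way to set expectations in the Assess stage is Amdahl's law, which bounds the overall speedup by the fraction of runtime that can be accelerated. The sketch below is only an illustration of that estimate, not part of the documented process text above; the function names and the example numbers are hypothetical.

```cpp
#include <cassert>

// Overall speedup when a fraction `parallel_fraction` of the original runtime
// is made `accel_speedup` times faster and the rest is unchanged (Amdahl's law).
double amdahl_speedup(double parallel_fraction, double accel_speedup) {
    assert(parallel_fraction >= 0.0 && parallel_fraction <= 1.0);
    assert(accel_speedup > 0.0);
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / accel_speedup);
}

// Upper bound on the overall speedup if the accelerated part took no time at all.
double amdahl_limit(double parallel_fraction) {
    assert(parallel_fraction >= 0.0 && parallel_fraction < 1.0);
    return 1.0 / (1.0 - parallel_fraction);
}
```

For instance, if a profile attributes 90% of the runtime to a loop and the GPU runs that loop 10 times faster, `amdahl_speedup(0.9, 10.0)` gives roughly 5.3, and `amdahl_limit(0.9)` shows that no amount of further tuning of that loop alone can exceed 10. Numbers like these help decide whether another pass through the cycle should target the same hotspot or a different one.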
