Getting Started with OpenACC

This week NVIDIA released the NVIDIA OpenACC Toolkit, a starting point for anyone interested in using OpenACC. OpenACC gives scientists and researchers a simple and powerful way to accelerate scientific computing without significant programming effort. The toolkit includes the PGI OpenACC Compiler, the NVIDIA Visual Profiler with CPU and GPU profiling, and the new OpenACC Programming and Best Practices Guide. Academics can get a free renewable license to the PGI C, C++, and Fortran compilers with support for OpenACC.

Figure 1: LS-DALTON: Benchmark on Oak Ridge Titan Supercomputer, AMD CPU vs Tesla K20X GPU. Test input: Alanine-3 on CCSD(T) module. Additional information: NICAM COSMO

OpenACC is an open specification for compiler directives for parallel programming. By using OpenACC, developers can rapidly accelerate existing C, C++, and Fortran applications using high-level directives that help retain application portability across processor architectures. Figure 1 shows some examples of real code speedups with OpenACC. The OpenACC specification is designed and maintained with the cooperation of many industry and academic partners, such as Cray, AMD, PathScale, University of Houston, Oak Ridge National Laboratory and NVIDIA.

When I program with and teach OpenACC, I like to use a four-step cycle to progressively accelerate the code.

  1. Identify Parallelism: Profile the code to understand where the program is spending its time and how much parallelism is available to be accelerated in those important routines. GPUs excel when there’s a significant amount of parallelism to exploit, so look for loops and loop nests with a lot of independent iterations.
  2. Express Parallelism: Placing OpenACC directives on the loops identified in step 1 tells the compiler to parallelize them. OpenACC is all about giving the compiler enough information to effectively accelerate the code, so during this step I add directives to as many loops as I believe I can and move as much of the computation to the GPU as possible.
  3. Express Data Locality: The compiler needs to know not just what code to parallelize, but also which data will be needed on the accelerator by that code. After expressing available parallelism, I often find that the code has slowed down. As you’ll see later in this post, this slowdown comes from the compiler making cautious decisions about when data needs to be moved to the GPU for computation. During this step, I’ll express to the compiler my knowledge of when and how the data is really needed on the GPU.
  4. Optimize: The compiler usually does a very good job accelerating code, but sometimes you can get more performance by giving the compiler a little more information about the loops or by restructuring the code to increase parallelism or improve data access patterns. Most of the time this leads to small improvements, but sometimes gains can be bigger.
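
As a rough illustration of steps 2 and 3, here is a minimal sketch in C++. The function `smooth` and its three-point averaging loop are hypothetical and not taken from the post; they stand in for whatever hot loops a profiler would point to in step 1. The `parallel loop` directives express the parallelism, and the enclosing `data` region tells the compiler that both arrays can stay on the accelerator for all sweeps instead of being copied around every loop. A compiler without OpenACC support simply ignores the pragmas and runs the code serially.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical example: smooth a 1D array in place with a 3-point average,
// keeping the two endpoints fixed. Returns the largest change made by the
// final sweep.
double smooth(std::vector<double>& data, int sweeps) {
    const int n = static_cast<int>(data.size());
    assert(n >= 3 && sweeps >= 1);

    std::vector<double> scratch(n);
    double* a = data.data();
    double* b = scratch.data();
    double max_change = 0.0;

    // Step 3 (data locality): copy `a` to the device once, copy it back once,
    // and allocate `b` on the device only, for the lifetime of this region.
    #pragma acc data copy(a[0:n]) create(b[0:n])
    {
        for (int s = 0; s < sweeps; ++s) {
            max_change = 0.0;

            // Step 2 (parallelism): every iteration is independent; the
            // reduction clause combines the per-iteration maxima.
            #pragma acc parallel loop reduction(max:max_change)
            for (int i = 1; i < n - 1; ++i) {
                b[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0;
                max_change = std::fmax(max_change, std::fabs(b[i] - a[i]));
            }

            // Copy the interior points back into `a` for the next sweep.
            #pragma acc parallel loop
            for (int i = 1; i < n - 1; ++i) {
                a[i] = b[i];
            }
        }
    }
    return max_change;
}
```

Removing the `data` region leaves a program that is still correct but that may move both arrays between host and device around every loop, which is the kind of slowdown described in step 3. Step 4 would start from here, for example by tuning loop clauses once the profiler shows where time is still going.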

Continue reading


Introducing the NVIDIA OpenACC Toolkit

Programmability is crucial to accelerated computing, and NVIDIA’s CUDA Toolkit has been critical to the success of GPU computing. The CUDA Toolkit has been downloaded over 3 million times since its first launch. However, there are many scientists and researchers yet to benefit from GPU computing. These scientists have limited time to learn and apply a parallel programming language, and they often have huge existing code bases that must remain portable across platforms. Today NVIDIA is introducing the new OpenACC Toolkit to help these researchers and scientists achieve science and engineering goals faster.

Over the last few years, OpenACC has established itself as a higher-level approach to GPU acceleration that is simple, powerful, and portable. The membership of the OpenACC organization has grown to include accelerator manufacturers, tools vendors, supercomputing centers, and educational institutions. The OpenACC 2.0 specification significantly expands the functionality and improves the portability of OpenACC and is now available in many commercial tools.

The NVIDIA OpenACC Toolkit provides the tools and documentation that scientists and researchers need to be successful with OpenACC. To remove barriers to academic use, the toolkit includes a free OpenACC compiler for university developers.

The new OpenACC Toolkit includes the following in a single package. Continue reading


CUDACasts Episode 17: Unstructured Data Lifetimes in OpenACC 2.0

The OpenACC 2.0 specification focuses on increasing programmer productivity by addressing limitations of OpenACC 1.0. Previously, programmers were required to use structured code blocks to control when to transfer data to or from the device, which limited the applications that could quickly be accelerated without major code restructuring. It also prevented adding OpenACC directives to handle data movement in the constructors and destructors of C++ classes.

OpenACC 2.0 provides unstructured data lifetime pragmas to make it easier to instruct the compiler to transfer data most efficiently. In today’s CUDACast, I will cover three unstructured data lifetime methods within a single piece of code. Because the example code is fairly long, I’ve uploaded the source to GitHub for you to look at.
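
To make the idea concrete, here is a minimal sketch of unstructured data lifetimes in a C++ class. It is not the code from the CUDACast or the GitHub repository; the class `DeviceArray` and its members are hypothetical. It uses the OpenACC 2.0 `enter data` and `exit data` directives so that the device copy lives exactly as long as the object, plus an `update` directive to refresh the host copy on demand. A compiler without OpenACC support ignores the pragmas and the class behaves as an ordinary host array.

```cpp
#include <cassert>

// Hypothetical wrapper whose device allocation follows the object's lifetime.
class DeviceArray {
public:
    explicit DeviceArray(int n) : n_(n), data_(new double[n]) {
        assert(n > 0);
        // Begin the device lifetime: allocate device memory, no copy yet.
        #pragma acc enter data create(data_[0:n_])
    }

    ~DeviceArray() {
        // End the device lifetime before freeing the host memory.
        #pragma acc exit data delete(data_[0:n_])
        delete[] data_;
    }

    // Copying would need its own device allocation, so it is disabled here.
    DeviceArray(const DeviceArray&) = delete;
    DeviceArray& operator=(const DeviceArray&) = delete;

    // Fill the array on the device; the data is already present there.
    void fill(double value) {
        double* d = data_;
        const int n = n_;
        #pragma acc parallel loop present(d[0:n])
        for (int i = 0; i < n; ++i) {
            d[i] = value;
        }
    }

    // Bring the host copy up to date with the device copy.
    void update_host() {
        #pragma acc update self(data_[0:n_])
    }

    // Sum the host copy; call update_host() first if the device changed it.
    double host_sum() const {
        double total = 0.0;
        for (int i = 0; i < n_; ++i) {
            total += data_[i];
        }
        return total;
    }

    int size() const { return n_; }

private:
    int n_;
    double* data_;
};
```

With structured `data` regions alone, the allocation in the constructor and the release in the destructor could not be expressed, because the two directives sit in different functions and no single code block encloses both.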

Continue reading