7 Powerful New Features in OpenACC 2.0

OpenACC is a high-level programming model for accelerators, such as NVIDIA GPUs, that allows programmers to accelerate applications using compiler directives to specify loops and regions of code in standard C, C++ and Fortran to be offloaded to an accelerator. Through the use of compiler directives, OpenACC allows programmers to maintain a single source code for the CPU and GPU that is portable across a range of accelerators and operating systems. In the past we featured an introductory series of posts on OpenACC as well as several CUDACasts videos—click here to find them.

OpenACC version 2.0 was ratified last year and is now available in multiple commercial compilers, so now is a good time to discuss the new features of OpenACC 2.0.

Function calls within compute regions

OpenACC 1.0 compilers rely on inlining function and subroutine calls within compute regions. This means that unless the compiler can automatically inline a function call, the programmer must manually inline the function. This limitation proved to be difficult for applications, so OpenACC 2.0 introduces the acc routine directive, which instructs the compiler to build a device version of the function or subroutine so that it may be called from a device region. For readers already familiar with CUDA programming, this functionality is similar to the __device__ function specifier. To guide optimization, you can use clauses to tell the compiler whether the routine should be built for gang, workervector, or seq (sequential) level parallelism. You can specify multiple clauses for routines that may be called at multiple levels of parallelism.

#pragma acc routine vector
void foo(float* v, int i, int n) {
  #pragma acc loop vector
  for ( int j=0; j<n; ++j) {
    v[i*n+j] = 1.0f/(i*j);

#pragma acc parallel loop
for ( int i=0; i<n; ++i) {
  //call on the device

In the above C/C++ example, we have specified that the foo routine may be called from the device and that the loop contained within the function contains vector-level parallelism. Continue reading


CUDACasts Episode 17: Unstructured Data Lifetimes in OpenACC 2.0

The OpenACC 2.0 specification focuses on increasing programmer productivity by addressing limitations of OpenACC 1.0. Previously, programmers were required to use structured code blocks to control when to transfer data to or from the device, which limited the applications that could quickly be accelerated without major code restructuring. It also prevented adding OpenACC directives to handle data movement in the constructors and destructors of C++ classes.

OpenACC 2.0 provides unstructured data lifetime pragmas to make it easier to instruct the compiler to transfer data most efficiently. In today’s CUDACast, I will cover three unstructured data lifetime methods within a single piece of code. Because the example code is fairly long, I’ve uploaded the source to GitHub for you to look at.

Continue reading


CUDACasts Episode #3: Your First OpenACC Program

In the last episode of CUDACasts, we wrote our first accelerated program using CUDA C. In this episode, we will explore an alternate method of accelerating code by using OpenACC directives. These directives give hints to the compiler on how to accelerate sections of code, without having to write CUDA code or change the underlying source.

The algorithm we’ll be accelerating is the Jacobi iteration; you can get a copy of the OpenACC accelerated code from GitHub.

The video presents the typical process for accelerating code with OpenACC. Continue reading

Benchmarking CUDA-Aware MPI

I introduced CUDA-aware MPI in my last post, with an introduction to MPI and a description of the functionality and benefits of CUDA-aware MPI. In this post I will demonstrate the performance of MPI through both synthetic and realistic benchmarks.

Synthetic MPI Benchmark results

Since you now know why CUDA-aware MPI is more efficient from a theoretical perspective, let’s take a look at the results of MPI bandwidth and latency benchmarks. These benchmarks measure the run time for sending messages of increasing size from a buffer associated with MPI rank 0 to a buffer associated with MPI rank 1. Using MVAPICH2-1.9b I have measured the following bandwidths and latencies between two Tesla K20 GPUs installed in two nodes connected with FDR infiniband. I have included host-to-host MPI bandwidth results as a reference. The measured latencies for 1 byte messages are 19 microseconds for regular MPI, 18 microseconds for CUDA-aware MPI with GPUDirect accelerated communication with network and storage devices, and 1 microsecond for host-to-host communication. The peak bandwidths for the 3 cases are 6.19 GB/s for host-to-host transfers, 4.18 GB/s for device-to-device transfers with MVAPICH2-1.9b and GPUDirect, and 1.89 GB/s for device-to-device transfers with staging through host memory.

MPIbandwidth Continue reading

An Introduction to CUDA-Aware MPI

MPI, the Message Passing Interface, is a standard API for communicating data via messages between distributed processes that is commonly used in HPC to build applications that can scale to multi-node computer clusters. As such, MPI is fully compatible with CUDA, which is designed for parallel computing on a single computer or node. There are many reasons for wanting to combine the two parallel programming approaches of MPI and CUDA. A common reason is to enable solving problems with a data size too large to fit into the memory of a single GPU, or that would require an unreasonably long compute time on a single node. Another reason is to accelerate an existing MPI application with GPUs or to enable an existing single-node multi-GPU application to scale across multiple nodes. With CUDA-aware MPI these goals can be achieved easily and efficiently. In this post I will explain how CUDA-aware MPI works, why it is efficient, and how you can use it.

I will be presenting a talk on CUDA-Aware MPI at the GPU Technology Conference next Wednesday at 4:00 pm in room 230C, so come check it out!

A Very Brief Introduction to MPI

Before I explain what CUDA-aware MPI is all about, let’s quickly introduce MPI for readers who are not familiar with it. The processes involved in an MPI program have private address spaces, which allows an MPI program to run on a system with a distributed memory space, such as a cluster. The MPI standard defines a message-passing API which covers point-to-point messages as well as collective operations like reductions. The example below shows the source code of a very simple MPI program in C which sends the message “Hello, there” from process 0 to process 1. Note that in MPI a process is usually called a “rank”, as indicated by the call to MPI_Comm_rank() below.

#include <stdio.h>
#include <string.h>
#include <mpi.h>

int main(int argc, char *argv[])
    char message[20];
    int myrank, tag=99;
    MPI_Status status;

    /* Initialize the MPI library */
    MPI_Init(&argc, &argv);
    /* Determine unique id of the calling process of all processes participating
       in this MPI program. This id is usually called MPI rank. */
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    if (myrank == 0) {
        strcpy(message, "Hello, there");
        /* Send the message "Hello, there" from the process with rank 0 to the
           process with rank 1. */
        MPI_Send(message, strlen(message)+1, MPI_CHAR, 1, tag, MPI_COMM_WORLD);
    } else {
        /* Receive a message with a maximum length of 20 characters from process
           with rank 0. */
        MPI_Recv(message, 20, MPI_CHAR, 0, tag, MPI_COMM_WORLD, &status);
        printf("received %s\n", message);

    /* Finalize the MPI library to free resources acquired by it. */
    return 0;

This program can be compiled and linked with the compiler wrappers provided by the MPI implementation. Continue reading

In the Trenches at GTC: Programming GPUs with OpenACC

By Adnan Boz (GTC 2012 Guest Blogger)

It’s my first day at the GPU Technology Conference and I’ve already had the opportunity to meet gurus like Mark Harris (Chief Technologist, GPU Computing, NVIDIA, and founder of and learn about the latest advancements in the GPU and HPC arena from people like NVIDIA’s Will Ramey and Duncan Poole.

One of the hot topics so far is OpenACC, an open GPU directives standard that makes GPU programming straightforward and portable across parallel and multi-core processors (see: Continue reading

In the Trenches at GTC: Languages, APIs and Development Tools for GPU Computing

By Michael Wang, The University Of Melbourne, Australia (GTC ’12 Guest Blogger)

It’s 9 am, the first morning session of the pre-conference Tutorial Day. The atmosphere in the room is one of quiet anticipation. NVIDIA’s Will Ramey takes the stage and says: “this is going to be a great week.”

I couldn’t agree more. A quick show of hands reveals that more than 90% of the 200-strong audience had used CUDA in the past week. The prophetic words of Jack Dongarra aptly sum up why we are all here:

GPUs have evolved to the point where many real-world applications are easily implemented on them and run significantly faster than on multi-core systems. Future computing architectures will be hybrid systems with parallel-core GPUs working in tandem with multi-core CPUs.

And things couldn’t be easier if you consider the three broad categories of tools available to you today: Continue reading

An OpenACC Example (Part 2)

In my previous post I added 3 lines of OpenACC directives to a Jacobi iteration code, achieving more than 2x speedup by running it on a GPU. In this post I’ll continue where I left off and demonstrate how we can use OpenACC directives clauses to take more explicit control over how the compiler parallelizes our code. This will provide us with significantly higher speedup.

Example Source Code

You can browse and download all source code from the examples in this post from the Parallel Forall GitHub repository. The directory for this post has subdirectories for each “step” in this post (and the next one) with their own source and Makefiles so you can see and try out the changes in each step (step 1step 2step 3). The examples use OpenACC 1.0 syntax, so you’ll need a compiler that supports it.  I’m using the PGI compiler version 12.2 with preliminary support for OpenACC.

Let’s pick up where we left off last week. Continue reading

An OpenACC Example (Part 1)

In this post I’ll continue where I left off in my introductory post about OpenACC and provide a somewhat more realistic example. This simple C/Fortran code example demonstrates a 2x speedup with the addition of just a few lines of OpenACC directives, and in the next post I’ll add just a few more lines to push that speedup over 4x.

Example Source Code

You can download all source code from the examples in this post from the Parallel Forall GitHub repository. The directory for this post has subdirectories for each “step” in this post (and the next one) with their own source and Makefiles so you can see and try out the changes in each step (step 1step 2). The examples use OpenACC 1.0 syntax. You’ll need a compiler with support for OpenACC.  I’m using the PGI compiler version 12.2 with preliminary support for OpenACC.

Jacobi Iteration

Let’s look at a somewhat more interesting example. This is still a simple program, but it gives us a bit more room to explore various OpenACC directives and options. Jacobi iteration is a standard iterative method for finding solutions to a system of linear equations.  In this post I’ll take an existing, simple sequential CPU implementation of two-dimensional Jacobi iteration and optimize it using OpenMP and OpenACC. The core computational kernel of this example is shown below. The outer while loop iterates until the solution has converged, by comparing the computed error to a specified error tolerance, tol. For benchmarking purposes, we have set the variable tol low enough that the outer loop always runs for iter_max iterations (1000). The first set of inner nested loops on lines 5 through 11 apply a 2D Laplace operator at each element of a 2D grid, and the ones on lines 14 through 18 copy the output back to the input for the next iteration.  For our benchmark, we use a grid of 4096×4096 elements.

while ( error > tol && iter < iter_max ) {
    error = 0.f;

    #pragma omp parallel for shared(m, n, Anew, A)
    for( int j = 1; j < n-1; j++) {
        for( int i = 1; i < m-1; i++ ) {
            Anew[j][i] = 0.25f * ( A[j][i+1] + A[j][i-1]
                                 + A[j-1][i] + A[j+1][i]);
            error = fmaxf( error, fabsf(Anew[j][i]-A[j][i]));

    #pragma omp parallel for shared(m, n, Anew, A)
    for( int j = 0; j < n-1; j++) {
        for( int i = 0; i < m-1; i++ ) {
            A[j][i] = Anew[j][i];   

    if(iter % 100 == 0) printf("%d, %0.6fn", iter, error);


Continue reading

OpenACC: Directives for GPUs

NVIDIA has made a lot of progress with CUDA over the past five years; we estimate that there are over 150,000 CUDA developers, and important science is being accomplished with the help of CUDA. But we have a long way to go to help everyone benefit from GPU computing. There are many programmers who can’t afford the time to learn and apply a parallel programming language. Others, many of them scientists and engineers working with huge existing code bases, can only make changes to their code that are portable across hardware and OS, and that benefit performance on multiple platforms.

This class of developers needs a higher-level approach to GPU acceleration. They need something that is easy, powerful, portable, and open. This is the motivation behind OpenACC, an open standard defining a collection of compiler directives that specify loops and regions of code in standard C, C++, and Fortran to be offloaded from a host CPU to an attached parallel accelerator, such as an NVIDIA GPU. Directives are simple (typically a single line) hints provided to the compiler. If you are a C or C++ programmer, you are probably familiar with #pragma directives.


Here is a really simple example using OpenACC. This loop performs a SAXPY operation. SAXPY stands for Single-precision A times X Plus YA is a scalar value and X and Y are vectors, so this is a vector scale and add operation. Here is a simple SAXPY in C with an OpenACC directive to parallelize it.

void saxpy_parallel(int n,
  float a,
  float *x,
  float *restrict y)
#pragma acc kernels
  for (int i = 0; i < n; ++i)
    y[i] = a*x[i] + y[i];

Here’s the equivalent in Fortran. Continue reading