About Jeff Larkin

Jeff is a member of NVIDIA's Developer Technology group where he specializes in performance analysis and optimization of high performance computing applications. Jeff also represents NVIDIA to the OpenACC and OpenMP organizations. Before joining NVIDIA, Jeff worked in the Cray Supercomputing Center of Excellence, located at Oak Ridge National Laboratory.

7 Powerful New Features in OpenACC 2.0

OpenACC is a high-level programming model for accelerators, such as NVIDIA GPUs, that allows programmers to accelerate applications using compiler directives to specify loops and regions of code in standard C, C++ and Fortran to be offloaded to an accelerator. Through the use of compiler directives, OpenACC allows programmers to maintain a single source code for the CPU and GPU that is portable across a range of accelerators and operating systems. In the past we featured an introductory series of posts on OpenACC as well as several CUDACasts videos—click here to find them.

OpenACC version 2.0 was ratified last year and is now available in multiple commercial compilers, so now is a good time to discuss the new features of OpenACC 2.0.

Function calls within compute regions

OpenACC 1.0 compilers rely on inlining function and subroutine calls within compute regions. This means that unless the compiler can automatically inline a function call, the programmer must manually inline the function. This limitation proved to be difficult for applications, so OpenACC 2.0 introduces the acc routine directive, which instructs the compiler to build a device version of the function or subroutine so that it may be called from a device region. For readers already familiar with CUDA programming, this functionality is similar to the __device__ function specifier. To guide optimization, you can use clauses to tell the compiler whether the routine should be built for gang, workervector, or seq (sequential) level parallelism. You can specify multiple clauses for routines that may be called at multiple levels of parallelism.

#pragma acc routine vector
void foo(float* v, int i, int n) {
  #pragma acc loop vector
  for ( int j=0; j<n; ++j) {
    v[i*n+j] = 1.0f/(i*j);

#pragma acc parallel loop
for ( int i=0; i<n; ++i) {
  //call on the device

In the above C/C++ example, we have specified that the foo routine may be called from the device and that the loop contained within the function contains vector-level parallelism. Continue reading