About Jeff Larkin

Jeff is a member of NVIDIA's Developer Technology group where he specializes in performance analysis and optimization of high performance computing applications. Jeff also represents NVIDIA to the OpenACC and OpenMP organizations. Before joining NVIDIA, Jeff worked in the Cray Supercomputing Center of Excellence, located at Oak Ridge National Laboratory.

3 Versatile OpenACC Interoperability Techniques

OpenACC is a high-level programming model for accelerating applications with GPUs and other devices. It uses compiler directives to specify loops and regions of code in standard C, C++ and Fortran to offload from a host CPU to an attached accelerator. An often-overlooked feature of OpenACC is its ability to interoperate with the broader parallel programming ecosystem. In this post I’ll teach you 3 powerful interoperability techniques for combining OpenACC and CUDA: the host_data construct, the deviceptr clause, and the acc_map_data() API function.

I’ll demonstrate these techniques with several examples of mixing OpenACC with CUDA C++, CUDA Fortran, Thrust, and GPU-accelerated libraries. If you’d like to follow along at home, grab the source code for the examples from GitHub and try them out with your OpenACC compiler and the CUDA Toolkit. Don’t have an OpenACC compiler? You can download a free 30-day trial of the PGI accelerator compiler.

You may already be thinking to yourself, “If OpenACC is so great, why would I need to use it with CUDA?” OpenACC interoperability features open the door to the GPU-computing ecosystem, allowing you to leverage more than 10 years of code development. Need to multiply two matrices together? Don’t write your own function, just call the cuBLAS library, which has been heavily optimized for GPUs. Does your colleague already have a CUDA routine that you could use in your code? Use it! Interoperability means that you can always use the best tool for the job in any situation. Accelerate your application using OpenACC, but call an optimized library. Expand an existing CUDA application by adding OpenACC to unaccelerated routines. Your choice isn’t OpenACC or CUDA, it’s OpenACC and CUDA.
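To make two of the techniques named above concrete, here is a minimal sketch of host_data and deviceptr working together. It is not taken from the post's examples. The routine saxpy_dev is a hypothetical stand-in for a CUDA kernel launcher or a library call that expects device pointers, and saxpy is a hypothetical wrapper around it. I've assumed the arrays are managed by an OpenACC data region.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a CUDA or library routine: it expects x and y
   to already be device pointers, which the deviceptr clause states. */
void saxpy_dev(int n, float a, const float *x, float *y) {
  #pragma acc parallel loop deviceptr(x, y)
  for (int i = 0; i < n; ++i) {
    y[i] = a * x[i] + y[i];
  }
}

/* Hypothetical wrapper: OpenACC manages the data movement, and host_data
   exposes the device addresses of x and y to the callee. */
void saxpy(int n, float a, const float *x, float *y) {
  assert(n >= 0 && x != NULL && y != NULL);
  #pragma acc data copyin(x[0:n]) copy(y[0:n])
  {
    #pragma acc host_data use_device(x, y)
    {
      saxpy_dev(n, a, x, y);
    }
  }
}
```

In a real application the call inside the host_data region would typically be to something like a cuBLAS function or a hand-written CUDA launcher. A compiler without OpenACC support ignores the pragmas, in which case the same code simply runs on the host.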


7 Powerful New Features in OpenACC 2.0

OpenACC is a high-level programming model for accelerators, such as NVIDIA GPUs, that allows programmers to accelerate applications using compiler directives to specify loops and regions of code in standard C, C++ and Fortran to be offloaded to an accelerator. Through the use of compiler directives, OpenACC allows programmers to maintain a single source code for the CPU and GPU that is portable across a range of accelerators and operating systems. In the past we featured an introductory series of posts on OpenACC as well as several CUDACasts videos.

OpenACC version 2.0 was ratified last year and is now available in multiple commercial compilers, so now is a good time to discuss the new features of OpenACC 2.0.

Function calls within compute regions

OpenACC 1.0 compilers rely on inlining function and subroutine calls within compute regions. This means that unless the compiler can automatically inline a function call, the programmer must manually inline the function. This limitation proved to be difficult for applications, so OpenACC 2.0 introduces the acc routine directive, which instructs the compiler to build a device version of the function or subroutine so that it may be called from a device region. For readers already familiar with CUDA programming, this functionality is similar to the __device__ function specifier. To guide optimization, you can use clauses to tell the compiler whether the routine should be built for gang, worker, vector, or seq (sequential) level parallelism. You can specify multiple clauses for routines that may be called at multiple levels of parallelism.

#pragma acc routine vector
void foo(float* v, int i, int n) {
  #pragma acc loop vector
  for ( int j=0; j<n; ++j) {
    v[i*n+j] = 1.0f/(i*j);
  }
}

#pragma acc parallel loop
for ( int i=0; i<n; ++i) {
  foo(v,i,n);
  //call on the device
}

In the above C/C++ example, we have specified that the foo routine may be called from the device and that the loop contained within the function contains vector-level parallelism.
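As a complement, here is a small sketch of how routines built for different levels might be combined. The names safe_recip, fill_row and fill_matrix are hypothetical and not from the original example. I've also assumed a guard against a zero product, which the example above does not have. A seq routine is called from a vector routine, which in turn is called from a gang-parallel loop.

```c
#include <assert.h>
#include <stddef.h>

/* Sequential device routine: each call runs in a single thread.
   The zero guard is an assumption added for this sketch. */
#pragma acc routine seq
float safe_recip(int i, int j) {
  int d = i * j;
  return d == 0 ? 0.0f : 1.0f / (float)d;
}

/* Vector-level routine: the inner loop is spread across vector lanes. */
#pragma acc routine vector
void fill_row(float *v, int i, int n) {
  #pragma acc loop vector
  for (int j = 0; j < n; ++j) {
    v[i * n + j] = safe_recip(i, j);
  }
}

/* Host entry point: rows are distributed across gangs, and each gang
   calls the vector routine for its row. */
void fill_matrix(float *v, int n) {
  assert(v != NULL && n >= 0);
  #pragma acc parallel loop gang copyout(v[0:n*n])
  for (int i = 0; i < n; ++i) {
    fill_row(v, i, n);
  }
}
```

Calling fill_matrix(v, n) on an n*n array fills it with 1/(i*j), using 0 wherever i or j is zero. Without an OpenACC compiler the pragmas are ignored and the same code runs sequentially on the host.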