Introducing the NVIDIA OpenACC Toolkit

Programmability is crucial to accelerated computing, and NVIDIA’s CUDA Toolkit has been critical to the success of GPU computing. Over 3 million CUDA Toolkits have been downloaded since its first launch. However there are many scientists and researchers yet to benefit from GPU computing. These scientists have limited time to learn and apply a parallel programming language, and they often have huge existing code bases that must remain portable across platforms. Today NVIDIA is introducing the new OpenACC Toolkit to help these researchers and scientists achieve science and engineering goals faster.

Over the last few years OpenACC has established itself as a higher-level approach to GPU acceleration that is simple, powerful, and portable. The membership of the OpenACC organization has grown to include accelerator manufacturers, tools vendors, supercomputing centers and education institutions. The OpenACC 2.0 specification significantly expands the functionality and improves the portability of OpenACC and is now available in many commercial tools.

The NVIDIA OpenACC toolkit provides the tools and documentation that scientists and researchers need to be successful with OpenACC. The toolkit includes a free OpenACC compiler for university developers to remove any barriers for use by academics.

The new OpenACC Toolkit includes the following in a single package. Continue reading


12 GTC 2015 Sessions Not to Miss

With one week to go until we all descend on GTC 2015, I’ve scoured through the list of Accelerated Computing sessions and put together 12 diverse “not to miss” talks you should add to your planner. This year, the conference is highlighting the revolution in Deep Learning that will affect every aspect of computing. GTC 2015 includes over 40 session categories, including deep learning and machine learning, scientific visualization, cloud computing, and HPC.

This is the place where scientists, programmers, researchers, and a
myriad of creative professionals convene to tap into the power of a GPU
for more than gaming. –Forbes

Tuesday, March 17

An Introduction to CUDA Programming (S5661)


This is the introductory tutorial intended for those new to CUDA and you will leave with the essential knowledge to start programming in CUDA – no experience is needed! For those that have prior CUDA experience, this is a great session to brush up on key concepts required for subsequent tutorials on CUDA optimization. The other tutorials in this session are: An Introduction to the GPU Memory ModelAsynchronous Operations and Dynamic Parallelism in CUDA and Essential CUDA Optimization Techniques.

GTC attendees learn from the brightest minds in accelerated computing with hundreds of talks and hands-on tutorials.
GTC attendees learn from the brightest minds in accelerated computing with hundreds of talks and hands-on tutorials.

SMTool: A GPU based Satellite Image Analysis Tool (S5201)


Dilip Patlolla, R&D Engineer in the Geographic Information Science and Technology (GIST) Group at the Oak Ridge National Laboratory, will demonstrate their advanced satellite image analytic tool referred as SMTool built on the CUDA platform to process city-scale sub-meter resolution satellite imagery to detect and discriminate man-made structures. Continue reading

Do More, Code Less with ArrayFire GPU Matrix Library

arrayfire_logo2This is a guest post by Chris McClanahan from ArrayFire (formerly AccelerEyes).

ArrayFire is a fast and easy-to-use GPU matrix library developed by ArrayFire. ArrayFire wraps GPU memory into a simple “array” object, enabling developers to process vectors, matrices, and volumes on the GPU using high-level routines, without having to get involved with device kernel code.

ArrayFire Feature Highlights

  • ArrayFire provides a high-level array notation and an extensive set of functions for easily manipulating N-dimensional GPU data.
  • ArrayFire provides all basic arithmetic operations (element-wise arithmetic, trigonometric, logical operations, etc.), higher-level primitives (reductions, matrix multiply, set operations, sorting, etc.), and even domain-specific functions (image and signal processing, linear algebra, etc.).
  • ArrayFire can be used as a self-contained library, or integrated into and supplement existing CUDA code. The array object can wrap data from CUDA device pointers and existing CPU memory.
  • ArrayFire contains built-in graphics functions for data visualization. The graphics library in ArrayFire provides easy rendering of 2D and 3D data, and leverages CUDA OpenGL interoperation, so visualization is fast and efficient. Various visualization algorithms make easy to explore complex data.
  • ArrayFire offers a unique “gfor” construct that can drastically speed up conventional “for” loops over data. The gfor loop essentially auto-vectorizes the code inside, and executes all iterations of the loop simultaneously.
  • ArrayFire supports C, C++, and Fortran on top of the CUDA platform.
  • ArrayFire is built on top of a custom just-in-time (JIT) compiler for efficient GPU memory usage. The JIT back-end in ArrayFire automatically combines many operations behind the scenes, and executes them in batches to minimize GPU kernel launches.
  • ArrayFire strives to include only the best performing code in the ArrayFire library. This means that the ArrayFire library uses existing implementations of functions when they are faster—such as Thrust for sorting, CULA for linear algebra, and CUFFT for fft. Continue reading

Six Ways to SAXPY

This post is a GPU program chrestomathy. What’s a Chrestomathy, you ask?

In computer programming, a program chrestomathy is a collection of similar programs written in various programming languages, for the purpose of demonstrating differences in syntax, semantics and idioms for each language. [Wikipedia]

There are several good examples of program chrestomathies on the web, including Rosetta Code andNBabel, which demonstrates gravitational N-body simulation in multiple programming languages. In this post I demonstrate six ways to implement a simple SAXPY computation on the CUDA platform. Why is this interesting? Because it demonstrates the breadth of options you have today for programming NVIDIA GPUs, and it covers the three main approaches to GPU computing: GPU-accelerated libraries, GPU compiler directives, and GPU programming languages.

SAXPY stands for “Single-Precision A·X Plus Y”.  It is a function in the standard Basic Linear Algebra Subroutines (BLAS)library. SAXPY is a combination of scalar multiplication and vector addition, and it’s very simple: it takes as input two vectors of 32-bit floats X and Y with N elements each, and a scalar value A. It multiplies each element X[i] by A and adds the result to Y[i]. A simple C implementation looks like this.

void saxpy(int n, float a, float *x, float *y)
  for (int i = 0; i < n; ++i)
      y[i] = a*x[i] + y[i];

// Perform SAXPY on 1M elements
saxpy(1<<20, 2.0, x, y);

Continue reading

An OpenACC Example (Part 2)

In my previous post I added 3 lines of OpenACC directives to a Jacobi iteration code, achieving more than 2x speedup by running it on a GPU. In this post I’ll continue where I left off and demonstrate how we can use OpenACC directives clauses to take more explicit control over how the compiler parallelizes our code. This will provide us with significantly higher speedup.

Example Source Code

You can browse and download all source code from the examples in this post from the Parallel Forall GitHub repository. The directory for this post has subdirectories for each “step” in this post (and the next one) with their own source and Makefiles so you can see and try out the changes in each step (step 1step 2step 3). The examples use OpenACC 1.0 syntax, so you’ll need a compiler that supports it.  I’m using the PGI compiler version 12.2 with preliminary support for OpenACC.

Let’s pick up where we left off last week. Continue reading

An OpenACC Example (Part 1)

In this post I’ll continue where I left off in my introductory post about OpenACC and provide a somewhat more realistic example. This simple C/Fortran code example demonstrates a 2x speedup with the addition of just a few lines of OpenACC directives, and in the next post I’ll add just a few more lines to push that speedup over 4x.

Example Source Code

You can download all source code from the examples in this post from the Parallel Forall GitHub repository. The directory for this post has subdirectories for each “step” in this post (and the next one) with their own source and Makefiles so you can see and try out the changes in each step (step 1step 2). The examples use OpenACC 1.0 syntax. You’ll need a compiler with support for OpenACC.  I’m using the PGI compiler version 12.2 with preliminary support for OpenACC.

Jacobi Iteration

Let’s look at a somewhat more interesting example. This is still a simple program, but it gives us a bit more room to explore various OpenACC directives and options. Jacobi iteration is a standard iterative method for finding solutions to a system of linear equations.  In this post I’ll take an existing, simple sequential CPU implementation of two-dimensional Jacobi iteration and optimize it using OpenMP and OpenACC. The core computational kernel of this example is shown below. The outer while loop iterates until the solution has converged, by comparing the computed error to a specified error tolerance, tol. For benchmarking purposes, we have set the variable tol low enough that the outer loop always runs for iter_max iterations (1000). The first set of inner nested loops on lines 5 through 11 apply a 2D Laplace operator at each element of a 2D grid, and the ones on lines 14 through 18 copy the output back to the input for the next iteration.  For our benchmark, we use a grid of 4096×4096 elements.

while ( error > tol && iter < iter_max ) {
    error = 0.f;

    #pragma omp parallel for shared(m, n, Anew, A)
    for( int j = 1; j < n-1; j++) {
        for( int i = 1; i < m-1; i++ ) {
            Anew[j][i] = 0.25f * ( A[j][i+1] + A[j][i-1]
                                 + A[j-1][i] + A[j+1][i]);
            error = fmaxf( error, fabsf(Anew[j][i]-A[j][i]));

    #pragma omp parallel for shared(m, n, Anew, A)
    for( int j = 0; j < n-1; j++) {
        for( int i = 0; i < m-1; i++ ) {
            A[j][i] = Anew[j][i];   

    if(iter % 100 == 0) printf("%d, %0.6fn", iter, error);


Continue reading

OpenACC: Directives for GPUs

NVIDIA has made a lot of progress with CUDA over the past five years; we estimate that there are over 150,000 CUDA developers, and important science is being accomplished with the help of CUDA. But we have a long way to go to help everyone benefit from GPU computing. There are many programmers who can’t afford the time to learn and apply a parallel programming language. Others, many of them scientists and engineers working with huge existing code bases, can only make changes to their code that are portable across hardware and OS, and that benefit performance on multiple platforms.

This class of developers needs a higher-level approach to GPU acceleration. They need something that is easy, powerful, portable, and open. This is the motivation behind OpenACC, an open standard defining a collection of compiler directives that specify loops and regions of code in standard C, C++, and Fortran to be offloaded from a host CPU to an attached parallel accelerator, such as an NVIDIA GPU. Directives are simple (typically a single line) hints provided to the compiler. If you are a C or C++ programmer, you are probably familiar with #pragma directives.


Here is a really simple example using OpenACC. This loop performs a SAXPY operation. SAXPY stands for Single-precision A times X Plus YA is a scalar value and X and Y are vectors, so this is a vector scale and add operation. Here is a simple SAXPY in C with an OpenACC directive to parallelize it.

void saxpy_parallel(int n,
  float a,
  float *x,
  float *restrict y)
#pragma acc kernels
  for (int i = 0; i < n; ++i)
    y[i] = a*x[i] + y[i];

Here’s the equivalent in Fortran. Continue reading