*You may want to read *the more* recent post Getting Started with OpenACC by Jeff Larkin.*

In this post I’ll continue where I left off in my introductory post about OpenACC and provide a somewhat more realistic example. This simple C/Fortran code example demonstrates a 2x speedup with the addition of just a few lines of OpenACC directives, and in the next post I’ll add just a few more lines to push that speedup over 4x.

## Example Source Code

You can download all source code from the examples in this post from the Parallel Forall GitHub repository. The directory for this post has subdirectories for each “step” in this post (and the next one) with their own source and Makefiles so you can see and try out the changes in each step (step 1, step 2). The examples use OpenACC 1.0 syntax. You’ll need a compiler with support for OpenACC. I’m using the PGI compiler version 12.2 with preliminary support for OpenACC.

## Jacobi Iteration

Let’s look at a somewhat more interesting example. This is still a simple program, but it gives us a bit more room to explore various OpenACC directives and options. Jacobi iteration is a standard iterative method for finding solutions to a system of linear equations. In this post I’ll take an existing, simple sequential CPU implementation of two-dimensional Jacobi iteration and optimize it using OpenMP and OpenACC. The core computational kernel of this example is shown below. The outer while loop iterates until the solution has converged, by comparing the computed error to a specified error tolerance, `tol`

. For benchmarking purposes, we have set the variable `tol`

low enough that the outer loop always runs for `iter_max`

iterations (1000). The first set of inner nested loops on lines 5 through 11 apply a 2D Laplace operator at each element of a 2D grid, and the ones on lines 14 through 18 copy the output back to the input for the next iteration. For our benchmark, we use a grid of 4096×4096 elements.

while ( error > tol && iter < iter_max ) {
error = 0.f;
#pragma omp parallel for shared(m, n, Anew, A)
for( int j = 1; j < n-1; j++) {
for( int i = 1; i < m-1; i++ ) {
Anew[j][i] = 0.25f * ( A[j][i+1] + A[j][i-1]
+ A[j-1][i] + A[j+1][i]);
error = fmaxf( error, fabsf(Anew[j][i]-A[j][i]));
}
}
#pragma omp parallel for shared(m, n, Anew, A)
for( int j = 0; j < n-1; j++) {
for( int i = 0; i < m-1; i++ ) {
A[j][i] = Anew[j][i];
}
}
if(iter % 100 == 0) printf("%d, %0.6fn", iter, error);
iter++;
}

Continue reading →