An OpenACC Example (Part 1)

In this post I’ll continue where I left off in my introductory post about OpenACC and provide a somewhat more realistic example. This simple C/Fortran code example demonstrates a 2x speedup with the addition of just a few lines of OpenACC directives, and in the next post I’ll add just a few more lines to push that speedup over 4x.

Example Source Code

You can download all source code from the examples in this post from the Parallel Forall GitHub repository. The directory for this post has subdirectories for each “step” in this post (and the next one) with their own source and Makefiles so you can see and try out the changes in each step (step 1step 2). The examples use OpenACC 1.0 syntax. You’ll need a compiler with support for OpenACC.  I’m using the PGI compiler version 12.2 with preliminary support for OpenACC.

Jacobi Iteration

Let’s look at a somewhat more interesting example. This is still a simple program, but it gives us a bit more room to explore various OpenACC directives and options. Jacobi iteration is a standard iterative method for finding solutions to a system of linear equations.  In this post I’ll take an existing, simple sequential CPU implementation of two-dimensional Jacobi iteration and optimize it using OpenMP and OpenACC. The core computational kernel of this example is shown below. The outer while loop iterates until the solution has converged, by comparing the computed error to a specified error tolerance, tol. For benchmarking purposes, we have set the variable tol low enough that the outer loop always runs for iter_max iterations (1000). The first set of inner nested loops on lines 5 through 11 apply a 2D Laplace operator at each element of a 2D grid, and the ones on lines 14 through 18 copy the output back to the input for the next iteration.  For our benchmark, we use a grid of 4096×4096 elements.

while ( error > tol && iter < iter_max ) {
    error = 0.f;

    #pragma omp parallel for shared(m, n, Anew, A)
    for( int j = 1; j < n-1; j++) {
        for( int i = 1; i < m-1; i++ ) {
            Anew[j][i] = 0.25f * ( A[j][i+1] + A[j][i-1]
                                 + A[j-1][i] + A[j+1][i]);
            error = fmaxf( error, fabsf(Anew[j][i]-A[j][i]));
        }
    }

    #pragma omp parallel for shared(m, n, Anew, A)
    for( int j = 0; j < n-1; j++) {
        for( int i = 0; i < m-1; i++ ) {
            A[j][i] = Anew[j][i];   
        }
    }

    if(iter % 100 == 0) printf("%d, %0.6fn", iter, error);

    iter++;
}

Here is the equivalent Fortran code.

do while ( error .gt. tol .and. iter .lt. iter_max )
    error=0.0_fp_kind

!$omp parallel do shared(m, n, Anew,A) reduction( max:error )
    do j=1,m-2
do i=1,n-2
  Anew(i,j) = 0.25_fp_kind *( A(i+1,j  ) + A(i-1,j  ) + &
A(i  ,j-1) + A(i  ,j+1) )
  error = max( error, abs(Anew(i,j)-A(i,j)) )
end do
    end do
!$end omp parallel do

    if(mod(iter,100).eq.0 ) write(*,'(i5,f10.6)'), iter, error
    iter = iter +1

!$omp parallel do shared(m, n, Anew, A)
    do j=1,m-2
do i=1,n-2
  A(i,j) = Anew(i,j)
end do
    end do
!$end omp parallel do

end do

To make sure we’re using all the cores of our CPU for a fair comparison, we use OpenMP parallel for directives (the#pragma statements in the C code and !$omp statements in the Fortran code) on each loop nest. The omp parallelfor directive instructs the compiler to parallelize the following loop using CPU threads (the number of threads is specified in the OMP_NUM_THREADS environment variable). We also use PGI’s -fast compiler option to get the best CPU code optimizations, including generation of CPU vector instructions (e.g. SSE) where possible. This option makes a big difference to performance, and in the case of this code it does not noticeably affect accuracy. The PGI compiler also automatically replaces the innermost copy loop with an optimized memory copy. The following table shows the performance of the C code using 1, 2, and 4 threads on a system with a quad-core Intel Xeon X5550 CPU (2.66 GHz).

Execution Time (s) Speedup vs. 1 CPU Thread
CPU 1 thread 34.14
CPU 2 threads 22.58 1.51x
CPU 4 threads 21.16 1.61x

As you can see, using all 4 cores of the CPU for this code is not 4 times faster. That’s because it is bandwidth bound. I won’t claim that this code is extremely efficient CPU code—it could be optimized, for example by splitting the loops for cache blocking, among other things. However, I think that the code is representative of C code an average scientist or engineer would write. We are at least using OpenMP and an optimizing compiler to make sure we are using all CPU cores and vector instructions, so it makes a good starting point and comparison for applying GPU directives. As we’ll see in Step 2, below, the PGI Accelerator compiler generates cache blocking GPU code for us automatically by using the GPU’s on-chip shared memory, so we don’t have to think about changing our loops.

STEP 1: Our First GPU Directives

Let’s just drop in a single, simple OpenACC directive before each of our for loop nests in the previous code. Just after the #pragma omp lines, we add the following. (We leave the OpenMP directive in place so we retain portability to multicore CPUs.)

#pragma acc kernels

This tells the compiler to generate parallel accelerator kernels (CUDA kernels in our case) for the loop nests following the directive. To compile this code, we tell the PGI compiler to compile OpenACC directives and target NVIDIA GPUs using the -acc -ta=nvidia command line options (-ta=nvidia means “target architecture = NVIDIA GPU”). We can also enable verbose information about the parallelization with the -Minfo=accel option. If you are interested in the details, check out the Appendix below.

Let’s run the program. Our test PC has an Intel Xeon X5550 CPU and an NVIDIA Tesla M2090 GPU. When I run it, I see that it takes about 75 seconds. Uh oh, that’s a bit of a slow-down compared to my parallel CPU code. What’s going on? We can modify the compilation command to enable built-in profiling by specifying -ta=nvidia,time. Now when I run it I get:

total: 78.479634 s

Accelerator Kernel Timing data
/home/mharris/src/parallel-forall/code-samples/posts/002-openacc-example/step1/laplace2d.c
  main
    86: region entered 1000 times
  time(us): total=35877077 init=410 region=35876667
kernels=2212582 data=31135910
  w/o init: total=35876667 max=41541 min=34105 avg=35876
  89: kernel launched 1000 times
grid: [256x256]  block: [16x16]
time(us): total=2212582 max=2311 min=2206 avg=2212
/home/mharris/src/parallel-forall/code-samples/posts/002-openacc-example/step1/laplace2d.c
  main
    74: region entered 1000 times
  time(us): total=42595385 init=2082061 region=40513324
kernels=6278602 data=31385120
  w/o init: total=40513324 max=64020 min=38577 avg=40513
  77: kernel launched 1000 times
grid: [256x256]  block: [16x16]
time(us): total=6165522 max=6181 min=6133 avg=6165
  81: kernel launched 1000 times
grid: [1]  block: [256]
time(us): total=113080 max=221 min=112 avg=113

We can quickly see the problem: both regions (indicated by 86 and 74 above) spend the majority of their time copying data:

    86: region entered 1000 times
  time(us): total=35877077 init=410 region=35876667
kernels=2212582 data=31135910
    74: region entered 1000 times
  time(us): total=42595385 init=2082061 region=40513324
kernels=6278602 data=31385120

Of the 78 or so seconds spent in the parallel regions, we’re spending over 62 seconds (31,135,910us + 31,385,120us) moving data around.  The reason is that the compiler doesn’t know that we are using those arrays repeatedly on the device.  So each time through the outer while loop, when it encounters the acc kernels directives, it must copy data used on the accelerator from the host, and then copy it back after the directive region ends. That means we are copying the A and Anew arrays back and forth twice for each of the 1000 iterations of the outer while loop. There’s no need for this—we don’t need the results until after the while loop exits. Let’s fix it by making the compiler aware of the data we need on the device.

Step 2: Efficient data management

We need to move the data copies outside the while loop, and OpenACC makes this easy. We just put an acc data directive just before the while loop:

#pragma acc data copy(A, Anew)
    while ( error > tol && iter < iter_max )
    {
  // Code inside while loop doesn't change
    }

Here is the equivalent in Fortran.

!$acc data copy(A, Anew)
    do while ( error .gt. tol .and. iter .lt. iter_max )
  ! Code inside the while loop doesn't change
    end do
!$acc end data

The data copy directive tells the compiler to copy the arrays A and Anew to and from the device before and after the while loop runs. There is also initialization overhead of 2 seconds shown in the profile above. We can make sure this is not included in our timing loop (large applications will only initialize the GPU once) by callingacc_init(acc_device_nvidia) at startup.

After making these changes, our GPU code is much faster—with just 3 lines of OpenACC directives and an initialization call we have made our code more than twice as fast by running on a GPU, as shown in this table (Times are for the C version.)

Execution Time (s) Speedup vs. 1 CPU Thread Speedup vs. 4 CPU Threads
CPU 1 thread 34.14
CPU 4 threads (OpenMP) 21.16 1.61x 1.0x
GPU (OpenACC) 9.02 3.78x 2.35x

We Can Do Even Better

We’ve already gotten a speedup with only 3 lines of directives (6 in Fortran), but we can do better with a bit deeper understanding of the OpenACC execution model, and that’s where I’ll pick up in my next Parallel Forall post.

If you’d like to read deeper before then, the OpenACC 1.0 specification and a reference card are available here. If you are interested in trying OpenACC you can download a free trial of PGI Accelerator to try it out.

||∀

Appendix

I’ll first repeat the code from Step 1 complete with directives so the line numbers in the compiler output below make sense.

while ( error > tol && iter < iter_max )
{
    error = 0.f;

#pragma omp parallel for shared(m, n, Anew, A)
#pragma acc kernels
    for( int j = 1; j < n-1; j++)
    {
  for( int i = 1; i < m-1; i++ )
  {
Anew[j][i] = 0.25f * ( A[j][i+1] + A[j][i-1]
   + A[j-1][i] + A[j+1][i]);
error = fmaxf( error, fabsf(Anew[j][i]-A[j][i]));
  }
    }

#pragma omp parallel for shared(m, n, Anew, A)
#pragma acc kernels
    for( int j = 1; j < n-1; j++)
    {
  for( int i = 1; i < m-1; i++ )
  {
A[j][i] = Anew[j][i];
  }
    }

    if(iter % 100 == 0) printf("%5d, %0.6fn", iter, error);

    iter++;
}

When we compile with -Minfo=accel, we see a bunch of information about what the compiler has done.

pgcc -I../common -acc -ta=nvidia -Minfo=accel -o laplace2d_acc laplace2d.c
main:
     74, Generating copyin(A[:4095][:4095])
   Generating copyout(Anew[1:4094][1:4094])
   Generating compute capability 1.0 binary
   Generating compute capability 2.0 binary
     75, Loop is parallelizable
     77, Loop is parallelizable
   Accelerator kernel generated
   75, #pragma acc loop worker, vector(16) /* blockIdx.y threadIdx.y */
   77, #pragma acc loop worker, vector(16) /* blockIdx.x threadIdx.x */
 Cached references to size [18x18] block of 'A'
 CC 1.0 : 16 registers; 1360 shared, 40 constant, 0 local memory bytes; 66% occupancy
 CC 2.0 : 14 registers; 1304 shared, 76 constant, 0 local memory bytes; 100% occupancy
   81, Max reduction generated for error
     86, Generating copyout(A[1:4094][1:4094])
   Generating copyin(Anew[1:4094][1:4094])
   Generating compute capability 1.0 binary
   Generating compute capability 2.0 binary
     87, Loop is parallelizable
     89, Loop is parallelizable
   Accelerator kernel generated
   87, #pragma acc loop worker, vector(16) /* blockIdx.y threadIdx.y */
   89, #pragma acc loop worker, vector(16) /* blockIdx.x threadIdx.x */
 CC 1.0 : 6 registers; 48 shared, 8 constant, 0 local memory bytes; 100% occupancy
 CC 2.0 : 12 registers; 8 shared, 56 constant, 0 local memory bytes; 100% occupancy

The numbers starting some of the lines are line numbers corresponding to the input code. Line 74 is the #pragma acc kernels directive, and you can see that the compiler has figured out what data it needs to copy from the host (CPU) to the device (GPU) when entering the region (the array A), and what it needs to copy from the device to the host when leaving it (the array Anew). The compiler reports that the loop nests starting on lines 75 and 77 are both parallelizable, and that two accelerator kernels (that is, functions to run on the GPU) were generated. Both pairs of loops are vectorized with width 16. This processes the 4096×4096 grid in blocks of 16×16 elements, which allows the blocks to be cached in the GPU’s fast shared memory (it caches 18×18 blocks of A due to the dependencies of elements on their neighbors).

The compiler also indicates that a max reduction was generated for the variable error. A reduction is a parallel algorithm for producing a single output value from a vector of input values. Note that technically we should have used an explicit reduction clause on directive for the first loop nest, but the early implementation of OpenACC in the PGI compiler I used when writing this post did not yet support it (but the compiler is clearly able to auto-detect the reduction). In this case we need the maximum of all the computed error values, and the OpenACC compiler automatically generates efficient parallel code for this.

 

∥∀

About Mark Harris

Mark is Chief Technologist for GPU Computing Software at NVIDIA. Mark has fifteen years of experience developing software for GPUs, ranging from graphics and games, to physically-based simulation, to parallel algorithms and high-performance computing. Mark has been using GPUs for general-purpose computing since before they even supported floating point arithmetic. While a Ph.D. student at UNC he recognized this nascent trend and coined a name for it: GPGPU (General-Purpose computing on Graphics Processing Units), and started GPGPU.org to provide a forum for those working in the field to share and discuss their work. Follow @harrism on Twitter