An OpenACC Example (Part 2)

In my previous post I added 3 lines of OpenACC directives to a Jacobi iteration code, achieving more than 2x speedup by running it on a GPU. In this post I’ll continue where I left off and demonstrate how we can use OpenACC directives clauses to take more explicit control over how the compiler parallelizes our code. This will provide us with significantly higher speedup.

Example Source Code

You can browse and download all source code from the examples in this post from the Parallel Forall GitHub repository. The directory for this post has subdirectories for each “step” in this post (and the next one) with their own source and Makefiles so you can see and try out the changes in each step (step 1step 2step 3). The examples use OpenACC 1.0 syntax, so you’ll need a compiler that supports it.  I’m using the PGI compiler version 12.2 with preliminary support for OpenACC.

Let’s pick up where we left off last week.

STEP 3: Tuning Parallelization Configuration

To refresh, this is what our C code looks like after step 2 from the last post.

#pragma acc data copy(A, Anew)
    while ( error > tol && iter < iter_max )
    {
  error = 0.f;

#pragma omp parallel for shared(m, n, Anew, A)
#pragma acc kernels
  for( int j = 1; j < n-1; j++)
  {
for( int i = 1; i < m-1; i++ )
{
    Anew[j][i] = 0.25f * ( A[j][i+1] + A[j][i-1]
 + A[j-1][i] + A[j+1][i]);
    error = fmaxf( error, fabsf(Anew[j][i]-A[j][i]));
}
  }

#pragma omp parallel for shared(m, n, Anew, A)
#pragma acc kernels
  for( int j = 1; j < n-1; j++)
  {
for( int i = 1; i < m-1; i++ )
{
    A[j][i] = Anew[j][i];    
}
  }

  if(iter % 100 == 0) printf("%5d, %0.6fn", iter, error);

  iter++;
    }

    double runtime = GetTimer();

    printf(" total: %f sn", runtime / 1000.f);
}

Here’s the Fortran code.

!$acc data copy(A, Anew)
  do while ( error .gt. tol .and. iter .lt. iter_max )
    error=0.0_fp_kind

!$omp parallel do shared(m, n, Anew, A) reduction( max:error )
!$acc kernels
    do j=1,m-2
do i=1,n-2
  Anew(i,j) = 0.25_fp_kind * ( A(i+1,j  ) + A(i-1,j  ) + &
 A(i  ,j-1) + A(i  ,j+1) )
  error = max( error, abs(Anew(i,j)-A(i,j)) )
end do
    end do
!$acc end kernels
!$omp end parallel do

    if(mod(iter,100).eq.0 ) write(*,'(i5,f10.6)'), iter, error
    iter = iter +1

!$omp parallel do shared(m, n, Anew, A)
!$acc kernels
    do j=1,m-2
do i=1,n-2
  A(i,j) = Anew(i,j)
end do
    end do
!$acc end kernels
!$omp end parallel do

  end do
!$acc end data

To begin, let’s run the C code from step 2 on the GPU again, this time enabling -ta=nvidia,time. We see output like the following from the PGI compiler.

total: 8.940413 s

Accelerator Kernel Timing data
/home/mharris/src/parallel-forall/code-samples/posts/002-openacc-example/step2/laplace2d.c
  main
    92: region entered 1000 times
  time(us): total=2336562 init=298 region=2336264
kernels=2295838 data=0
  w/o init: total=2336264 max=2441 min=2318 avg=2336
  95: kernel launched 1000 times
grid: [256x256]  block: [16x16]
time(us): total=2295838 max=2301 min=2293 avg=2295
/home/mharris/src/parallel-forall/code-samples/posts/002-openacc-example/step2/laplace2d.c
  main
    80: region entered 1000 times
  time(us): total=6489988 init=190 region=6489798
kernels=6184945 data=0
  w/o init: total=6489798 max=6960 min=6397 avg=6489
  83: kernel launched 1000 times
grid: [256x256]  block: [16x16]
time(us): total=6072138 max=6186 min=6012 avg=6072
  87: kernel launched 1000 times
grid: [1]  block: [256]
time(us): total=112807 max=221 min=111 avg=112
/home/mharris/src/parallel-forall/code-samples/posts/002-openacc-example/step2/laplace2d.c
  main
    74: region entered 1 time
  time(us): total=8936004
data=96980
acc_init.c
  acc_init
    29: region entered 1 time
  time(us): init=82801

We can see 3 regions. The first one is the “memcpy” loop nest starting on line 92, which takes about 2.3s out of 8.9s total. The second one is the main computation loop region starting on line 80, which takes about 6.5s. The third region mentioned is the enclosing while loop data region starting on line 74. We can see that this region takes 8.9s, which is good—that means there is no other part of the program that takes significant time. If we look at the main loop nests, we can see these lines:

grid: [256x256]  block: [16x16]

The terms grid and block come from the CUDA programming model. The GPU executes groups of threads called thread blocks. To execute a kernel, the application launches a grid of these thread blocks. Each block runs on one of the GPU’s multiprocessors and is assigned a certain range of IDs that it uses to address a unique data range. In this case our thread blocks have 256 threads each, arranged in 16×16 two-dimensional blocks. The grid is also 2D, 256 blocks wide and 256 blocks tall. This is just enough to cover our 4096×4096 grid. But we don’t really need that many blocks—if we tell the compiler to launch fewer, it will automatically generate a sequential loop over data blocks within the kernel code run by each thread.

I have tried this and found that it is more efficient for this kernel. Also, blocks 32 threads wide execute this code faster than 16-wide blocks, because the GPU processes instructions 32 threads at a time (a group of 32 threads is called a warp in the CUDA architecture). To specify these dimensions, let’s modify the code for the main computational loop nest as follows.

#pragma omp parallel for shared(m, n, Anew, A)
#pragma acc kernels loop gang(32), vector(16)
        for( int j = 1; j < n-1; j++) {
#pragma acc loop gang(16), vector(32)
            for( int i = 1; i < m-1; i++ )

The gang(32) clause on the outer loop tells the compiler to launch 32 blocks in the Y (row) direction.  Thegang(16) clause on the inner loop tells it to launch 16 blocks in the X (column) direction. The vector(16) clause on the outer loop tells the compiler to use blocks that are 16 threads tall, thus processing the loop iterations in SIMD groups of 16. Finally, the vector(32) clause on the inner loop tells the compiler to use blocks that are 32 threads wide (one warp wide). I also found that making a similar change to the copy loop nest benefits performance a small amount. In this case I found it best to only apply the gang and vector clauses to the inner loop, to ensure the blocks used are as wide as a warp. Here are the changes to the C code.

#pragma omp parallel for shared(m, n, Anew, A)
#pragma acc kernels loop
        for( int j = 1; j < n-1; j++) {
#pragma acc loop gang(16), vector(32)           
            for( int i = 1; i < m-1; i++ )

The changes to the Fortran code are very similar. Here is the first loop nest.

!$omp parallel do shared(m, n, Anew, A) reduction( max:error )
!$acc kernels loop gang(32), vector(16)
    do j=1,m-2
!$acc loop gang(16), vector(32)
do i=1,n-2
  Anew(i,j) = 0.25_fp_kind * ( A(i+1,j  ) + A(i-1,j  ) + &
 A(i  ,j-1) + A(i  ,j+1) )
  error = max( error, abs(Anew(i,j)-A(i,j)) )
end do
!$acc end loop
    end do
!$acc end kernels
!$omp end parallel do

And here is the second Fortran loop nest.

!$omp parallel do shared(m, n, Anew, A)
!$acc kernels loop
    do j=1,m-2
!$acc loop gang(16), vector(32)
do i=1,n-2
  A(i,j) = Anew(i,j)
end do
!$acc end loop
    end do
!$acc end kernels
!$omp end parallel do

While we’re at it, let’s make one other small change. We don’t need to copy Anew  to and from the GPU, since it is only accessed on the GPU.  We can modify the data region directive to specify that Anew is allocated only on the GPU using the create clause on the data directive.

#pragma acc data copy(A), create(Anew)

Here is the performance after these changes.

Execution Time (s) Speedup vs. 1 CPU thread Speedup vs. 4 CPU threads
CPU 1 thread 34.14
CPU 4 threads 21.16 1.61x 1.0x
GPU 5.32 6.42x 3.98x

We’ve increased performance of this application by 4x (comparing CPU “socket” to GPU “socket”) by adding just a few lines of compiler directives!

Wrapping Up

OpenACC is a promising new standard that I think will be a powerful tool for high-productivity acceleration of existing C and Fortran code. It will also be very valuable for prototyping; programmers can use it to determine if it is worthwhile to spend more time writing lower-level CUDA code for GPUs, for example. I hope you have found this overview interesting—please let me know what you think in the comments! I’ll revisit OpenACC in a future post, possibly to discuss some of the challenges you might face in applying OpenACC to more complicated codes, or to provide some guidelines on getting the best results.

If you are interested in trying OpenACC you can download a free trial of PGI Accelerator to try it out. The OpenACC version 1.0 specification and a reference card are available here.

∥∀

About Mark Harris

Mark is Chief Technologist for GPU Computing Software at NVIDIA. Mark has fifteen years of experience developing software for GPUs, ranging from graphics and games, to physically-based simulation, to parallel algorithms and high-performance computing. Mark has been using GPUs for general-purpose computing since before they even supported floating point arithmetic. While a Ph.D. student at UNC he recognized this nascent trend and coined a name for it: GPGPU (General-Purpose computing on Graphics Processing Units), and started GPGPU.org to provide a forum for those working in the field to share and discuss their work. Follow @harrism on Twitter