In the Trenches at GTC: Programming GPUs with OpenACC

By Adnan Boz (GTC 2012 Guest Blogger)

It’s my first day at the GPU Technology Conference and I’ve already had the opportunity to meet gurus like Mark Harris (Chief Technologist, GPU Computing, NVIDIA, and founder of GPGPU.org) and learn about the latest advancements in the GPU and HPC arena from people like NVIDIA’s Will Ramey and Duncan Poole.

One of the hot topics so far is OpenACC, an open GPU directives standard that makes GPU programming straightforward and portable across parallel and multi-core processors (see: http://developer.nvidia.com/openacc).

OpenACC is as simple as adding the “#pragma acc kernels” directive to a parallelizable part of your code. (If you are familiar with OpenMP, you may notice the similarity to “#pragma omp parallel”.)
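To make the idea concrete, here is a minimal sketch of the directive in use (a hypothetical SAXPY-style loop of my own, not code from the conference):

```c
#include <stdio.h>

#define N 1000000

int main(void)
{
    static float x[N], y[N];
    const float a = 2.0f;

    for (int i = 0; i < N; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    /* Ask the compiler to offload this loop to the GPU; without the
       directive (or without accelerator support) the same code simply
       runs on the CPU. */
    #pragma acc kernels
    for (int i = 0; i < N; ++i)
        y[i] = a * x[i] + y[i];

    printf("y[0] = %f\n", y[0]);
    return 0;
}
```

Built with an OpenACC-capable compiler (for example, pgcc with -acc), the loop is offloaded to the GPU; a compiler that ignores the pragma runs the same source serially, which is where the portability comes from.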

The best part is that we can use OpenACC together with existing CUDA libraries and programming languages; libraries such as cuFFT, cuBLAS, cuRAND, cuSPARSE, NPP, and Thrust all work alongside it (see: http://developer.nvidia.com/gpu-accelerated-libraries).

An important point Mark Harris made in his presentation was how to analyze and optimize the use of accelerator directives: for example, using the PGI_ACC_TIME environment variable as well as the -Minfo=accel pgcc compiler option to determine where time is being spent.

Using these options, Mark identified a memory transfer bottleneck in the Jacobi Iteration example. I think it was a great example of how easy it is to structure a program the wrong way and end up with a very slow application.

It is always important to know the capabilities of the tool we are using. In this case, Mark used data constructs, specifically the “#pragma acc data” directive, which lets you manage data movement explicitly, to achieve a 3x improvement compared to CPU alternatives.
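In code, the fix looks roughly like the sketch below (a condensed Jacobi-style stencil of my own, not the exact source from the talk): the data region keeps A and Anew resident on the GPU for the whole convergence loop, so nothing is transferred between sweeps.

```c
#include <math.h>

#define NN 1024
#define NM 1024

float A[NN][NM], Anew[NN][NM];

/* Compile with something like "pgcc -acc -Minfo=accel" and run with
   PGI_ACC_TIME=1 to confirm the per-iteration copies are gone. */
void jacobi(int iter_max, float tol)
{
    int   iter  = 0;
    float error = 1.0f;

    /* Copy A to the device once, allocate Anew there, and copy A back
       only when the while loop finishes. */
    #pragma acc data copy(A) create(Anew)
    while (error > tol && iter < iter_max) {
        error = 0.0f;

        #pragma acc kernels
        for (int j = 1; j < NN - 1; j++)
            for (int i = 1; i < NM - 1; i++) {
                Anew[j][i] = 0.25f * (A[j][i+1] + A[j][i-1]
                                    + A[j-1][i] + A[j+1][i]);
                error = fmaxf(error, fabsf(Anew[j][i] - A[j][i]));
            }

        #pragma acc kernels
        for (int j = 1; j < NN - 1; j++)
            for (int i = 1; i < NM - 1; i++)
                A[j][i] = Anew[j][i];

        iter++;
    }
}
```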

If you are planning to step into using OpenACC, here are some tips and tricks:

  • Eliminate pointer arithmetic (use subscripted arrays instead of pointer-indexed arrays);
  • Inline function calls in directive regions;
  • Use contiguous memory for multi-dimensional arrays;
  • Avoid excessive memory transfers by using data regions;
  • Apply directives to (nested) for loops for best parallelization;
  • Use array shaping to indicate data sizes to the compiler (see the sketch after this list).
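To illustrate the last point, here is a small fragment of my own (not from the session) showing array shaping on pointer arguments, where the compiler cannot work out the sizes on its own:

```c
/* The [0:n] shapes tell the compiler exactly how many elements to move,
   which it cannot deduce from bare pointers. */
void vec_add(int n, const float *restrict x, const float *restrict y,
             float *restrict z)
{
    #pragma acc kernels copyin(x[0:n], y[0:n]) copyout(z[0:n])
    for (int i = 0; i < n; ++i)
        z[i] = x[i] + y[i];
}
```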

That’s all for now from Day One at GTC! Be sure to watch the streamcast of this presentation.

About our Guest Blogger

Adnan Boz is an NVIDIA-certified CUDA programmer working at Yahoo! He is also the organizer of the South Florida GPU Meetup.


About Mark Harris

Mark is Chief Technologist for GPU Computing Software at NVIDIA. Mark has fifteen years of experience developing software for GPUs, ranging from graphics and games, to physically-based simulation, to parallel algorithms and high-performance computing. Mark has been using GPUs for general-purpose computing since before they even supported floating point arithmetic. While a Ph.D. student at UNC, he recognized this nascent trend, coined a name for it: GPGPU (General-Purpose computing on Graphics Processing Units), and started GPGPU.org to provide a forum for those working in the field to share and discuss their work. Follow @harrism on Twitter.