Combine OpenACC and Unified Memory for Productivity and Performance

The post Getting Started with OpenACC covered four steps to progressively accelerate your code with OpenACC. It’s often necessary to use OpenACC directives to express both loop parallelism and data locality in order to get good performance with accelerators. After expressing available parallelism, excessive data movement generated by the compiler can be a bottleneck, and correcting this by adding data directives takes effort. Sometimes expressing proper data locality is more effort than expressing parallelism with loop directives.

Wouldn’t it be nice if programs could manage data locality automatically? Well, this is possible today with Unified Memory (on Kepler and newer GPU architectures). In this post I demonstrate how to combine OpenACC with Unified Memory to GPU-accelerate your existing applications with minimal effort. You can download the source code for the example in this post from the Parallel Forall GitHub repository.

Jacobi Iteration with Heap Memory

I’ll use the popular Jacobi iteration example code, which is representative of many real-world stencil computations. In contrast to the previous OpenACC post, I modified the array data allocation to use heap memory instead of automatic stack-allocated arrays. This is a more common scenario for real applications, since real-world data arrays are often too large for stack memory. It also makes the case more challenging for OpenACC, because the compiler no longer knows the size of the arrays. The following excerpt shows the main loop of the Jacobi iteration with 2D index computation. Continue reading
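
Since the excerpt itself is in the full post, here is a rough sketch of what the heap-allocated Jacobi main loop looks like (illustrative only; the array names, sizes, and the managed-memory compile flag mentioned in the comments are assumptions, and the actual code is in the GitHub repository):

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main(void)
{
    const int n = 4096, m = 4096, iter_max = 1000;
    const double tol = 1.0e-6;
    double error = 1.0;

    /* Heap allocation: the compiler can no longer see the array extents,
     * so the loops below use explicit 2D index computation (j*m + i).
     * Building for a managed/unified-memory target (for example,
     * pgcc -acc -ta=tesla:managed with the PGI compiler of that era) lets
     * the CUDA runtime migrate A and Anew automatically instead of
     * requiring explicitly shaped data clauses. */
    double *restrict A    = (double *)calloc(n * m, sizeof(double));
    double *restrict Anew = (double *)calloc(n * m, sizeof(double));

    /* Fixed boundary condition on the left edge. */
    for (int j = 0; j < n; j++) { A[j*m] = 1.0; Anew[j*m] = 1.0; }

    int iter = 0;
    while (error > tol && iter < iter_max)
    {
        error = 0.0;

        #pragma acc kernels
        for (int j = 1; j < n - 1; j++)
            for (int i = 1; i < m - 1; i++)
            {
                Anew[j*m + i] = 0.25 * (A[j*m + i+1] + A[j*m + i-1]
                                      + A[(j-1)*m + i] + A[(j+1)*m + i]);
                error = fmax(error, fabs(Anew[j*m + i] - A[j*m + i]));
            }

        #pragma acc kernels
        for (int j = 1; j < n - 1; j++)
            for (int i = 1; i < m - 1; i++)
                A[j*m + i] = Anew[j*m + i];

        if (iter % 100 == 0) printf("%5d, %0.6f\n", iter, error);
        iter++;
    }

    free(A);
    free(Anew);
    return 0;
}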


Leveraging OpenACC to Compute High-Accuracy Molecular Energies

Janus Juul Eriksen
For this interview, I reached out to Janus Juul Eriksen, a Ph.D. fellow at Aarhus University in Denmark. Janus is a chemist by trade without any formal education in computer science, yet he is getting up to 12x speed-up compared to his CPU-only code after modifying fewer than 100 lines of code with one week of programming effort.

How did he do this? He used OpenACC.

OpenACC is a simple, powerful and portable approach for researchers and scientists who need to rapidly boost application performance for faster science while minimizing programming. With OpenACC, the original source code is kept intact, making the implementation intuitively transparent and leaving most of the hard work to the compiler.

NVIDIA recently announced the new OpenACC Toolkit, an all-in-one suite of parallel programming tools that helps researchers and scientists quickly accelerate applications.

“OpenACC is much easier to learn than OpenMP or MPI. It makes GPU computing approachable for domain scientists,” says Janus. “Our initial OpenACC implementation required only minor efforts, and more importantly, no modifications of our existing CPU implementation.”

Figure 1: The active site of the Bacterial Leucine Transporter (LeuT) protein, on which the LSDalton team has been doing a number of high-accuracy calculations. The LeuT protein belongs to a family of transporters that are linked to neurological diseases.

Janus is part of the research team developing the quantum chemistry code LSDalton, a massively parallel and linear-scaling program for the accurate determination of energies and other molecular properties for large molecular systems.

In need of speed, the LSDalton team was awarded an INCITE allocation, which gave them access to Oak Ridge National Laboratory’s Titan supercomputer. With this access in hand, they needed a way to harness the power of the supercomputer: enter OpenACC. After demonstrating success on Titan with their GPU-accelerated code, they were recently one of 13 application code projects selected to join the Center for Accelerated Application Readiness (CAAR) program. This means they will be among the first applications to run on Summit, the new supercomputer debuting in 2018, which will deliver more than five times the computational performance of Titan’s 18,688 nodes.

This access will enable the research team to simulate larger molecular structures at higher accuracy, ultimately accelerating discoveries in materials and quantum chemistry.

Continue reading


Getting Started with OpenACC

This week NVIDIA released the NVIDIA OpenACC Toolkit, a starting point for anyone interested in using OpenACC. OpenACC gives scientists and researchers a simple and powerful way to accelerate scientific computing without significant programming effort. The toolkit includes the PGI OpenACC Compiler, the NVIDIA Visual Profiler with CPU and GPU profiling, and the new OpenACC Programming and Best Practices Guide. Academics can get a free, renewable license to the PGI C, C++, and Fortran compilers with support for OpenACC.

Figure 1: LS-DALTON: Benchmark on Oak Ridge Titan Supercomputer, AMD CPU vs Tesla K20X GPU. Test input: Alanine-3 on CCSD(T) module. Additional information: NICAM, COSMO.

OpenACC is an open specification for compiler directives for parallel programming. By using OpenACC, developers can rapidly accelerate existing C, C++, and Fortran applications using high-level directives that help retain application portability across processor architectures. Figure 1 shows some examples of real code speedups with OpenACC. The OpenACC specification is designed and maintained with the cooperation of many industry and academic partners, such as Cray, AMD, PathScale, University of Houston, Oak Ridge National Laboratory and NVIDIA.

When I program with and teach OpenACC, I like to use a four-step cycle to progressively accelerate the code.

  1. Identify Parallelism: Profile the code to understand where the program is spending its time and how much parallelism is available to be accelerated in those important routines. GPUs excel when there’s a significant amount of parallelism to exploit, so look for loops and loop nests with a lot of independent iterations.
  2. Express Parallelism: Placing OpenACC directives on the loops identified in step 1 tells the compiler to parallelize them. OpenACC is all about giving the compiler enough information to effectively accelerate the code, so during this step I add directives to as many loops as I believe I can and move as much of the computation to the GPU as possible.
  3. Express Data Locality: The compiler needs to know not just what code to parallelize, but also which data will be needed on the accelerator by that code. After expressing available parallelism, I often find that the code has slowed down. As you’ll see later in this post, this slowdown comes from the compiler making cautious decisions about when data needs to be moved to the GPU for computation. During this step, I’ll express to the compiler my knowledge of when and how the data is really needed on the GPU (the sketch after this list illustrates steps 2 and 3).
  4. Optimize: The compiler usually does a very good job accelerating code, but sometimes you can get more performance by giving the compiler a little more information about the loops or by restructuring the code to increase parallelism or improve data access patterns. Most of the time this leads to small improvements, but sometimes gains can be bigger.
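
To make steps 2 and 3 concrete, here is a minimal, hypothetical saxpy-style example (not code from the post; all names and sizes are illustrative). Step 2 is the parallel loop directive; step 3 is the enclosing data region that keeps the arrays resident on the GPU across the whole time loop instead of letting the compiler copy them for every kernel launch:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const int n = 1 << 20, nsteps = 1000;
    const float a = 2.0f;
    float *x = (float *)malloc(n * sizeof(float));
    float *y = (float *)malloc(n * sizeof(float));
    for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 0.0f; }

    /* Step 3: express data locality. The data region copies x and y to the
     * GPU once, keeps them there for all nsteps iterations, and copies y
     * back at the end. Remove this directive and the compiler must move the
     * arrays around every parallel region, which is the slowdown described
     * in step 3. */
    #pragma acc data copyin(x[0:n]) copy(y[0:n])
    {
        for (int iter = 0; iter < nsteps; iter++)
        {
            /* Step 2: express parallelism. Each iteration of the inner loop
             * is independent, so the compiler can safely map it to the GPU. */
            #pragma acc parallel loop
            for (int i = 0; i < n; i++)
                y[i] = a * x[i] + y[i];
        }
    }

    printf("y[0] = %f\n", y[0]);
    free(x);
    free(y);
    return 0;
}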

Continue reading


Introducing the NVIDIA OpenACC Toolkit

Programmability is crucial to accelerated computing, and NVIDIA’s CUDA Toolkit has been critical to the success of GPU computing. Over 3 million CUDA Toolkits have been downloaded since its first launch. However, there are many scientists and researchers who have yet to benefit from GPU computing. These scientists have limited time to learn and apply a parallel programming language, and they often have huge existing code bases that must remain portable across platforms. Today NVIDIA is introducing the new OpenACC Toolkit to help these researchers and scientists achieve their science and engineering goals faster.

Over the last few years OpenACC has established itself as a higher-level approach to GPU acceleration that is simple, powerful, and portable. The membership of the OpenACC organization has grown to include accelerator manufacturers, tools vendors, supercomputing centers, and educational institutions. The OpenACC 2.0 specification significantly expands the functionality and improves the portability of OpenACC, and it is now available in many commercial tools.

The NVIDIA OpenACC Toolkit provides the tools and documentation that scientists and researchers need to be successful with OpenACC. The toolkit includes a free OpenACC compiler for university developers, removing any barriers to use by academics.

The new OpenACC Toolkit includes the following in a single package. Continue reading


Porting Scientific Applications to GPUs at the OLCF OpenACC Hackathon

Dr. Misun Min of Argonne National Laboratory

Six scientific computing teams from around the world spent an intense week late last year porting their applications to GPUs using OpenACC directives. The Oak Ridge Leadership Computing Facility (OLCF) hosted its first ever OpenACC Hackathon in Knoxville, Tennessee. Paired with two GPU mentors, each team of scientific developers set forth on the journey to accelerate their code with GPUs.

Dr. Misun Min, a computational scientist at Argonne National Laboratory, led the NekCEM Team and she shared the results of accelerating NekCEM with OpenACC and NVIDIA GPUDirect™ communication.

Who were the NekCEM hackathon team members, and how much GPU computing experience did your team have?

I have only six months of experience, and at the time of the Hackathon I didn’t really have any. The other members included Matthew Otten from Cornell University (six months of GPU computing experience), Jing Gong from KTH in Sweden (two years of OpenACC experience), and Azamat Mametjanov from Argonne. The team also had close interactions with Nek5000 developer Paul Fischer at UIUC for useful discussions.

Who were your mentors?

Two mentors from Cray Inc.: Aaron Vose and John Levesque. Aaron and John provided strong technical support to boost the performance of a GPU-enabled NekCEM version.

First, what is NekCEM?

NekCEM (Nekton for Computational ElectroMagnetics) is an open-source code designed for predictive modeling of electromagnetic systems, such as linear accelerators, semiconductors, plasmonic devices, and quantum systems described by the Maxwell, Helmholtz, drift-diffusion, and Schrödinger or density matrix equations. The code is based on high-order discretizations of the underlying partial differential equations using spectral element (SE) and spectral-element discontinuous Galerkin (SEDG) schemes that have been shown to require order-of-magnitude fewer grid points than do conventional low-order schemes for the same accuracy. NekCEM uses globally unstructured meshes comprising body-fitted curvilinear hexahedral elements, which allow the discrete operators to be expressed as matrix-matrix products applied to arrays of the tensor product basis of Lagrange interpolation polynomials on the Gauss-Lobatto-Legendre quadrature points. The tight coupling of the degrees of freedom within elements leads to efficient data reuse while requiring boundary-minimal (unit-depth-stencil) data communication to effect flux exchanges between neighboring elements.

What were your team’s goals going into the OpenACC Hackathon?

The team had two goals: (1) to develop a high-performance GPU-based operational variant of NekCEM that supports the full functionality of the existing CPU-only code in Fortran/C and (2) to perform analysis to find performance bottlenecks and infer potential scalability for GPU-based architectures of the future. Continue reading


12 GTC 2015 Sessions Not to Miss

With one week to go until we all descend on GTC 2015, I’ve scoured through the list of Accelerated Computing sessions and put together 12 diverse “not to miss” talks you should add to your planner. This year, the conference is highlighting the revolution in Deep Learning that will affect every aspect of computing. GTC 2015 includes over 40 session categories, including deep learning and machine learning, scientific visualization, cloud computing, and HPC.

“This is the place where scientists, programmers, researchers, and a myriad of creative professionals convene to tap into the power of a GPU for more than gaming.” –Forbes

Tuesday, March 17

An Introduction to CUDA Programming (S5661)


This is the introductory tutorial intended for those new to CUDA, and you will leave with the essential knowledge to start programming in CUDA; no experience is needed! For those who have prior CUDA experience, this is a great session to brush up on key concepts required for subsequent tutorials on CUDA optimization. The other tutorials in this series are An Introduction to the GPU Memory Model, Asynchronous Operations and Dynamic Parallelism in CUDA, and Essential CUDA Optimization Techniques.

GTC attendees learn from the brightest minds in accelerated computing with hundreds of talks and hands-on tutorials.

SMTool: A GPU based Satellite Image Analysis Tool (S5201)


Dilip Patlolla, an R&D engineer in the Geographic Information Science and Technology (GIST) Group at Oak Ridge National Laboratory, will demonstrate SMTool, the group’s advanced satellite image analysis tool built on the CUDA platform to process city-scale, sub-meter-resolution satellite imagery and detect and discriminate man-made structures. Continue reading


Learn GPU Computing with Hands-On Labs at GTC 2015

Every year NVIDIA’s GPU Technology Conference (GTC) gets bigger and better. One of the aims of GTC is to give developers, scientists, and practitioners opportunities to learn, through hands-on labs, how to use accelerated computing in their work. This year we are nearly doubling the amount of hands-on training provided from last year, with almost 2,400 lab hours available to GTC attendees!

We have two types of training this year at GTC: instructor-led labs and self-paced labs. And to help you keep up with one of the hottest trends in computing, this year we’re featuring a Deep Learning training track. Keep reading for details. If you haven’t registered for GTC yet this year, keep reading for a discount code.

Deep Learning Track

There is an explosion of Deep Learning topics at GTC, and it’s not limited to the keynotes, talks and tutorial sessions. We’ll feature at least six hands-on labs related to accelerating facets of Deep Learning on GPUs. From an introduction to Deep Learning on GPUs to cutting-edge techniques and tools, there will be something for everyone. Be sure to get to these labs early to get yourself a seat! Here are a few of the labs available in this track:

  • Introduction to Machine Learning with GPUs: Handwritten digit classification (S5674)
  • DIY Deep Learning for Vision with Caffe (S5647)
  • Applied Deep Learning for Vision, Natural Language and Audio with Torch7 (S5574)
  • Deep Learning with the Theano Python Library (S5732)
  • Deep Belief Networks Using ArrayFire (S5722)
  • Accelerate a Machine Learning C++ example with Thrust (S5822)

Instructor-led Labs

Just like GTC last year, there will be twenty hands-on instructor-led labs. These are 80-minute labs led by an expert on the topic. Continue reading


Learn GPU Programming in Your Browser with NVIDIA Hands-On Labs

As CUDA Educator at NVIDIA, I work to give everyone access to massively parallel programming education and training, whether or not they have access to GPUs in their own machines. This is why, in partnership with qwikLABS, NVIDIA has made the hands-on content we use to train thousands of developers at the Supercomputing Conference and the GPU Technology Conference available online, accessible from anywhere with an internet connection. Using any supported browser, you can easily get started learning how to program massively parallel GPUs.

Using the powerful IPython Notebook technology, NVIDIA hands-on labs are immersive, self-paced experiences that run on real GPUs in the cloud. Lab instructions, editing and execution of code, and even interaction with visual tools are all woven together into a single web application.


Continue reading


3 Versatile OpenACC Interoperability Techniques

OpenACC is a high-level programming model for accelerating applications with GPUs and other devices using compiler directives to specify loops and regions of code in standard C, C++, and Fortran to offload from a host CPU to an attached accelerator. OpenACC simplifies accelerating applications with GPUs. An often-overlooked feature of OpenACC is its ability to interoperate with the broader parallel programming ecosystem. In this post I’ll teach you three powerful interoperability techniques for combining OpenACC and CUDA: the host_data construct, the deviceptr clause, and the acc_map_data() API function.

I’ll demonstrate these techniques with several examples of mixing OpenACC with CUDA C++, CUDA Fortran, Thrust, and GPU-accelerated libraries. If you’d like to follow along at home, grab the source code for the examples from GitHub and try them out with your OpenACC compiler and the CUDA Toolkit. Don’t have an OpenACC compiler? You can download a free 30-day trial of the PGI accelerator compiler.

You may already be thinking to yourself, “If OpenACC is so great, why would I need to use it with CUDA?” OpenACC interoperability features open the door to the GPU-computing ecosystem, allowing you to leverage more than 10 years of code development. Need to multiply two matrices together? Don’t write your own function, just call the cuBLAS library, which has been heavily optimized for GPUs. Does your colleague already have a CUDA routine that you could use in your code? Use it! Interoperability means that you can always use the best tool for the job in any situation. Accelerate your application using OpenACC, but call an optimized library. Expand an existing CUDA application by adding OpenACC to unaccelerated routines. Your choice isn’t OpenACC or CUDA, it’s OpenACC and CUDA. Continue reading
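
As a small, hedged illustration of the first technique (a sketch assuming an OpenACC compiler plus the CUDA Toolkit’s cuBLAS library; the array names and sizes are made up, and the post’s GitHub examples show the real code), host_data use_device exposes the device addresses of arrays managed by an OpenACC data region so they can be handed directly to cublasSaxpy:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(void)
{
    const int n = 1 << 20;
    const float alpha = 2.0f;
    float *x = (float *)malloc(n * sizeof(float));
    float *y = (float *)malloc(n * sizeof(float));
    for (int i = 0; i < n; i++) { x[i] = 1.0f; y[i] = 0.0f; }

    cublasHandle_t handle;
    cublasCreate(&handle);

    #pragma acc data copyin(x[0:n]) copy(y[0:n])
    {
        /* Inside host_data, x and y name the device copies created by the
         * enclosing data region, so cuBLAS operates on GPU memory directly
         * with no extra transfers. */
        #pragma acc host_data use_device(x, y)
        {
            cublasSaxpy(handle, n, &alpha, x, 1, y, 1);
        }

        /* Make sure the cuBLAS kernel has finished before the data region
         * copies y back to the host. */
        cudaDeviceSynchronize();
    }

    cublasDestroy(handle);
    printf("y[0] = %f (expected %f)\n", y[0], alpha);
    free(x);
    free(y);
    return 0;
}

The deviceptr clause works in the opposite direction, letting pointers allocated with cudaMalloc or returned by a CUDA library be used inside OpenACC compute regions.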


CUDA Pro Tip: Profiling MPI Applications

When I profile MPI+CUDA applications, sometimes performance issues only occur for certain MPI ranks. To fix these, it’s necessary to identify the MPI rank where the performance issue occurs. Before CUDA 6.5 this was hard to do because the CUDA profiler only shows the PID of the processes, leaving the developer to figure out the mapping from PIDs to MPI ranks. Although the mapping can be done manually, for example for OpenMPI via the command-line option --display-map, it’s tedious and error prone. A solution that addresses this for the command-line output of nvprof is described here. In this post I will describe how the new output file naming of nvprof, introduced with CUDA 6.5, can be used to conveniently analyze the performance of an MPI+CUDA application with nvprof and the NVIDIA Visual Profiler (nvvp).

Profiling MPI applications with nvprof and nvvp

Collecting data with nvprof

nvprof supports dumping the profile to a file which can later be imported into nvvp. To generate a profile for an MPI+CUDA application, I simply start nvprof with the MPI launcher; up to CUDA 6 I used the string “%p” in the output file name, which nvprof automatically replaces with the PID, generating a separate file for each MPI rank. With CUDA 6.5, the string “%q{ENV}” can be used to name the output file of nvprof. This allows us to include the MPI rank in the output file name by utilizing environment variables automatically set by the MPI launcher (mpirun or mpiexec). For example, OpenMPI sets OMPI_COMM_WORLD_RANK to the MPI rank of each launched process.

$ mpirun -np 2 nvprof -o simpleMPI.%q{OMPI_COMM_WORLD_RANK}.nvprof ./simpleMPI
Running on 2 nodes
==18811== NVPROF is profiling process 18811, command: ./simpleMPI
==18813== NVPROF is profiling process 18813, command: ./simpleMPI
Average of square roots is: 0.667279
==18813== Generated result file: simpleMPI.1.nvprof
==18811== Generated result file: simpleMPI.0.nvprof

Continue reading