Performance Portability from GPUs to CPUs with OpenACC

OpenACC gives scientists and researchers a simple and powerful way to accelerate scientific computing applications incrementally. The OpenACC API describes a collection of compiler directives to specify loops and regions of code in standard C, C++, and Fortran to be offloaded from a host CPU to an attached accelerator. OpenACC is designed for portability across operating systems, host CPUs, and a wide range of accelerators, including APUs, GPUs, and many-core coprocessors.


And starting today, with the PGI Compiler 15.10 release, OpenACC enables performance portability between accelerators and multicore CPUs. The new PGI Fortran, C and C++ compilers for the first time allow OpenACC-enabled source code to be compiled for parallel execution on either a multicore CPU or a GPU accelerator. This capability provides tremendous flexibility for programmers, enabling applications to take advantage of multiple system architectures with a single version of the source code.
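To make the single-source idea concrete, here is a minimal sketch (not taken from the PGI release) of a SAXPY loop annotated with an OpenACC directive. The function name and the data clauses are illustrative choices. A compiler without OpenACC support ignores the pragma and runs the loop serially, while an OpenACC compiler can map the same loop to a GPU or to multicore CPU threads.

```cpp
#include <cassert>

// y = a * x + y over n elements.
// The directive asks an OpenACC compiler to parallelize the loop and to
// copy x in, and y in and out, if the target has a separate memory.
void saxpy(int n, float a, const float *x, float *y)
{
    assert(n >= 0);
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

With the PGI compilers the target is typically chosen on the command line with the `-ta` option (for example `-ta=tesla` or `-ta=multicore`); check the compiler documentation for the exact spelling in your release.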

“Our goal is to enable HPC developers to easily port applications across all major CPU and accelerator platforms with uniformly high performance using a common source code base,” said Douglas Miles, director of PGI Compilers & Tools at NVIDIA. “This capability will be particularly important in the race towards exascale computing in which there will be a variety of system architectures requiring a more flexible application programming approach.”

As the chart above shows, performance on multicore CPUs for HPC apps using MPI + OpenACC is equivalent to that of MPI + OpenMP code. Compiling and running the same code on a Tesla K80 GPU can provide large speedups.


Deep Learning for Computer Vision with MATLAB and cuDNN

Deep learning is becoming ubiquitous. With recent advancements in deep learning algorithms and GPU technology, we are able to solve problems once considered impossible in fields such as computer vision, natural language processing, and robotics.

Figure 1: Pet detection and recognition system.

Deep learning uses deep neural networks, which have been around for a few decades; what’s changed in recent years is the availability of large labeled datasets and powerful GPUs. Neural networks are inherently parallel algorithms, and GPUs with thousands of cores can take advantage of this parallelism to dramatically reduce the computation time needed for training deep learning networks. In this post, I will discuss how you can use MATLAB to develop an object recognition system using deep convolutional neural networks and GPUs.

Why Deep Learning for Computer Vision?

Machine learning techniques use data (images, signals, text) to train a machine (or model) to perform a task such as image classification, object detection, or language translation. Classical machine learning techniques are still being used to solve challenging image classification problems. However, they don’t work well when applied directly to images, because they ignore the structure and compositional nature of images. Until recently, state-of-the-art techniques made use of feature extraction algorithms that extract interesting parts of an image as compact low-dimensional feature vectors. These were then used along with traditional machine learning algorithms.

Enter deep learning. Deep convolutional neural networks (CNNs), a specific type of deep learning algorithm, address the gaps in traditional machine learning techniques, changing the way we solve these problems. CNNs not only perform classification, but they can also learn to extract features directly from raw images, eliminating the need for manual feature extraction. For computer vision applications you often need more than just image classification; you need state-of-the-art computer vision techniques for object detection, a bit of domain expertise, and the know-how to set up and use GPUs efficiently. Through the rest of this post, I will use an object recognition example to illustrate how easy it is to use MATLAB for deep learning, even if you don’t have extensive knowledge of computer vision or GPU programming.


Cutting Edge Parallel Algorithms Research with CUDA

Leyuan Wang, a Ph.D. student in the UC Davis Department of Computer Science, presented one of only two “Distinguished Papers” of the 51 accepted at Euro-Par 2015. Euro-Par, a European conference devoted to all aspects of parallel and distributed processing, was held August 24-28 at Austria’s Vienna University of Technology.

Leyuan’s paper Fast Parallel Suffix Array on the GPU, co-authored by her advisor John Owens and Sean Baxter, a research scientist at New York’s DE Shaw Research, details their efforts to implement a linear-time suffix array construction algorithm on NVIDIA GPUs, resulting in algorithmic improvements and significant speedups over the existing state of the art.

Wang completed her master’s degree in electrical and computer engineering at UC Davis in October 2014, after having earned her undergraduate degree in electronics science and technology at China’s Zhejiang University.

Brad: Can you talk a bit about your current research?

Leyuan Wang: I work on high-performance string processing and graph processing algorithms, mostly in string and graph queries. My current research focus is on GPGPU (general-purpose computing on graphics processing units), and the benchmark I care about most is speed. I’ve been working on designing and improving parallel suffix array construction algorithms (SACAs) and incorporating the implementations into Burrows-Wheeler transform-based lossless data compression (bzip2) and a parallel FM index for pattern searching. The suffix array (SA) of a string lists the starting positions of all suffixes of the string in lexicographically sorted order. The inverse suffix array (ISA) gives the lexicographic rank of each suffix.

The Burrows-Wheeler transform (BWT) of a string is generated by lexicographically sorting the cyclic shifts of the string to form a string matrix and taking the last column of the matrix. The BWT groups repeated characters together by permuting the string; it is also reversible, which means the original string can be recovered. These two characteristics make BWT a popular choice for a compression pipeline stage (for instance, bzip2). The BWT is directly related to the suffix array: the sorted rows in the matrix are essentially the sorted suffixes of the string, and the first column of the matrix reflects a suffix array. Table 1 shows an example of the SA, ISA and BWT of the input string “banana$”.

Table 1: SA, ISA and BWT for the example string “banana$”.
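The three structures in Table 1 can be reproduced with a few lines of host code. The sketch below is a deliberately naive, sequential illustration of the definitions, not the GPU algorithm from the paper: it sorts suffixes by direct comparison, which is far slower than skew or prefix-doubling SACAs. The function names are chosen here for illustration, and the input is assumed to end in a unique sentinel (`$`) that sorts before every other character, so that sorting suffixes and sorting cyclic shifts give the same order.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// SA: starting positions of all suffixes, in lexicographic order.
// Naive comparison sort, for illustration only.
std::vector<int> suffix_array(const std::string &s)
{
    std::vector<int> sa(s.size());
    std::iota(sa.begin(), sa.end(), 0);
    std::sort(sa.begin(), sa.end(), [&s](int a, int b) {
        return s.compare(a, std::string::npos, s, b, std::string::npos) < 0;
    });
    return sa;
}

// ISA: isa[p] is the rank of the suffix that starts at position p.
std::vector<int> inverse_suffix_array(const std::vector<int> &sa)
{
    std::vector<int> isa(sa.size());
    for (std::size_t rank = 0; rank < sa.size(); ++rank) {
        isa[sa[rank]] = static_cast<int>(rank);
    }
    return isa;
}

// BWT: for each sorted suffix, the character that cyclically precedes it
// (the last column of the sorted cyclic-shift matrix).
std::string bwt(const std::string &s, const std::vector<int> &sa)
{
    assert(sa.size() == s.size());
    const std::size_t n = s.size();
    std::string out(n, '\0');
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = s[(sa[i] + n - 1) % n];
    }
    return out;
}
```

For “banana$” this gives SA = 6 5 3 1 0 4 2, ISA = 4 3 6 2 5 1 0, and BWT = “annb$aa”.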

The suffix array data structure is a building block in a spectrum of applications, including data compression, bioinformatics, and text indexing. I’ve studied the taxonomy of all classes of SACAs and compared them in order to find the best candidate for the GPU. I revisited the previous conclusion that skew SACAs are best suited to the GPU by demonstrating that prefix-doubling SACAs are actually better in both theoretical analysis and experimental benchmarks. Our hybrid skew/prefix-doubling suffix array implementation (with our amazing research collaborator Sean Baxter, formerly of NVIDIA Research) using a Tesla K20 achieves a 7.9x speedup against the previous state-of-the-art skew implementation. Our optimized skew SACA implementation has been added as a primitive to CUDPP 2.2 (CUDA Data Parallel Primitives Library) and incorporated into the BWT and bzip2 data compression application, resulting in great speedups compared with bzip2 in CUDPP 2.1. Figure 1 shows pseudocode for our two approaches.


Accelerating Materials Discovery with CUDA

In this post, we discuss how CUDA has facilitated materials research in the Department of Chemical and Biomolecular Engineering at UC Berkeley and Lawrence Berkeley National Laboratory. This post is a collaboration between Cory Simon, Jihan Kim, Richard L. Martin, Maciej Haranczyk, and Berend Smit.

Engineering Applications of Nanoporous Materials

Figure 1: The repeating crystal structure of metal-organic framework IRMOF-1. Atom color dictionary = {carbon: gray, oxygen: red, zinc: blue, hydrogen: white}.

Nanoporous materials have nano-sized pores such that only a few molecules can fit inside. Figure 1 shows the chemical structure of metal-organic framework IRMOF-1, just one of the many thousands of nanoporous materials that have been synthesized.

Nanoporous materials have many potential engineering applications based on gas adsorption: the process by which gas molecules adhere to a surface. In this case, the walls of the material’s pores form the surface to which gas molecules stick. Figure 2 shows the unit cell of the IRMOF-1 crystal structure and the corresponding depiction of IRMOF-1 as a raveled-up surface.

If we could unravel and flatten out the surface of IRMOF-1 in Figure 2, the surface area contained in a single gram of it could cover more than a soccer field! This provides a lot of surface area on which gas molecules can adsorb. These high surface areas are part of the reason that nanoporous materials are so promising for many engineering applications.

Figure 2: A nanoporous material can be abstracted as a raveled-up surface. On the left is the unit cell of the IRMOF-1 crystal structure. On the right is a depiction of the surface that IRMOF-1 forms.



Open, Reproducible Computational Chemistry with Python and CUDA

Increasingly, computational chemistry researchers use GPUs to push the boundaries of discovery. This motivated Christopher Cooper, an Instructor at Universidad Técnica Federico Santa María in Chile, to move to a Python-based software stack.

Cooper’s paper, “Probing protein orientation near charged nanosurfaces for simulation-assisted biosensor design,” was recently accepted by The Journal of Chemical Physics.

Brad: Can you talk a bit about your current research?

Christopher: I am interested in developing fast and accurate algorithms to study the effect of electrostatics in protein systems. We use continuum models to represent the solvent around the protein (water with salt) via the Poisson-Boltzmann equation, and solve it with an accelerated boundary element method. We call the resulting code PyGBe; it is open-source software under an MIT license and is available to download via the GitHub account of the research group where I did my Ph.D. at Boston University.

Figure 1: Electrostatic potential around a peptide derived from an HIV-1 capsid.



Customize CUDA Fortran Profiling with NVTX

The NVIDIA Tools Extension (NVTX) library lets developers annotate custom events and ranges within the profiling timelines generated using tools such as the NVIDIA Visual Profiler (NVVP) and Nsight. In my own optimization work, I rely heavily on NVTX to better understand internal as well as customer codes and to spot opportunities for better interaction between the CPU and the GPU.

Two previous Pro Tip posts on Parallel Forall showed how to use NVTX in CUDA C++ and MPI codes. In this post, I’ll show how to use NVTX to annotate the profiles of Fortran codes (with either CUDA Fortran or OpenACC).

NVTX has a lot of features, but here I’ll focus on using it to annotate the profiler output with timeline markers using nvtxRangePush() and nvtxRangePop(). I’ll show you how to insert markers with custom labels and colors.
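The push/pop pairing is easiest to see in code. The post itself works in Fortran; the sketch below instead uses C++ and the NVTX C API (`nvtxRangePushA` and `nvtxRangePop` from `nvToolsExt.h`), wrapped in a scope guard so that every push is matched by a pop. The `ScopedRange` class, the `USE_NVTX` macro, and the host-side depth counter are assumptions made for illustration, and colors, which need the attribute-based push call, are left out.

```cpp
#include <cassert>
#ifdef USE_NVTX
#include <nvToolsExt.h>  // NVTX C API; link with -lnvToolsExt
#endif

// Host-side nesting depth, kept only so the sketch is observable
// without a profiler attached (an assumption of this example).
static int g_range_depth = 0;

// Opens a labeled range on construction and closes it on destruction.
class ScopedRange {
public:
    explicit ScopedRange(const char *label)
    {
        assert(label != nullptr);
#ifdef USE_NVTX
        nvtxRangePushA(label);  // start a nested range on the timeline
#else
        (void)label;            // no-op build: nothing is recorded
#endif
        ++g_range_depth;
    }
    ~ScopedRange()
    {
#ifdef USE_NVTX
        nvtxRangePop();         // end the most recently pushed range
#endif
        --g_range_depth;
    }
    ScopedRange(const ScopedRange &) = delete;
    ScopedRange &operator=(const ScopedRange &) = delete;
};

// Two nested ranges, as they would appear stacked in the timeline.
int depth_inside_two_ranges()
{
    ScopedRange outer("time step");
    ScopedRange inner("halo exchange");
    return g_range_depth;
}
```

Building without `USE_NVTX` turns the annotations into no-ops, so the same source can be compiled on machines that do not have the CUDA toolkit installed.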


Simple, Portable Parallel C++ with Hemi 2 and CUDA 7.5

The last two releases of CUDA have added support for the powerful new features of C++. In the post The Power of C++11 in CUDA 7 I discussed the importance of C++11 for parallel programming on GPUs, and in the post New Features in CUDA 7.5 I introduced a new experimental feature in the NVCC CUDA C++ compiler: support for GPU Lambda expressions. Lambda expressions, introduced in C++11, provide concise syntax for anonymous functions (and closures) that can be defined in line with their use, can be passed as arguments, and can capture variables from surrounding scopes. GPU Lambdas bring that power and convenience to writing GPU functions, letting you launch parallel work on the GPU almost as easily as writing a for loop.
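The three lambda properties listed above (defined in line, passed as an argument, capturing surrounding variables) can be sketched in plain host C++11, without any GPU extensions. Here `for_each_index` is a hypothetical serial stand-in for a parallel loop helper, not part of CUDA or Hemi, and `scaled_sum` is an illustrative name.

```cpp
#include <cassert>
#include <vector>

// Hypothetical serial stand-in for a parallel loop helper:
// calls body(i) for every i in [first, last).
template <typename Body>
void for_each_index(int first, int last, Body body)
{
    assert(first <= last);
    for (int i = first; i < last; ++i) {
        body(i);
    }
}

// Returns a * x + y element-wise, with the loop body written as a lambda.
std::vector<float> scaled_sum(float a, const std::vector<float> &x,
                              const std::vector<float> &y)
{
    assert(x.size() == y.size());
    std::vector<float> out(x.size());
    const float *px = x.data();
    const float *py = y.data();
    float *pout = out.data();
    // [=] captures a and the three pointers by value; the lambda is
    // defined in line and passed as an argument to for_each_index.
    for_each_index(0, static_cast<int>(x.size()), [=](int i) {
        pout[i] = a * px[i] + py[i];
    });
    return out;
}
```

Capturing raw pointers by value, rather than the vectors themselves, mirrors how a loop body has to be written when it may run on a device that cannot touch host containers.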

In this post, I want to show you how modern C++ features combine to enable a higher-level, more portable approach to parallel programming for GPUs. To do so, I’ll show you Hemi 2, the second release of a simple open-source C++ library that I developed to explore approaches to portable parallel C++ programming. I have written before about Hemi on Parallel Forall, but Hemi 2 is easier to use, more portable, and more powerful.

Introducing Hemi 2

Hemi simplifies writing portable CUDA C/C++ code. With Hemi,

  • you can write parallel kernels like you write for loops—in line in your CPU code—and run them on your GPU;
  • you can launch C++ Lambda functions as GPU kernels;
  • you can easily write code that compiles and runs either on the CPU or GPU;
  • kernel launch configuration is automatic: details like thread block size and grid size are optimization details, rather than requirements.

With Hemi, parallel code for the GPU can be as simple as the parallel_for loop in the following code, which can also be compiled and run on the CPU.

void saxpy(int n, float a, const float *x, float *y)
{
  hemi::parallel_for(0, n, [=] HEMI_LAMBDA (int i) {
    y[i] = a * x[i] + y[i];
  });
}

Hemi is BSD-licensed, open-source software, available on GitHub.


Combine OpenACC and Unified Memory for Productivity and Performance

The post Getting Started with OpenACC covered four steps to progressively accelerate your code with OpenACC. It’s often necessary to use OpenACC directives to express both loop parallelism and data locality in order to get good performance with accelerators. After expressing available parallelism, excessive data movement generated by the compiler can be a bottleneck, and correcting this by adding data directives takes effort. Sometimes expressing proper data locality is more effort than expressing parallelism with loop directives.

Wouldn’t it be nice if programs could manage data locality automatically? Well, this is possible today with Unified Memory (on Kepler and newer GPU architectures). In this post I demonstrate how to combine OpenACC with Unified Memory to GPU-accelerate your existing applications with minimal effort. You can download the source code for the example in this post from the Parallel Forall GitHub repository.

Jacobi Iteration with Heap Memory

I’ll use the popular Jacobi iteration example code, which is representative of many real-world stencil computations. In contrast to the previous OpenACC post, I modified the array data allocation to use heap memory instead of using automatic stack-allocated arrays. This is a more common scenario for real applications since real-world data arrays are often too large for stack memory. This change also makes it a more challenging case for OpenACC since the compiler no longer knows the size of the arrays. The following excerpt shows the main loop of the Jacobi iteration with 2D index computation.
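A minimal sketch of such a loop, written for illustration rather than taken from the post’s repository: the function below performs one Jacobi sweep over heap-allocated arrays using a flattened 2D index. The names `jacobi_sweep` and `make_grid` and the boundary values are assumptions. No data clauses are given, which is the situation described above: an OpenACC compiler cannot infer the array extents from raw pointers, so this form typically needs either Unified Memory or added data directives. A compiler without OpenACC support ignores the pragmas and runs the loops serially.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Heap-allocated n x m grid in row-major order: top row set to 1, rest 0.
std::vector<float> make_grid(int n, int m)
{
    assert(n > 0 && m > 0);
    std::vector<float> grid(static_cast<std::size_t>(n) * m, 0.0f);
    for (int i = 0; i < m; ++i) {
        grid[i] = 1.0f;
    }
    return grid;
}

// One Jacobi sweep over the interior points. Each new value is the average
// of its four neighbours; the return value is the largest absolute change.
float jacobi_sweep(const float *A, float *Anew, int n, int m)
{
    float error = 0.0f;
    #pragma acc parallel loop reduction(max:error)
    for (int j = 1; j < n - 1; ++j) {
        #pragma acc loop reduction(max:error)
        for (int i = 1; i < m - 1; ++i) {
            const int idx = j * m + i;  // 2D index flattened to 1D
            Anew[idx] = 0.25f * (A[idx + 1] + A[idx - 1] +
                                 A[idx - m] + A[idx + m]);
            error = std::fmax(error, std::fabs(Anew[idx] - A[idx]));
        }
    }
    return error;
}
```

A full solver would call `jacobi_sweep` repeatedly, swapping the two arrays after each sweep, until the returned error drops below a tolerance.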


Increasing the Luminosity of Beam Dynamics with GPUs

What is dark matter? We can neither see it nor detect it with any instrument. CERN is upgrading the LHC (Large Hadron Collider), the largest and most powerful particle accelerator ever built, to explore the new high-energy frontier.

The most technically challenging aspects of the upgrade cannot be done by CERN alone and require collaboration and external expertise. There are 7,000 scientists from over 60 countries working to extend the LHC discovery potential; the accelerator will need a major upgrade around 2020 to increase its luminosity by a factor of 10 beyond the original design value.

Ph.D. student Adrian Oeftiger attends EPFL (École Polytechnique Fédérale de Lausanne) in Switzerland, one of the High Luminosity LHC beneficiaries. His research group is working to parallelize their algorithms to create software that will enable new kinds of beam dynamics studies that have not been possible with current technology.

Brad: How is your research related to the upgrade of the LHC?

Adrian: My world is all about luminosity; increasing the luminosity of particle beams. It is all about making ultra-high-energy collisions of protons possible, and at the same time providing enough collisions to enable fundamental particle physics research. That means increasing the luminosity. I’m doing my Ph.D. in beam dynamics in the field of accelerator physics.

These days, high-energy particle accelerators are the tools of choice to analyze and understand the fundamental building blocks of our universe. The huge detectors at the Large Hadron Collider (LHC) at CERN, buried about a hundred meters underground in the countryside near Geneva, need ever-increasing collision rates (hence luminosity!): they gather statistics of collision events to explore new realms of physics, to detect extremely rare interaction combinations and the tiniest quantities of new particles, and to find explanations for some of the numerous wonders of the universe we live in. What is the dark matter which makes up 27% of our universe made of? Why is the symmetry between anti-matter and ordinary matter broken, and why do we find only the latter in the universe?

CERN is preparing for the High Luminosity LHC, a powerful upgrade of the present accelerator to increase the chances of answering some of these fundamental questions. Increasing the chances translates to: we need more collisions, so we need higher luminosity.


CUDA 7.5: Pinpoint Performance Problems with Instruction-Level Profiling

[Note: Thejaswi Rao also contributed to the code optimizations shown in this post.]

Today NVIDIA released CUDA 7.5, the latest release of the powerful CUDA Toolkit. One of the most exciting new features in CUDA 7.5 is new Instruction-Level Profiling support in the NVIDIA Visual Profiler. This powerful new feature, available on Maxwell (GM200) and later GPUs, helps pinpoint performance bottlenecks, letting you quickly identify the specific lines of source code (and assembly instructions) limiting the performance of GPU code, along with the underlying reason for execution stalls.

In this post, I demonstrate Instruction-Level Profiling by showing how it helped me understand and address the performance limitations of a CUDA kernel that implements the Iterative Closest Point algorithm (the original source code, by Thomas Whelan, is available on GitHub). I’ll show how instruction-level profiling makes it easier to apply advanced optimizations, helping speed up the example kernel by 2.7X on an NVIDIA Quadro M6000 GPU.

Profiling the kernel using the Guided Analysis feature of the Visual Profiler showed that the kernel performance was bound by instruction and memory latency. Latency issues indicate that the hardware resources are not used efficiently since most warps are stalled by a dependency on a data value from a previous math or memory instruction. Figure 1 shows that the compute units are only 40% utilized and memory units are around 25% utilized, so there is definitely room for improvement.

Figure 1: Kernel Performance Limiter (bound by instruction and memory latency).

Stall Analysis in Previous Profiler Versions

Before CUDA 7.5, the Visual Profiler was only capable of pointing out performance issues at the application or CUDA kernel level. For stall latency analysis, the CUDA 7.0 Visual Profiler produces the pie chart in Figure 2 by collecting various stall reason events for the entire kernel.

Figure 2: Legacy (CUDA 7.0) pie chart for stall reasons (generated using events collected at the kernel level).

This pie chart shows that the two primary stall reasons in this kernel are synchronization and memory dependencies. But the kernel code contains many memory accesses and __syncthreads() calls, so this high-level analysis doesn’t provide any specific insight into which instructions are potential bottlenecks. In general it can be very difficult to find exact bottleneck causes in complex kernels using kernel-level profiling analysis. This is where CUDA 7.5 can help, as you’ll see.