# High-Performance Geometric Multi-Grid with GPU Acceleration

Linear solvers are probably the most common tool in scientific computing applications. There are two basic classes of methods that can be used to solve an $Ax=b$ equation: direct and iterative. Direct methods are usually robust, but have additional computational complexity and memory capacity requirements. Unlike direct solvers, iterative solvers require minimal memory overhead and feature better computational complexity. However, these solvers are still super-linear in the number of variables and often have a slow rate of convergence of low-frequency errors. Finally, there are multi-grid iterative methods that can deliver linear complexity by solving a problem at different resolutions and smoothing low frequency errors using coarser grids.

Broadly speaking, multi-grid methods can be differentiated into more general algebraic multi-grid (AMG) and the specialized geometric multi-grid (GMG). AMG is a perfect “black-box” solver for problems with unstructured meshes, where elements or volumes can have different numbers of neighbors, and it is difficult to identify a subproblem. There is an interesting blog post demonstrating that GPU accelerators show good performance in AMG using the NVIDIA AmgX library. GMG methods are more efficient than AMG on structured problems, since they can take advantage of the additional information from the geometric representation of the problem. GMG solvers have significantly lower memory requirements, deliver higher computational throughput and also show good scalability. Moreover, these methods require less tuning in general and have a simpler setup than AMG. Let’s take a closer look at GMG and see how well it maps to GPU accelerators. Continue reading

# Optimizing Warehouse Operations with Machine Learning on GPUs

Recent advances in deep learning have enabled research and industry to master many challenges in computer vision and natural language processing that were out of  reach until just a few years ago. Yet computer vision and natural language processing represent only the tip of the iceberg of what is possible. In this article, I will demonstrate how Sebastian Heinz,
Roland Vollgraf and I (Calvin Seward) used deep neural networks in steering operations at Zalando’s fashion warehouses.

As Europe’s leading online fashion retailer, there are many exciting opportunities to apply the latest results from data science, statistics, and high-performance computing. Zalando’s vertically integrated business model means that I have dealt with projects as diverse as computer vision, fraud detection, recommender systems and, of course, warehouse management.

To solve the warehouse management problem that I’ll discuss in this post, we trained a neural network that very accurately estimates the length of the shortest possible route that visits a set of locations in the warehouse. I’ll demonstrate how we used this neural network to greatly accelerate a processing bottleneck, which in turn enabled us to more efficiently split work between workers.

The core idea is to use deep learning to create a fast, efficient estimator for a slow and complex algorithm. This is an idea that can (and will) be applied to problems in many areas of industry and research. Continue reading

# Inference: The Next Step in GPU-Accelerated Deep Learning

At 45 images/s/W, Jetson TX1 is super efficient at deep learning inference. Read the whitepaper.
Deep learning is revolutionizing many areas of machine perception, with the potential to impact the everyday experience of people everywhere. On a high level, working with deep neural networks is a two-stage process: First, a neural network is trained: its parameters are determined using labeled examples of inputs and desired output. Then, the network is deployed to run inference, using its previously trained parameters to classify, recognize and process unknown inputs.

It is widely recognized within academia and industry that GPUs are the state of the art in training deep neural networks, due to both speed and energy efficiency advantages compared to more traditional CPU-based platforms. A new whitepaper from NVIDIA takes the next step and investigates GPU performance and energy efficiency for deep learning inference.

The results show that GPUs provide state-of-the-art inference performance and energy efficiency, making them the platform of choice for anyone wanting to deploy a trained neural network in the field. In particular, the NVIDIA GeForce GTX Titan X delivers between 5.3 and 6.7 times higher performance than the 16-core Intel Xeon E5 CPU while achieving 3.6 to 4.4 times higher energy efficiency. The NVIDIA Tegra X1 SoC also achieves impressive results, achieving higher performance (258 vs. 242 images/second) and much higher energy efficiency (45 vs. 3.9 images/second/Watt) than the state-of-the-art Intel Core i7 6700K. Continue reading

# Customize CUDA Fortran Profiling with NVTX

The NVIDIA Tools Extension (NVTX) library lets developers annotate custom events and ranges within the profiling timelines generated using tools such as the NVIDIA Visual Profiler (NVVP) and NSight. In my own optimization work, I rely heavily on NVTX to better understand internal as well as customer codes and to spot opportunities for better interaction between the CPU and the GPU.

Two previous Pro Tip posts on Parallel Forall showed how to use NVTX in CUDA C++ and MPI codes. In this post, I’ll show how to use NVTX to annotate the profiles of Fortran codes (with either CUDA Fortran or OpenACC).

NVTX has a lot of features, but here I’ll focus on using it to annotate the profiler output with timeline markers using `nvtxRangePush()` and `nvtxRangePop()`. I’ll show you how to insert markers with custom labels and colors. Continue reading

# Combine OpenACC and Unified Memory for Productivity and Performance

The post Getting Started with OpenACC covered four steps to progressively accelerate your code with OpenACC. It’s often necessary to use OpenACC directives to express both loop parallelism and data locality in order to get good performance with accelerators. After expressing available parallelism, excessive data movement generated by the compiler can be a bottleneck, and correcting this by adding data directives takes effort. Sometimes expressing proper data locality is more effort than expressing parallelism with loop directives.

Wouldn’t it be nice if programs could manage data locality automatically? Well, this is possible today with Unified Memory (on Kepler and newer GPU architectures). In this post I demonstrate how to combine OpenACC with Unified Memory to GPU-accelerate your existing applications with minimal effort. You can download the source code for the example in this post from the Parallel Forall GitHub repository.

## Jacobi Iteration with Heap Memory

I’ll use the popular Jacobi iteration example code which is representative of many real-world stencil computations. In contrast to the previous OpenACC post, I modified the array data allocation to use heap memory instead of using automatic stack-allocated arrays. This is a more common scenario for real applications since real-world data arrays are often too large for stack memory. This change also makes it a more challenging case for OpenACC since the compiler no longer knows the size of the arrays. The following excerpt shows the main loop of the Jacobi iteration with 2D index computation. Continue reading

# CUDA 7.5: Pinpoint Performance Problems with Instruction-Level Profiling

[Note: Thejaswi Rao also contributed to the code optimizations shown in this post.]

Today NVIDIA released CUDA 7.5, the latest release of the powerful CUDA Toolkit. One of the most exciting new features in CUDA 7.5 is new Instruction-Level Profiling support in the NVIDIA Visual Profiler. This powerful new feature, available on Maxwell (GM200) and later GPUs, helps pinpoint performance bottlenecks, letting you quickly identify the specific lines of source code (and assembly instructions) limiting the performance of GPU code, along with the underlying reason for execution stalls.

In this post, I demonstrate Instruction-Level Profiling by showing how it helped understand and improve the performance limitations of a CUDA kernel that implements the Iterative Closest Point algorithm (the original source code, by Thomas Whelan, is available on Github). I’ll show how instruction-level profiling makes it easier to apply advanced optimizations, helping speed up the example kernel by 2.7X on an NVIDIA Quadro M6000 GPU.

Profiling the kernel using the Guided Analysis feature of the Visual Profiler showed that the kernel performance was bound by instruction and memory latency. Latency issues indicate that the hardware resources are not used efficiently since most warps are stalled by a dependency on a data value from a previous math or memory instruction. Figure 1 shows that the compute units are only 40% utilized and memory units are around 25% utilized, so there is definitely room for improvement.

## Stall Analysis in Previous Profiler Versions

Before CUDA 7.5, the Visual Profiler was only capable of pointing out performance issues at the application or CUDA kernel level. For stall latency analysis, the CUDA 7.0 Visual Profiler produces the pie chart in Figure 2 by collecting various stall reason events for the entire kernel.

This pie chart shows that the two primary stall reasons in this kernel are synchronization and memory dependencies. But if I look into the kernel code, there are lots of memory accesses and `__syncthreads()` calls, so this high-level analysis doesn’t provide any specific insight into which instructions are potential bottlenecks. In general it can be very difficult to find exact bottleneck causes in complex kernels using kernel-level profiling analysis. This is where CUDA 7.5 can help, as you’ll see. Continue reading

# High-Performance MATLAB with GPU Acceleration

In this post, I will discuss techniques you can use to maximize the performance of your GPU-accelerated MATLAB® code. First I explain how to write MATLAB code which is inherently parallelizable. This technique, known as vectorization, benefits all your code whether or not it uses the GPU. Then I present a family of function wrappers—bsxfunpagefun, and arrayfun—that take advantage of GPU hardware, yet require no specialist parallel programming skills. The most advanced function, arrayfun, allows you to write your own custom kernels in the MATLAB language.

If these techniques do not provide the performance or flexibility you were after, you can still write custom CUDA code in C or C++ that you can run from MATLAB, as discussed in our earlier Parallel Forall posts on MATLAB CUDA Kernels and MEX functions.

All of the features described here are available out of the box with MATLAB and Parallel Computing Toolbox™.

## Mobile phone signal strength example

Throughout this post, I will use an example to help illustrate the techniques. A cellular phone network wants to map its coverage to help plan for new antenna installations. We imagine an idealized scenario with M = 25 cellphone masts, each H = 20 meters in height, evenly spaced on an undulating 10km x 10km terrain. Figure 1 shows what the map looks like.

On the GPU, in the following listing we define a number of variables including:

• map: An N x 3 height field in a 10km x 10km grid (N = 10,000);
• masts: An M x 3 array of antenna positions, at height H;
• AntennaDirection: A 3 x M array of vectors representing the orientation of each antenna.

# Voting and Shuffling to Optimize Atomic Operations

2iSome years ago I started work on my first CUDA implementation of the Multiparticle Collision Dynamics (MPC) algorithm, a particle-in-cell code used to simulate hydrodynamic interactions between solvents and solutes. As part of this algorithm, a number of particle parameters are summed to calculate certain cell parameters. This was in the days of the Tesla GPU architecture (such as GT200 GPUs, Compute Capability 1.x), which had poor atomic operation performance. A linked list approach I developed worked well on Tesla and Fermi as an alternative to atomic adds but performed poorly on Kepler GPUs. However, atomic operations are much faster on the Kepler and Maxwell architectures, so it makes sense to use atomic adds.

These types of summations are not limited to MPC or particle-in-cell codes, but, to some extent, occur whenever data elements are aggregated by key. For data elements sorted and combined by key with a large number of possible values, pre-combining elements with the same key at warp level can lead to a significant speed-up. In this post, I will describe algorithms for speeding up your summations (or similar aggregations) for problems with a large number of keys where there is a reasonable correlation between the thread index and the key. This is usually the case for elements that are at least partially sorted. Unfortunately, this argument works in both directions: these algorithms are not for you if your number of keys is small or your distribution of keys is random.  To clarify: by a “large” number of keys I mean more than could be handled if all bins were put into shared memory.

Note that this technique is related to a previously posted technique called warp-aggregated atomics by Andrey Adinetz, and also to the post Fast Histograms Using Shared Atomics on Maxwell by Nikolay Sakharnykh. The main difference here is that we are aggregating many groups, each designated by a key (to compute a histogram, for example). So you could consider this technique “warp-aggregated atomic reduction by key”. Continue reading

# Getting Started with OpenACC

This week NVIDIA has released the NVIDIA OpenACC Toolkit, a starting point for anyone interested in using OpenACC. OpenACC gives scientists and researchers a simple and powerful way to accelerate scientific computing without significant programming effort. The toolkit includes the PGI OpenACC Compiler, the NVIDIA Visual Profiler with CPU and GPU profiling, and the new OpenACC Programming and Best Practices Guide. Academics can get a free renewable license to the PGI C,C++ and Fortran compilers with support for OpenACC.

OpenACC is an open specification for compiler directives for parallel programming. By using OpenACC, developers can rapidly accelerate existing C, C++, and Fortran applications using high-level directives that help retain application portability across processor architectures. Figure 1 shows some examples of real code speedups with OpenACC. The OpenACC specification is designed and maintained with the cooperation of many industry and academic partners, such as Cray, AMD, PathScale, University of Houston, Oak Ridge National Laboratory and NVIDIA.

When I program with and teach OpenACC I like to use a 4 step cycle to progressively accelerate the code.

1. Identify Parallelism: Profile the code to understand where the program is spending its time and how much parallelism is available to be accelerated in those important routines. GPUs excel when there’s a significant amount of parallelism to exploit, so look for loops and loop nests with a lot of independent iterations.
2. Express Parallelism: Placing OpenACC directives on the loops identified in step 1 tells the compiler to parallelize them. OpenACC is all about giving the compiler enough information to effectively accelerate the code, so during this step I add directives to as many loops as I believe I can and move as much of the computation to the GPU as possible.
3. Express Data Locality: The compiler needs to know not just what code to parallelize, but also which data will be needed on the accelerator by that code. After expressing available parallelism, I often find that the code has slowed down. As you’ll see later in this post, this slowdown comes from the compiler making cautious decisions about when data needs to be moved to the GPU for computation. During this step, I’ll express to the compiler my knowledge of when and how the data is really needed on the GPU.
4. Optimize – The compiler usually does a very good job accelerating code, but sometimes you can get more performance by giving the compiler a little more information about the loops or by restructuring the code to increase parallelism or improve data access patterns. Most of the time this leads to small improvements, but sometimes gains can be bigger.

# GPU Pro Tip: CUDA 7 Streams Simplify Concurrency

Heterogeneous computing is about efficiently using all processors in the system, including CPUs and GPUs. To do this, applications must execute functions concurrently on multiple processors. CUDA Applications manage concurrency by executing asynchronous commands in streams, sequences of commands that execute in order. Different streams may execute their commands concurrently or out of order with respect to each other. [See the post How to Overlap Data Transfers in CUDA C/C++ for an example]

When you execute asynchronous CUDA commands without specifying a stream, the runtime uses the default stream. Before CUDA 7, the default stream is a special stream which implicitly synchronizes with all other streams on the device.

CUDA 7 introduces a ton of powerful new functionality, including a new option to use an independent default stream for every host thread, which avoids the serialization of the legacy default stream. In this post I’ll show you how this can simplify achieving concurrency between kernels and data copies in CUDA programs.