# Parallel Direct Solvers with cuSOLVER: Batched QR

[Note: Lung Sheng Chien from NVIDIA also contributed to this post.]

A key bottleneck for most science and engineering simulations is the solution of sparse linear systems of equations, which can account for up to 95% of total simulation time. There are two types of solvers for these systems: iterative and direct solvers.  Iterative solvers are favored for the largest systems these days (see my earlier posts about AmgX), while direct solvers are useful for smaller systems because of their accuracy and robustness.

CUDA 7 expands the capabilities of GPU-accelerated numerical computing with cuSOLVER, a powerful new suite of direct linear system solvers.   These solvers provide highly accurate and robust solutions for smaller systems, and cuSOLVER offers a way of combining many small systems into a ‘batch’ and solving all of them in parallel, which is critical for the most complex simulations today.   Combustion models, bio-chemical models and advanced high-order finite-element models all benefit directly from this new capability.  Computer vision and object detection applications need to solve many least-squares problems, so they will also benefit from cuSOLVER.

Direct solvers rely on algebraic factorization of a matrix, which breaks a hard-to-solve matrix into two or more easy-to-solve factors, and a solver routine that takes the factors and a right-hand-side vector and solves with the factors one at a time to give a highly accurate solution. Figure 1 shows an example of $LDL^T$ factorization of a dense matrix. A solver for this factorization would first solve with the $L$ factor, then apply the inverse of the $D$ (diagonal) part in parallel, then solve with the transpose of $L$ to arrive at the final answer. The benefit of direct solvers is that, unlike iterative solvers, they always find a solution (when the factors exist; more on this later), and once a factorization is found, solutions for many right-hand sides can be computed using the factors at a much lower cost per solution. Also, for small systems, direct solvers are typically faster than iterative methods because they only pass over the matrix once.
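To make the solve phase concrete, here is a minimal host-only C++ sketch of solving $LDL^T x = b$ once the factors are known: forward substitution with $L$, a diagonal scaling with $D$, then back substitution with $L^T$. The function name `ldlt_solve`, the dense row-major storage, and the unit-diagonal convention for $L$ are assumptions made for this illustration; this is not cuSOLVER code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Dense row-major n x n matrix stored in a flat vector (an assumption of this sketch).
using Matrix = std::vector<double>;

// Solve (L * D * L^T) x = b, where L is unit lower triangular and d holds
// the diagonal of D. Only the strictly lower part of L is read.
std::vector<double> ldlt_solve(const Matrix& L, const std::vector<double>& d,
                               std::vector<double> b) {
    const std::size_t n = d.size();
    assert(L.size() == n * n && b.size() == n);

    // Step 1: forward substitution, L y = b (y overwrites b).
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < i; ++j)
            b[i] -= L[i * n + j] * b[j];

    // Step 2: diagonal scaling, D z = y. Every entry is independent,
    // which is why this step parallelizes trivially.
    for (std::size_t i = 0; i < n; ++i)
        b[i] /= d[i];

    // Step 3: back substitution, L^T x = z, using (L^T)(i,j) = L(j,i).
    for (std::size_t i = n; i-- > 0;)
        for (std::size_t j = i + 1; j < n; ++j)
            b[i] -= L[j * n + i] * b[j];

    return b;
}
```

Because the factors are passed in, the same `L` and `d` can be reused for any number of right-hand sides, each costing only the three cheap sweeps above.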

In this post I give an overview of cuSOLVER followed by an example of using batch QR factorization for solving many sparse systems in parallel. In a followup post I will cover other aspects of cuSOLVER, including dense system solvers and the cuSOLVER refactorization API.

# Accelerating Dissipative Particle Dynamics Simulation on Tesla GPUs

As you are probably aware, CUDA 7 was officially released during the 2015 GPU Technology Conference. For this Spotlight I took a few minutes to pick the brain of an early adopter of CUDA 7 to see how his work benefits from the new C++11 support.

I interviewed Yu-Hang Tang, a Ph.D. candidate in the Division of Applied Mathematics at Brown University in Providence, Rhode Island.

### What breakthrough project is currently taking up all of your brain’s time?

At this moment we are finalizing a particle-based simulator for the in silico investigation of microfluidic devices used in cancer diagnostics. The code enables us to predict the behavior of cancer cells as well as blood cells in various microfluidic channels. It could significantly speed up the process of microfluidic device design, which is usually time-consuming due to the large number of trial-and-error experiments.

We will release the work by the end of April, and I will be happy to share more details then.

I started programming GeForce GTX 460 GPUs with OpenCL in 2010, and in 2012 I shifted entirely to CUDA C++.

Right now, I use mostly Kepler GPUs with high double-precision floating-point performance. I have been focused on accelerating particle-based simulations including All-Atom Molecular Dynamics (AAMD), Dissipative Particle Dynamics (DPD) and Smoothed Particle Hydrodynamics (SPH).

In fact, I have developed an entire GPU package, USERMESO, for DPD and SPH simulations in the LAMMPS (Large-scale Atomic/Molecular Massively Parallel Simulator) particle simulator. The package achieves a 20x to 30x speedup on a single K20 GPU over 16 AMD CPU cores on a Cray XK7 compute node.

### How has GPU computing impacted your research?

Our USERMESO package allows us to simulate DPD systems containing several million particles for millions of time steps on a daily basis while studying the self-assembly behavior of amphiphilic polymers. The multi-compartment multi-walled vesicle that Figure 1 shows, which you can think of as a miniature cell, is only observable at a spatial-temporal scale that is tens of times larger and tens of times longer than that covered by typical contemporary DPD simulations. With the USERMESO code we can perform such simulations daily with just 16 GPUs!

# Get Ready for the Low-Power Image Recognition Challenge with Jetson TK1

Image recognition and GPUs go hand-in-hand, particularly when using deep neural networks (DNNs). The strength of GPU-based DNNs for image recognition has been unequivocally demonstrated by their success over the past few years in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), and DNNs have recently achieved classification accuracy on par with trained humans, as Figure 1 shows. The new Low-Power Image Recognition Challenge (LPIRC) highlights the importance of image recognition on mobile and embedded devices.

DNNs with convolutional layers are biologically inspired artificial neural networks. These networks may have five or more layers with many neurons in each layer. Links similar to synapses connect the layers, forwarding information to the next layer. The training process adjusts weights on the links, improving the network’s ability to classify the information presented to it. The more data used to train a DNN, the better its classification performance. This big data requirement has resulted in heavy GPU use, because GPUs are designed for high throughput on highly parallel computations like those used in deep learning.

ImageNet is a great resource for imagery, hosting a large database of images organized according to a hierarchy of descriptive nouns. Each year, ImageNet hosts the ILSVRC, for which entrants develop algorithms for accurately recognizing objects in the images. ImageNet provides a large image set of over 1.2 million images from 1000 different object categories for training recognition algorithms. Academic as well as industrial participants have performed strongly, with competitors from Google, Stanford University, University of California, Berkeley, and Adobe (among many others) in recent years.

## A Low-Power Challenge

To motivate improved image recognition on low-power devices, Yung-Hsiang Lu, Associate Professor of Electrical and Computer Engineering at Purdue University, and Alex Berg, Assistant Professor of Computer Science at UNC Chapel Hill, are organizing the Low-Power Image Recognition Challenge (LPIRC), a competition focused on identifying the best technology in both image recognition and energy conservation. Registration for the LPIRC is now open.

Achieving high performance while maintaining low power can be challenging, as these two parameters often increase together. Last year NVIDIA released the Jetson TK1 Development Kit, a low-power GPU-accelerated computing platform that is well-suited for image processing and computer vision applications. Jetson TK1’s low power requirements and image processing capabilities will make it a popular platform for LPIRC competitors.

# Porting Scientific Applications to GPUs at the OLCF OpenACC Hackathon

Six scientific computing teams from around the world spent an intense week late last year porting their applications to GPUs using OpenACC directives. The Oak Ridge Leadership Computing Facility (OLCF) hosted its first ever OpenACC Hackathon in Knoxville, Tennessee. Paired with two GPU mentors, each team of scientific developers set forth on the journey to accelerate their code with GPUs.

Dr. Misun Min, a computational scientist at Argonne National Laboratory, led the NekCEM Team and she shared the results of accelerating NekCEM with OpenACC and NVIDIA GPUDirect™ communication.

### Who were the NekCEM hackathon team members, and how much GPU computing experience did your team have?

I have only six months of experience, and at the time of the Hackathon I didn’t really have any. The other members included Matthew Otten from Cornell University (six months of GPU computing experience), Jing Gong from KTH in Sweden (two years of OpenACC experience), and Azamat Mametjanov from Argonne. The team also had close interactions with Nek5000 developer Paul Fischer at UIUC for useful discussions.

The team had two mentors from Cray Inc., Aaron Vose and John Levesque. Aaron and John provided strong technical support to boost the performance of a GPU-enabled NekCEM version.

### First, what is NekCEM?

NekCEM (Nekton for Computational ElectroMagnetics) is an open-source code designed for predictive modeling of electromagnetic systems, such as linear accelerators, semiconductors, plasmonic devices, and quantum systems described by the Maxwell, Helmholtz, drift-diffusion, and Schrödinger or density matrix equations. The code is based on high-order discretizations of the underlying partial differential equations using spectral element (SE) and spectral-element discontinuous Galerkin (SEDG) schemes that have been shown to require order-of-magnitude fewer grid points than do conventional low-order schemes for the same accuracy. NekCEM uses globally unstructured meshes comprising body-fitted curvilinear hexahedral elements, which allow the discrete operators to be expressed as matrix-matrix products applied to arrays of the tensor product basis of Lagrange interpolation polynomials on the Gauss-Lobatto-Legendre quadrature points. The tight coupling of the degrees of freedom within elements leads to efficient data reuse while requiring boundary-minimal (unit-depth-stencil) data communication to effect flux exchanges between neighboring elements.

### What were your team’s goals going into the OpenACC Hackathon?

The team had two goals: (1) to develop a high-performance GPU-based operational variant of NekCEM that supports the full functionality of the existing CPU-only code in Fortran/C and (2) to perform analysis to find performance bottlenecks and infer potential scalability for GPU-based architectures of the future.

# cuDNN v2: Higher Performance for Deep Learning on GPUs

The cuDNN library team is excited to announce the second version of cuDNN, NVIDIA’s library of GPU-accelerated primitives for deep neural networks (DNNs). We are proud that the cuDNN library has seen broad adoption by the deep learning research community and is now integrated into major deep learning toolkits such as Caffe, Theano and Torch. While cuDNN was conceived with developers of deep learning toolkits and systems in mind, this release is all about features and performance for the deep learning practitioner. Before we get into those details though, let’s provide some context.

## Deep Learning for Big Data

Data science and machine learning have been growing rapidly in importance in recent years, along with the volume of “big data”. Machine learning provides techniques for developing systems that can automatically recognize, categorize, locate or filter the torrent of big data that flows endlessly into corporate servers (and our email inboxes). Deep neural networks (DNNs) have become an especially successful and popular technique, because DNNs are relatively straightforward to implement and scale well—the more data you throw at them the better they perform. Most importantly, DNNs are now established as the most accurate technique across a range of problems, including image classification, object detection, and text and speech recognition. In fact, research teams from Microsoft, Google and Baidu have recently shown DNNs that perform better on an image recognition task than a trained human observer!

Deep learning and machine learning have been popular topics on Parallel Forall recently, so here are some pointers to excellent recent posts for more information. The original cuDNN announcement post provides an introduction to machine learning, deep learning and cuDNN. There are excellent posts on using cuDNN with Caffe for computer vision, with Torch for natural language understanding, on how Baidu uses cuDNN for speech recognition, and on embedded deep learning on Jetson TK1. There is also a recent post about BIDMach, an accelerated framework for machine learning techniques that are not neural network-based (SVMs, K-means, linear regression and so on).

# C++11 in CUDA: Variadic Templates

CUDA 7 adds C++11 feature support to nvcc, the CUDA C++ compiler. This means that you can use C++11 features not only in your host code compiled with `nvcc`, but also in device code. In my post “The Power of C++11 in CUDA 7” I covered some of the major new features of C++11, such as lambda functions, range-based for loops, and automatic type deduction (`auto`). In this post, I’ll cover variadic templates.

There are times when you need to write functions that take a variable number of arguments: variadic functions. To do this in a typesafe manner for polymorphic functions, you really need to take a variable number of types in a template. Before C++11, the only way to write variadic functions was with the ellipsis (`...`) syntax and the `va_*` facilities. These facilities did not enable type safety and can be difficult to use.
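As a small illustration of the type-safe alternative, here is a sketch of a variadic function template in plain C++11. The name `sum` and the choice to return the type of the first argument are assumptions made for this example, not something taken from the post or from Hemi.

```cpp
// Base case: a single value is its own sum.
template <typename T>
T sum(T value) {
    return value;
}

// Recursive case: peel off the first argument and recurse on the rest.
// The result has the type of the first argument, so the partial sum of the
// remaining arguments is converted to First (a simplification for this sketch).
template <typename First, typename... Rest>
First sum(First first, Rest... rest) {
    return first + static_cast<First>(sum(rest...));
}
```

Unlike a `va_*`-based function, the compiler knows the type of every argument here, so a call such as `sum(1.5, 2, 3)` is checked at compile time and needs no format string or argument count.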

As an example, let’s say we want to abstract the launching of GPU kernels. In my case, I want to provide simpler launch semantics in the Hemi library. There are many cases where you don’t care to specify the number and size of thread blocks—you just want to run a kernel with “enough” threads to fully utilize the GPU, or to cover your data size. In that case we can let the library decide how to launch the kernel, simplifying our code. But to launch arbitrary kernels, we have to support arbitrary type signatures. Well, we can do that like this:

```
template <typename... Arguments>
void cudaLaunch(const ExecutionPolicy &p,
                void(*f)(Arguments...),
                Arguments... args);
```

Here, `Arguments...` is a “type template parameter pack”. We can use it to refer to the type signature of our kernel function pointer `f`, and to the arguments of `cudaLaunch`. To do the same thing before C++11 (and CUDA 7) required providing multiple implementations of `cudaLaunch`, one for each number of arguments we wanted to support. That meant you had to limit the maximum number of arguments allowed, as well as the amount of code you had to maintain. In my experience this was prone to bugs. Here’s the implementation of `cudaLaunch`.
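The post's full `cudaLaunch` is not reproduced in this excerpt. As a rough, host-only illustration of the same parameter-pack pattern, the sketch below forwards a pack to an ordinary function pointer. The names `launch_on_host` and `add_into` are hypothetical and invented for this example; no kernel launch or `ExecutionPolicy` is involved.

```cpp
// Hypothetical host-only stand-in for a launcher: it forwards a parameter
// pack to a plain function pointer instead of launching a GPU kernel.
template <typename... Arguments>
void launch_on_host(void (*f)(Arguments...), Arguments... args) {
    f(args...);  // pack expansion: becomes f(arg0, arg1, ...)
}

// Example target function with three parameters.
void add_into(int* out, int a, int b) {
    *out = a + b;
}
```

A call such as `launch_on_host(add_into, &result, 2, 3)` works for any number of parameters with a single template. Note that `Arguments` is deduced from both the function pointer and the call arguments, so in this sketch the argument types must match the function's signature exactly.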

# The Power of C++11 in CUDA 7

Today I’m excited to announce the official release of CUDA 7, the latest release of the popular CUDA Toolkit. Download the CUDA Toolkit version 7 now from CUDA Zone!

CUDA 7 has a huge number of improvements and new features, including C++11 support, the new cuSOLVER library, and support for Runtime Compilation. In a previous post I told you about the features of CUDA 7, so I won’t repeat myself here. Instead, I wanted to take a deeper look at C++11 support in device code.

CUDA 7 adds C++11 feature support to nvcc, the CUDA C++ compiler. This means that you can use C++11 features not only in your host code compiled with `nvcc`, but also in device code. New C++ language features include `auto`, lambda functions, variadic templates, `static_assert`, rvalue references, range-based for loops, and more. To enable C++11 support, pass the flag `--std=c++11` to `nvcc` (this option is not required for Microsoft Visual Studio).

In my earlier CUDA 7 feature overview post, I presented a small example to show some C++11 features. Let’s dive into a somewhat expanded example to show the power of C++11 for CUDA programmers. This example will proceed top-down, covering a couple of layers of abstraction that allow us to write concise, reusable C++ code for the GPU, all enabled by C++11. The complete example is available on GitHub.

Let’s say we have a very specific (albeit contrived) goal: count the number of characters from a certain set within a text. (In parallel, of course!) Here’s a simple CUDA C++11 kernel that abstracts the mechanics of this a bit.

```
__global__
void xyzw_frequency(int *count, char *text, int n)
{
  const char letters[] { 'x','y','z','w' };

  count_if(count, text, n, [&](char c) {
    for (const auto x : letters)
      if (c == x) return true;
    return false;
  });
}
```
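The kernel above relies on a `count_if` helper that this excerpt does not show. To see what the lambda is doing, here is a sequential, host-only sketch of a helper with the same argument order; `count_if_host` and `xyzw_frequency_host` are hypothetical names for this illustration, and the real device-side helper would distribute the loop across GPU threads and update the counter atomically.

```cpp
// Hypothetical host-only analogue of a count_if helper: counts the
// characters in text[0..n) for which predicate p returns true and adds
// that number to *count.
template <typename Predicate>
void count_if_host(int* count, const char* text, int n, Predicate p) {
    for (int i = 0; i < n; ++i)
        if (p(text[i])) ++(*count);
}

// Host counterpart of xyzw_frequency, using the same C++11 features:
// brace initialization, a lambda, and a range-based for loop with auto.
int xyzw_frequency_host(const char* text, int n) {
    const char letters[] { 'x', 'y', 'z', 'w' };
    int count = 0;
    count_if_host(&count, text, n, [&](char c) {
        for (const auto x : letters)
            if (c == x) return true;
        return false;
    });
    return count;
}
```

Keeping the predicate as a template parameter means the counting loop is written once and reused with any character test.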

# DIGITS: Deep Learning GPU Training System

The hottest area in machine learning today is Deep Learning, which uses Deep Neural Networks (DNNs) to teach computers to detect recognizable concepts in data. Researchers and industry practitioners are using DNNs in image and video classification, computer vision, speech recognition, natural language processing, and audio recognition, among other applications.

The success of DNNs has been greatly accelerated by using GPUs, which have become the platform of choice for training these large, complex DNNs, reducing training time from months to only a few days. The major deep learning software frameworks have incorporated GPU acceleration, including Caffe, Torch7, Theano, and CUDA-Convnet2. Because of the increasing importance of DNNs in both industry and academia and the key role of GPUs, last year NVIDIA introduced cuDNN, a library of primitives for deep neural networks.

Today at the GPU Technology Conference, NVIDIA CEO and co-founder Jen-Hsun Huang introduced DIGITS, the first interactive Deep Learning GPU Training System. DIGITS is a new system for developing, training and visualizing deep neural networks. It puts the power of deep learning into an intuitive browser-based interface, so that data scientists and researchers can quickly design the best DNN for their data using real-time network behavior visualization. DIGITS is open-source software, available on GitHub, so developers can extend or customize it or contribute to the project.

Deep Learning is an approach to training and employing multi-layered artificial neural networks to assist in or complete a task without human intervention. DNNs for image classification typically use a combination of convolutional neural network (CNN) layers and fully connected layers made up of artificial neurons tiled so that they respond to overlapping regions of the visual field.

# GPU Pro Tip: Fast Histograms Using Shared Atomics on Maxwell

Histograms are an important data representation with many applications in computer vision, data analytics and medical imaging. A histogram is a graphical representation of the data distribution across predefined bins. The input data set and the number of bins can vary greatly depending on the domain, so let’s focus on one of the most common use cases: an image histogram using 256 bins for each color channel. Even though we’ll use a specific problem setup, the same algorithms can benefit other computational domains as well.

A basic serial image histogram computation is relatively simple. For each pixel of the image and for each RGB color channel we find a corresponding integer bin from 0 to 255 and increment its value. Atomic operations are a natural way of implementing histograms on parallel architectures. Depending on the input distribution, some bins will be used much more than others, so it is necessary to support efficient accumulation of the values across the full memory hierarchy. This is similar to reduction and scan operations, but the main challenge with histograms is that the output location for each element is not known prior to reading its value. Therefore, it is impossible to create a generic parallel accumulation scheme that completely avoids collisions. Histograms are now much easier to handle on GPU architectures thanks to the improved atomics performance in Kepler and native support of shared memory atomics in Maxwell.
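The serial computation described above can be sketched in a few lines of host C++. The interleaved RGB pixel layout, the type alias `ChannelHistograms`, and the function name `histogram_rgb` are assumptions made for this illustration.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// One 256-bin histogram per color channel (R, G, B).
using ChannelHistograms = std::array<std::array<unsigned int, 256>, 3>;

// pixels holds interleaved 8-bit RGB values: R0 G0 B0 R1 G1 B1 ...
// (the interleaved layout is an assumption of this sketch).
ChannelHistograms histogram_rgb(const std::vector<std::uint8_t>& pixels) {
    ChannelHistograms hist{};  // value-initialized: all bins start at zero
    for (std::size_t i = 0; i + 2 < pixels.size(); i += 3)
        for (std::size_t c = 0; c < 3; ++c)
            ++hist[c][pixels[i + c]];  // the bin index is the channel value itself
    return hist;
}
```

The bin to increment is only known after reading the pixel, which is exactly the data-dependent write that makes a parallel version need atomics.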

Our histogram implementation has two phases and two corresponding CUDA C++ kernels, as Figure 1 shows. In the first phase each CUDA thread block processes a region of the image and accumulates a corresponding local histogram, storing the local histogram in global memory at the end of the phase. The second kernel accumulates all per-block histograms into the final histogram stored in global memory. The work separation between blocks in the first phase reduces contention when accumulating values into the same bin.
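The two-phase structure can be mimicked on the host to show the data flow. In this sketch the "blocks" simply run one after another over single-channel data, so no atomics are needed; the names `Histogram` and `histogram_two_phase` and the even chunking scheme are assumptions for illustration, not the post's kernels.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

using Histogram = std::array<unsigned int, 256>;

// Two-phase histogram over single-channel 8-bit data. num_blocks plays the
// role of the number of thread blocks; here the blocks run sequentially.
Histogram histogram_two_phase(const std::vector<std::uint8_t>& data,
                              std::size_t num_blocks) {
    if (num_blocks == 0) num_blocks = 1;
    const std::size_t chunk = (data.size() + num_blocks - 1) / num_blocks;

    // Phase 1: each block builds a private histogram of its own region.
    std::vector<Histogram> partial(num_blocks, Histogram{});
    for (std::size_t b = 0; b < num_blocks; ++b) {
        const std::size_t begin = std::min(b * chunk, data.size());
        const std::size_t end = std::min(begin + chunk, data.size());
        for (std::size_t i = begin; i < end; ++i)
            ++partial[b][data[i]];
    }

    // Phase 2: sum the per-block histograms bin by bin.
    Histogram total{};
    for (const Histogram& p : partial)
        for (std::size_t bin = 0; bin < 256; ++bin)
            total[bin] += p[bin];
    return total;
}
```

Because each block writes only to its own partial histogram in phase 1, blocks never compete for the same bin; contention is confined to the threads within a block.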

# HPC Visualization on NVIDIA Tesla GPUs

HPC looks very different today than it did when I was a graduate student in the mid-90s. Today’s supercomputers are many orders of magnitude faster than the machines of the 90s, and GPUs have helped push arithmetic performance on several leading systems to stratospheric levels. Unfortunately, the arithmetic performance wrought by two decades of supercomputer design has created tremendous I/O and visualization challenges that must be overcome, reflected by the famous statement:

“A supercomputer is a device for turning compute-bound problems into
I/O-bound problems.” — Ken Batcher

## Molecular visualization with VMD

Since 1998, I’ve been leading the development of VMD, a popular molecular visualization and analysis application that is used by scientists all over the world. Among similar programs, VMD is particularly focused on capabilities that support large-scale molecular dynamics simulations and cellular modeling.

The movies in this article are examples of the kind of visualizations we regularly produce with parallel VMD visualization runs that use OptiX and/or OpenGL running on the Tesla GPUs in the Blue Waters and Titan supercomputers. These example movies highlight the science done by my colleagues in the Theoretical and Computational Biophysics Group, led by Prof. Klaus Schulten at U. Illinois.

Some key areas of our ongoing VMD development involve the continued adaptation of the program for petascale and exascale supercomputers, advancing the molecular visualization state-of-the-art with parallel and interactive ray tracing techniques, exploiting massively parallel GPU accelerators for both visualization and analysis tasks, and supporting remote visualization and collaboration on HPC platforms. The combination of these VMD development tracks and current technological progress in HPC, GPUs, and visualization algorithms is leading in a very exciting direction.