# Accelerating Bioinformatics with NVBIO

NVBIO is an open-source C++ template library of high performance parallel algorithms and containers designed by NVIDIA to accelerate sequence analysis and bioinformatics applications. NVBIO has a threefold focus:

1. Performance, providing a suite of state-of-the-art parallel algorithms that offer a significant leap in performance;
2. Reusability, providing a suite of highly expressive and flexible template algorithms that can be easily configured and adjusted to the many different usage scenarios typical in bioinformatics;
3. Portability, providing a completely cross-platform suite of tools, that can be easily switched from running on NVIDIA GPUs to multi-core CPUs by changing a single template parameter.

## Exponential Parallelism

We built NVBIO because we believe only the exponentially increasing parallelism of many-core GPU architectures can provide the immense computational capability required by the exponentially increasing sequencing throughput.

There is a common misconception that GPUs only excel at highly regular, floating point intensive applications, but today’s GPUs are fully programmable parallel processors, offering superior memory bandwidth and latency hiding characteristics, and R&D efforts at NVIDIA and elsewhere have proved that they can be a perfect match even for branchy, integer-heavy bioinformatics applications. The caveat is that legacy applications need to be rethought for fine-grained parallelism.

Many CPU algorithms are designed to run on few cores and scale to a tiny number of threads. When the number of threads is measured in the thousands, rather than dozens—a fact that all applications inevitably must consider—applications must tackle fundamental problems related to load balancing, synchronization, and execution and memory divergence.

NVBIO does just that, providing both low-level primitives that can be used from either CPU/host or GPU/device threads, as well as novel, highly parallel high-level primitives designed to scale from the ground up. Continue reading

# GPU Pro Tip: Fast Dynamic Indexing of Private Arrays in CUDA

Sometimes you need to use small per-thread arrays in your GPU kernels. The performance of accessing elements in these arrays can vary depending on a number of factors. In this post I’ll cover several common scenarios ranging from fast static indexing to more complex and challenging use cases.

## Static indexing

Before discussing dynamic indexing let’s briefly look at static indexing. For small arrays where all indices are known constants at compile time, as in the following sample code, the compiler places all accessed elements of the array into registers.

__global__ void kernel1(float * buf)
{
float a[2];
...
float sum = a[0] + a[1];
...
}

This way array elements are accessed in the fastest way possible: math instructions use the data directly without loads and stores.

A slightly more complex (and probably more useful) case is an unrolled loop over the indices of the array. In the following code the compiler is also capable of assigning the accessed array elements to registers.

__global__ void kernel2(float * buf)
{
float a[5];
...
float sum = 0.0f;
#pragma unroll
for(int i = 0; i < 5; ++i)
sum += a[i];
...
}

Here we tell the compiler to unroll the loop with the directive #pragma unroll, effectively replacing the loop with all the iterations listed explicitly, as in the following snippet.

sum += a[0];
sum += a[1];
sum += a[2];
sum += a[3];
sum += a[4];

All the indices are now constants, so the compiler puts the whole array into registers. Continue reading

# CUDA Pro Tip: Optimized Filtering with Warp-Aggregated Atomics

In this post, I’ll introduce warp-aggregated atomics, a useful technique to improve performance when many threads atomically add to a single counter. In warp aggregation, the threads of a warp first compute a total increment among themselves, and then elect a single thread to atomically add the increment to a global counter. This aggregation reduces the number of atomics performed by up to the number of threads in a warp (up to 32x on current GPUs), and can dramatically improve performance. Moreover, in many typical cases, you can implement warp aggregation as a drop-in replacement for standard atomic operations, so it is useful as a simple way to improve performance of complex applications.

## Problem: Filtering by a Predicate

Let’s consider the following problem, which we call filtering: we have a source array, src, containing n elements, and a predicate, and we need to copy all elements of src satisfying the predicate into the destination array, dst. For the sake of simplicity, assume that dst has length of at least n and that the order of elements in the dst array does not matter. For our example, we assume that the array elements are integers, and the predicate is true if and only if the element is positive. Here is a sample CPU implementation of filtering.

int filter(int *dst, const int *src, int n) {
int nres = 0;
for (int i = 0; i < n; i++)
if (src[i] > 0)
dst[nres++] = src[i];
// return the number of elements copied
return nres;
}

Filtering, also known as stream compaction, is a common operation, and it is a part of the standard libraries of many programming languages, where it goes under a variety of names, including grep, copy_if, select, FindAll and so on. It is also very often implemented simply as a loop, as it may be very tightly integrated with the surrounding code. Continue reading

# CUDA Pro Tip: Use cuFFT Callbacks for Custom Data Processing

Digital signal processing (DSP) applications commonly transform input data before performing an FFT, or transform output data afterwards. For example, if the input data is supplied as low-resolution samples from an 8-bit analog-to-digital (A/D) converter, the samples may first have to be expanded into 32-bit floating point numbers before the FFT and the rest of the processing pipeline can start.

The cuFFT library included with CUDA 6.5 introduces device callbacks to improve performance of this sort of transforms. Callback routines are user-supplied device functions that cuFFT calls when loading or storing data. You can use callbacks to implement many pre- or post-processing operations that required launching separate CUDA kernels before CUDA 6.5.

## Example DSP Pipeline

In this blog post we will implement the first stages of a typical DSP pipeline as depicted in Figure 1. We will first discuss a solution without callbacks using multiple custom kernels which we then use as a stepping stone towards a solution based on cuFFT device callbacks. The source code for both versions is available on github.

Batches of 8-bit fixed-point samples are input to the DSP pipline from an A/D converter. Each sample consists of 1024 data points. For more efficient processing, we group samples into batches of 1000 samples each. Therefore, you can think of this input as a 1000×1024 matrix of 8-bit fixed-point values. Continue reading

# CUDA Pro Tip: Occupancy API Simplifies Launch Configuration

CUDA programmers often need to decide on a block size to use for a kernel launch. For key kernels, its important to understand the constraints of the kernel and the GPU it is running on to choose a block size that will result in good performance. One common heuristic used to choose a good block size is to aim for high occupancy, which is the ratio of the number of active warps per multiprocessor to the maximum number of warps that can be active on the multiprocessor at once. Higher occupancy does not always mean higher performance, but it is a useful metric for gauging the latency hiding ability of a kernel.

Before CUDA 6.5, calculating occupancy was tricky. It required implementing a complex computation that took account of the present GPU and its capabilities (including register file and shared memory size), and the properties of the kernel (shared memory usage, registers per thread, threads per block). Implementating the occupancy calculation is difficult, so very few programmers take this approach, instead using the occupancy calculator spreadsheet included with the CUDA Toolkit to find good block sizes for each supported GPU architecture.

CUDA 6.5 includes several new runtime functions to aid in occupancy calculations and launch configuration. The core occupancy calculator API, cudaOccupancyMaxActiveBlocksPerMultiprocessor produces an occupancy prediction based on the block size and shared memory usage of a kernel. This function reports occupancy in terms of the number of concurrent thread blocks per multiprocessor. Note that this value can be converted to other metrics. Multiplying by the number of warps per block yields the number of concurrent warps per multiprocessor; further dividing concurrent warps by max warps per multiprocessor gives the occupancy as a percentage.

# A CUDA Dynamic Parallelism Case Study: PANDA

This post concludes an introductory series on CUDA Dynamic Parallelism. In my first post, I introduced Dynamic Parallelism by using it to compute images of the Mandelbrot set using recursive subdivision, resulting in large increases in performance and efficiency. The second post is an in-depth tutorial on the ins and outs of programming with Dynamic Parallelism, including synchronization, streams, memory consistency, and limits. In this post, I finish the series with a case study on an online track reconstruction algorithm for the high-energy physics PANDA experiment part of the (Facility for Antiproton and Ion Research in Europe (FAIR)). The PANDA work was carried out in the scope of the NVIDIA Application Lab at Jülich.

## The PANDA Experiment

PANDA (= anti-Proton ANnihilation at DArmstadt) is a state-of-the-art hadron particle physics experiment currently under construction at FAIR (Facility for Anti-proton and Ion Research) at Darmstadt. It is scheduled to start operation in 2019.

Inside the PANDA experiment, accelerated antiprotons will collide with protons, forming intermediate and unstable particles (mesons, baryons etc.), which will decay in cascades into stable particles, like electrons and photons. The unstable particles are of particular interest for PANDA, as they give insight into the processes governing this physics regime (QCD). Reconstructing all involved constituent particles of an event lets the physicists form a picture of the process, eventually confirming established physics theories, probing new ones and potentially finding exciting and unexpected results.

# CUDA Dynamic Parallelism API and Principles

This post is the second in a series on CUDA Dynamic Parallelism. In my first post, I introduced Dynamic Parallelism by using it to compute images of the Mandelbrot set using recursive subdivision, resulting in large increases in performance and efficiency. This post is an in-depth tutorial on the ins and outs of programming with Dynamic Parallelism, including synchronization, streams, memory consistency, and limits. My next post will finish the series with a case study on an online track reconstruction algorithm for the high-energy physics PANDA experiment (Facility for Antiproton and Ion Research in Europe (FAIR)).

## Grid Nesting and Synchronization

In the CUDA programming model, a group of blocks of threads that are running a kernel is called a grid. In CUDA Dynamic Parallelism, a parent grid launches kernels called child grids. A child grid inherits from the parent grid certain attributes and limits, such as the L1 cache / shared memory configuration and stack size. Note that every thread that encounters a kernel launch executes it. Therefore, if the parent grid has 128 blocks with 64 threads each, and there is no control flow around a child kernel launch, then the grid will perform a total of 8192 kernel launches. If you want a kernel to only launch one child grid per thread block, you should launch the kernel from a single thread of each block as in the following code.

if(threadIdx.x == 0) {
child_k <<< (n + bs - 1) / bs, bs >>> ();
}

# Adaptive Parallel Computation with CUDA Dynamic Parallelism

Early CUDA programs had to conform to a flat, bulk parallel programming model. Programs had to perform a sequence of kernel launches, and for best performance each kernel had to expose enough parallelism to efficiently use the GPU. For applications consisting of “parallel for” loops the bulk parallel model is not too limiting, but some parallel patterns—such as nested parallelism—cannot be expressed so easily. Nested parallelism arises naturally in many applications, such as those using adaptive grids, which are often used in real-world applications to reduce computational complexity while capturing the relevant level of detail. Flat, bulk parallel applications have to use either a fine grid, and do unwanted computations, or use a coarse grid and lose finer details.

CUDA 5.0 introduced Dynamic Parallelism, which makes it possible to launch kernels from threads running on the device; threads can launch more threads. An application can launch a coarse-grained kernel which in turn launches finer-grained kernels to do work where needed. This avoids unwanted computations while capturing all interesting details, as Figure 1 shows.

Dynamic parallelism is generally useful for problems where nested parallelism cannot be avoided. This includes, but is not limited to, the following classes of algorithms:

• algorithms using hierarchical data structures, such as adaptive grids;
• algorithms using recursion, where each level of recursion has parallelism, such as quicksort;
• algorithms where work is naturally split into independent batches, where each batch involves complex parallel processing but cannot fully use a single GPU.

Dynamic parallelism is available in CUDA 5.0 and later on devices of Compute Capability 3.5 or higher (sm_35). (See NVIDIA GPU Compute Capabilities.)

This post introduces Dynamic Parallelism by example using a fast hierarchical algorithm for computing images of the Mandelbrot set.  This is the first of a three part series on CUDA Dynamic Parallelism:

# CUDA Pro Tip: Fast and Robust Computation of Givens Rotations

A Givens rotation [1] represents a rotation in a plane represented by a matrix of the form

$G(i, j, \theta) = \begin{bmatrix} 1 & \cdots & 0 & \cdots & 0 & \cdots & 0 \\ \vdots & \ddots & \vdots & & \vdots & & \vdots \\ 0 & \cdots & c & \cdots & -s & \cdots & 0 \\ \vdots & & \vdots & \ddots & \vdots & & \vdots \\ 0 & \cdots & s & \cdots & c & \cdots & 0 \\ \vdots & & \vdots & & \vdots & \ddots & \vdots \\ 0 & \cdots & 0 & \cdots & 0 & \cdots & 1 \end{bmatrix}$,

where the intersections of the $i$th and $j$th columns contain the values $c=cos \theta$ and $s=sin \theta$. Multiplying a vector $x$ by a Givens rotation matrix $G(i, j, \theta)$ represents a rotation of the vector $x$ in the $(i, j)$ plane by $\theta$ radians.

According to Wikipedia, the main use of Givens rotations in numerical linear algebra is to introduce zeros in vectors or matrices. Importantly, that means Givens rotations can be used to compute the QR decomposition of a matrix. An important advantage over Householder transformations is that Givens rotations are easy to parallelize. Continue reading

# Separate Compilation and Linking of CUDA C++ Device Code

This Post was Co-Written by Tony Scudiero and Mike Murphy.

Managing complexity in large programs requires breaking them down into components that are responsible for small, well-defined portions of the overall program. Separate compilation is an integral part of the C and C++ programming languages which allows portions of a program to be compiled into separate objects and then linked together to form an executable or library. Developing large and complex GPU programs is no different, and starting with CUDA 5.0, separate compilation and linking are now important tools in the repertoire of CUDA C/C++ programmers.

In this post, we explore separate compilation and linking of device code and highlight situations where it is helpful. In the process, we’ll walk through a simple example to show how device code linking can let you move existing code to the GPU with minimal changes to your class hierarchy and build infrastructure.

One of the key limitations that device code linking lifts is the need to have all the code for a GPU kernel present when compiling the kernel, including all the device functions that the kernel calls. As C++ programmers, we are used to calling externally defined functions simply by declaring the functions’ prototypes (or including a header that declares them).

## Managing Complexity with Separate Compilation

The common approach to organizing and building C++ applications is to define all member functions of a single class in one or more .cpp source files, and compile each .cpp source file into a separate .o object file. Other classes and functions may call these member functions from anywhere in the program by including the class header file; the function implementation is not needed to compile another function that calls it. After compiling all code, the linker connects calls to functions implemented in other files as part of the process of generating the executable.

Let’s imagine a very simple example application which has two classes: a particle class and a three-dimensional vector class, v3, that it uses. Our main task is moving the particle objects along randomized trajectories. Particle filters and Monte Carlo simulations frequently involve operations of this sort. We’ll use a CUDA C++ kernel in which each thread calls particle::advance() on a particle. Continue reading