Cutting Edge Parallel Algorithms Research with CUDA

Leyuan Wang, a Ph.D. student in the UC Davis Department of Computer Science, presented one of only two “Distinguished Papers” among the 51 accepted at Euro-Par 2015. Euro-Par is a European conference devoted to all aspects of parallel and distributed processing, held August 24-28 at the Vienna University of Technology in Austria.

Leyuan’s paper, “Fast Parallel Suffix Array on the GPU”, co-authored by her advisor John Owens and Sean Baxter, a research scientist at D. E. Shaw Research in New York, details their efforts to implement a linear-time suffix array construction algorithm on NVIDIA GPUs, resulting in algorithmic improvements and significant speedups over the existing state of the art.

Wang completed her master’s degree in electrical and computer engineering at UC Davis in October 2014, after having earned her undergraduate degree in electronics science and technology at China’s Zhejiang University.

Brad: Can you talk a bit about your current research?

Leyuan Wang: I work on high-performance string processing and graph processing algorithms, mostly for string and graph queries. My current research focus is on GPGPU (general-purpose computing on graphics processing units), and the benchmark I care about most is speed. I’ve been working on designing and improving parallel suffix array construction algorithms (SACAs) and incorporating the implementations into a Burrows-Wheeler-transform-based lossless data compressor (bzip2) and a parallel FM index for pattern searching. The suffix array (SA) of a string is the sorted set of all suffixes of the string; the inverse suffix array (ISA) gives the lexicographic rank of each suffix.

The Burrows-Wheeler transform (BWT) of a string is generated by lexicographically sorting the cyclic shifts of the string to form a string matrix and taking the last column of that matrix. The BWT groups repeated characters together by permuting the string; it is also reversible, which means the original string can be recovered. These two characteristics make the BWT a popular choice for a compression pipeline stage (for instance, in bzip2). It is directly related to the suffix array: the sorted rows of the matrix are essentially the sorted suffixes of the string, and the first column of the matrix reflects the suffix array. Table 1 shows an example of the SA, ISA and BWT of the input string “banana$”.

Table 1: SA, ISA and BWT for the example string “banana$”.
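
To make those definitions concrete, the following tiny host-only C++ sketch (a deliberately naive illustration, not our GPU code) computes all three for “banana$”; with 0-based indexing it prints SA = 6 5 3 1 0 4 2, ISA = 4 3 6 2 5 1 0 and BWT = “annb$aa”.

    // A tiny, deliberately naive illustration (not the GPU algorithm from the
    // paper): compute the SA, ISA and BWT of "banana$" on the host.
    #include <algorithm>
    #include <cstdio>
    #include <numeric>
    #include <string>
    #include <vector>

    int main() {
        const std::string s = "banana$";   // '$' is a unique, smallest sentinel
        const int n = (int)s.size();

        // Suffix array: start positions of all suffixes, sorted lexicographically.
        std::vector<int> sa(n);
        std::iota(sa.begin(), sa.end(), 0);
        std::sort(sa.begin(), sa.end(),
                  [&](int a, int b) { return s.substr(a) < s.substr(b); });

        // Inverse suffix array: ISA[i] is the lexicographic rank of the suffix at i.
        std::vector<int> isa(n);
        for (int r = 0; r < n; ++r) isa[sa[r]] = r;

        // BWT: last column of the sorted rotation matrix, i.e. the character that
        // precedes each sorted suffix (wrapping around at the start of the string).
        std::string bwt(n, ' ');
        for (int r = 0; r < n; ++r) bwt[r] = s[(sa[r] + n - 1) % n];

        printf("SA :"); for (int r = 0; r < n; ++r) printf(" %d", sa[r]); printf("\n");
        printf("ISA:"); for (int i = 0; i < n; ++i) printf(" %d", isa[i]); printf("\n");
        printf("BWT: %s\n", bwt.c_str());  // prints "annb$aa"
        return 0;
    }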

The suffix array data structure is a building block in a spectrum of applications, including data compression, bioinformatics, and text indexing. I’ve studied the taxonomy of all classes of SACAs and compared them in order to find the best candidate for the GPU. I revisited the previous conclusion that skew SACAs are best suited to the GPU by demonstrating that prefix-doubling SACAs are actually better in both theoretical analysis and experimental benchmarks. Our hybrid skew/prefix-doubling suffix array implementation (with our amazing research collaborator Sean Baxter, formerly of NVIDIA Research) achieves a 7.9x speedup on a Tesla K20 over the previous state-of-the-art skew implementation. Our optimized skew SACA implementation has been added as a primitive to CUDPP 2.2 (CUDA Data Parallel Primitives Library) and incorporated into the BWT and bzip2 data compression application, resulting in great speedups compared with bzip2 in CUDPP 2.1. Figure 1 shows pseudocode for our two approaches.
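
For readers unfamiliar with prefix doubling, the sketch below shows the classic sequential form of the idea (Manber-Myers style); a GPU version parallelizes the sorting and re-ranking steps, but the doubling structure is the same. This is only an illustration, not our GPU implementation.

    // Minimal sequential sketch of prefix doubling (Manber-Myers style,
    // O(n log^2 n)). After the k-th pass, suffixes are ranked by their
    // first 2^k characters.
    #include <algorithm>
    #include <cstdio>
    #include <numeric>
    #include <string>
    #include <vector>

    std::vector<int> suffix_array_prefix_doubling(const std::string& s) {
        const int n = (int)s.size();
        std::vector<int> sa(n), rank(n), tmp(n);
        std::iota(sa.begin(), sa.end(), 0);
        for (int i = 0; i < n; ++i) rank[i] = (unsigned char)s[i];  // 1-char ranks

        for (int h = 1; ; h *= 2) {
            // Order suffixes by the pair (rank of first h chars, rank of next h chars).
            auto cmp = [&](int a, int b) {
                if (rank[a] != rank[b]) return rank[a] < rank[b];
                int ra = a + h < n ? rank[a + h] : -1;
                int rb = b + h < n ? rank[b + h] : -1;
                return ra < rb;
            };
            std::sort(sa.begin(), sa.end(), cmp);    // a GPU version sorts in parallel

            // Re-rank: suffixes that compare equal keep the same rank.
            tmp[sa[0]] = 0;
            for (int i = 1; i < n; ++i)
                tmp[sa[i]] = tmp[sa[i - 1]] + (cmp(sa[i - 1], sa[i]) ? 1 : 0);
            rank = tmp;
            if (rank[sa[n - 1]] == n - 1) break;     // all ranks distinct: done
        }
        return sa;
    }

    int main() {
        for (int i : suffix_array_prefix_doubling("banana$")) printf("%d ", i);
        printf("\n");                                // prints: 6 5 3 1 0 4 2
        return 0;
    }

Continue reading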


Accelerating Materials Discovery with CUDA

In this post, we discuss how CUDA has facilitated materials research in the Department of Chemical and Biomolecular Engineering at UC Berkeley and Lawrence Berkeley National Laboratory. This post is a collaboration between Cory Simon, Jihan Kim, Richard L. Martin, Maciej Haranczyk, and Berend Smit.

Engineering Applications of Nanoporous Materials

Figure 1: The repeating crystal structure of metal-organic framework IRMOF-1. Atom color dictionary = {carbon: gray, oxygen: red, zinc: blue, hydrogen: white}.

Nanoporous materials have nano-sized pores such that only a few molecules can fit inside. Figure 1 shows the chemical structure of metal-organic framework IRMOF-1, just one of the many thousands of nanoporous materials that have been synthesized.

Nanoporous materials have many potential engineering applications based on gas adsorption: the process by which gas molecules adhere to a surface. In this case, the walls of the material’s pores form the surface to which gas molecules stick. Figure 2 shows the unit cell of the IRMOF-1 crystal structure and the corresponding depiction of IRMOF-1 as a raveled-up surface.

If we could unravel and flatten out the surface of IRMOF-1 in Figure 2, the surface area contained in a single gram of it could cover more than a soccer field! This provides a lot of surface area on which gas molecules can adsorb. These high surface areas are part of the reason that nanoporous materials are so promising for many engineering applications.

Figure 2: A nanoporous material can be abstracted as a raveled-up surface. On the left is the unit cell of the IRMOF-1 crystal structure. On the right is a depiction of the surface that IRMOF-1 forms.

Continue reading


Voting and Shuffling to Optimize Atomic Operations

Some years ago I started work on my first CUDA implementation of the Multiparticle Collision Dynamics (MPC) algorithm, a particle-in-cell code used to simulate hydrodynamic interactions between solvents and solutes. As part of this algorithm, a number of particle parameters are summed to calculate certain cell parameters. This was in the days of the Tesla GPU architecture (such as GT200 GPUs, Compute Capability 1.x), which had poor atomic operation performance. A linked-list approach I developed as an alternative to atomic adds worked well on Tesla and Fermi, but performed poorly on Kepler GPUs. However, atomic operations are much faster on the Kepler and Maxwell architectures, so it makes sense to use atomic adds.

These types of summations are not limited to MPC or particle-in-cell codes, but, to some extent, occur whenever data elements are aggregated by key. For data elements sorted and combined by key with a large number of possible values, pre-combining elements with the same key at warp level can lead to a significant speed-up. In this post, I will describe algorithms for speeding up your summations (or similar aggregations) for problems with a large number of keys where there is a reasonable correlation between the thread index and the key. This is usually the case for elements that are at least partially sorted. Unfortunately, this argument works in both directions: these algorithms are not for you if your number of keys is small or your distribution of keys is random.  To clarify: by a “large” number of keys I mean more than could be handled if all bins were put into shared memory.
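
To give a flavor of what such a warp-level pre-combine can look like, here is a simplified, hedged sketch (not the exact code developed later in this post). It assumes that equal keys occupy consecutive lanes within a warp, that the block size is a multiple of 32, and that every key is a valid bin index; each warp then issues only one atomicAdd per distinct key it holds.

    // Simplified sketch of a warp-aggregated atomic reduction by key.
    // Assumptions (see text): equal keys sit in consecutive lanes of a warp,
    // blockDim.x is a multiple of 32, and keys are valid bin indices.
    #include <cuda_runtime.h>

    __device__ void atomicAddByKey(float* bins, int key, float val) {
        const unsigned FULL = 0xffffffffu;
        const int lane = threadIdx.x & 31;

        // A lane is the "head" of a run if its key differs from the lane below it.
        int prevKey = __shfl_up_sync(FULL, key, 1);
        bool head = (lane == 0) || (prevKey != key);

        // Bitmask of all head lanes in the warp; my run ends at the next head.
        unsigned headMask = __ballot_sync(FULL, head);
        unsigned headsAbove = headMask & ~((2u << lane) - 1u);
        int runEnd = headsAbove ? __ffs(headsAbove) - 1 : 32;

        // Segmented shuffle reduction: fold in values from higher lanes of my run.
        for (int offset = 1; offset < 32; offset <<= 1) {
            float other = __shfl_down_sync(FULL, val, offset);
            if (lane + offset < runEnd) val += other;
        }

        // One atomic per distinct key per warp instead of one per thread.
        if (head) atomicAdd(&bins[key], val);
    }

    __global__ void sumByKey(const int* keys, const float* vals, float* bins, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        // Keep whole warps active for the shuffles: out-of-range lanes
        // contribute 0.0f to bin 0, which is harmless.
        int   key = (i < n) ? keys[i] : 0;
        float val = (i < n) ? vals[i] : 0.0f;
        atomicAddByKey(bins, key, val);
    }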

Note that this technique is related to a previously posted technique called warp-aggregated atomics by Andrey Adinetz, and also to the post Fast Histograms Using Shared Atomics on Maxwell by Nikolay Sakharnykh. The main difference here is that we are aggregating many groups, each designated by a key (to compute a histogram, for example). So you could consider this technique “warp-aggregated atomic reduction by key”. Continue reading


New Features in CUDA 7.5

Today I’m happy to announce that the CUDA Toolkit 7.5 Release Candidate is now available. CUDA Toolkit 7.5 adds support for FP16 data storage (for up to 2x larger data sets and reduced memory bandwidth), new cuSPARSE GEMVI routines, instruction-level profiling, and more. Read on for full details.

16-bit Floating Point (FP16) Data

CUDA 7.5 expands support for 16-bit floating point (FP16) data storage and arithmetic, adding new half and half2 datatypes and intrinsic functions for operating on them. 16-bit “half-precision” floating point types are useful in applications that can process larger datasets or gain performance by choosing to store and operate on lower-precision data. Some large neural network models, for example, may be constrained by available GPU memory; and some signal processing kernels (such as FFTs) are bound by memory bandwidth.

Many applications can benefit by storing data in half precision and processing it in 32-bit (single) precision. At GTC 2015 in March, NVIDIA CEO Jen-Hsun Huang announced that future Pascal architecture GPUs will include full support for such “mixed precision” computation, with FP16 (half) computation at higher throughput than FP32 (single) or FP64 (double).

With CUDA 7.5, applications can benefit by storing up to 2x larger models in GPU memory. Applications that are bottlenecked by memory bandwidth may get up to 2x speedup. And applications on Tegra X1 GPUs bottlenecked by FP32 computation may benefit from 2x faster computation on half2 data.
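
As a small illustration of the “store in FP16, compute in FP32” pattern, here is a hedged SAXPY-style sketch using the half2 type and the conversion intrinsics from cuda_fp16.h (the kernel and its names are illustrative, not a library routine):

    // Illustrative "FP16 storage, FP32 math" kernel (not a library routine).
    // Each half2 packs two 16-bit values, so n2 is the number of half2 elements.
    #include <cuda_fp16.h>

    __global__ void saxpy_fp16_storage(int n2, float a, const half2* x, half2* y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n2) {
            float2 xf = __half22float2(x[i]);   // widen two packed FP16 values to FP32
            float2 yf = __half22float2(y[i]);
            yf.x = a * xf.x + yf.x;             // do the arithmetic in single precision
            yf.y = a * xf.y + yf.y;
            y[i] = __float22half2_rn(yf);       // narrow back to FP16 (round to nearest)
        }
    }

On GPUs with native FP16 arithmetic (such as Tegra X1), the two single-precision multiply-adds above could instead be replaced by a single __hfma2 on the packed half2 values.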

CUDA 7.5 provides 3 main FP16 features: Continue reading

BIDMach: Machine Learning at the Limit with GPUs

Deep learning has made enormous leaps forward thanks to GPU hardware. But much Big Data analysis is still done with classical methods on sparse data. Tasks like click prediction, personalization, recommendation, and search ranking still account for most of the revenue from commercial data analysis. The role of GPUs in that realm has been less clear. In the BIDMach project (part of the BID Data Project at UC Berkeley), we have been exploring general machine learning with GPUs. The results are remarkable: not only do we see order-of-magnitude speedups for most problems, but our system also outperforms today’s cluster computing systems running up to several hundred nodes on typical workloads. Beyond the case for adopting GPU technology for deep learning tasks, there is now a strong incentive for organizations to migrate to GPUs for the remainder of their analytics workload.

Roofline Design

To build the fastest system, we borrowed the approach of roofline design from computer architecture. Roofline design involves designing to fundamental limits (e.g. ALU throughput, memory speed, network speed, I/O speed, etc.). A rooflined system is fast, and no other system can be much faster, since both have to respect the same hardware limits. A roofline diagram for matrix multiply is shown below:

Figure 1: CPU and GPU roofline limits

The y-axis shows the potential throughput in arithmetic operations/second. The x-axis is “operational intensity”, the number of operations applied to each data value (in units of operations per byte). The intensity is much lower for sparse operations: sparse matrix multiply typically involves only a multiply and an add for each input datum, while dense matrix multiply uses each datum many times. The horizontal lines reflect the maximum ALU throughput for each type of processor (the graph is drawn for an Intel i7 and an NVIDIA GeForce GTX 680). GPUs have much higher ALU throughput since most of the GPU chip area is devoted to ALUs. For dense matrix multiply, GPUs are about 10x faster thanks to this greater compute area.

The diagonal lines reflect memory bandwidth. Since bandwidth is a flow in bytes/second, it defines a linear relationship between the x-axis (flops/byte) and the y-axis (flops/sec). A less well-known feature of GPUs is their higher main-memory bandwidth. This leads to a (potential) 10x gap in sparse matrix operations, which are the most important for many machine learning tasks. cuSPARSE achieves this ceiling for typical scientific data, but we found it less well suited to very sparse data (text, web logs, etc.). We wrote our own sparse kernels and were able to get them close to the roofline limits over a full range of sparseness. These kernels form the basis for the high throughput of most of BIDMach’s algorithms.
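
The roofline bound itself is just the minimum of two terms, which the toy snippet below makes explicit: the diagonal applies when bandwidth times intensity is below the ALU ceiling, and the horizontal roof applies otherwise. The numbers are illustrative, roughly GTX-680-class, not measured values.

    // Toy roofline calculation: attainable throughput is capped either by the
    // ALU ceiling (horizontal line) or by bandwidth * intensity (diagonal line).
    #include <algorithm>
    #include <cstdio>

    double roofline_gflops(double peak_gflops, double bw_gb_per_s, double intensity_flops_per_byte) {
        return std::min(peak_gflops, bw_gb_per_s * intensity_flops_per_byte);
    }

    int main() {
        const double peak = 3000.0, bw = 190.0;   // illustrative GPU-like numbers
        // Sparse multiply-add: roughly 2 flops per ~8 bytes of value+index data.
        printf("sparse (0.25 flop/B): %.0f GFLOP/s\n", roofline_gflops(peak, bw, 0.25));
        // Dense matrix multiply reuses each datum many times.
        printf("dense  (32.0 flop/B): %.0f GFLOP/s\n", roofline_gflops(peak, bw, 32.0));
        return 0;
    }

Continue reading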


CUDA 7 Release Candidate Feature Overview: C++11, New Libraries, and More

It’s almost time for the next major release of the CUDA Toolkit, so I’m excited to tell you about the CUDA 7 Release Candidate, now available to all CUDA Registered Developers. The CUDA Toolkit version 7 expands the capabilities and improves the performance of the Tesla Accelerated Computing Platform and of accelerated computing on NVIDIA GPUs.

Recently NVIDIA released the CUDA Toolkit version 5.5 with support for the IBM POWER architecture. Starting with CUDA 7, all future CUDA Toolkit releases will support POWER CPUs.

CUDA 7 is a huge update to the CUDA platform; there are too many new features and improvements to describe in one blog post, so I’ll touch on some of the most significant ones today. Please refer to the CUDA 7 release notes and documentation for more information. We’ll be covering many of these features in greater detail in future Parallel Forall posts, so check back often!

Support for Powerful C++11 Features

C++11 is a major update to the popular C++ language standard. C++11 includes a long list of new features for simpler, more expressive C++ programming with fewer errors and higher performance. I think Bjarne Stroustrup, the creator of C++, put it best:

“C++11 feels like a new language: The pieces just fit together better than they used to and I find a higher-level style of programming more natural than before and as efficient as ever.”
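
As a small, illustrative example (not taken from the CUDA samples), device code compiled with nvcc --std=c++11 can freely use features like auto, local lambdas, and range-based for:

    // Illustrative C++11-in-device-code sketch; the kernel and names are made up.
    __global__ void count_vowels(const char* text, int n, int* count) {
        // A local lambda and range-based for, both usable in device code.
        auto is_vowel = [](char c) {
            const char vowels[] = {'a', 'e', 'i', 'o', 'u'};
            for (auto v : vowels)
                if (c == v) return true;
            return false;
        };
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && is_vowel(text[i]))
            atomicAdd(count, 1);
    }
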
Continue reading


3 Versatile OpenACC Interoperability Techniques

OpenACC is a high-level programming model for accelerating applications with GPUs and other devices using compiler directives, which specify loops and regions of code in standard C, C++ and Fortran to offload from a host CPU to an attached accelerator. OpenACC simplifies accelerating applications with GPUs. An often-overlooked feature of OpenACC is its ability to interoperate with the broader parallel programming ecosystem. In this post I’ll teach you 3 powerful interoperability techniques for combining OpenACC and CUDA: the host_data construct, the deviceptr clause, and the acc_map_data() API function.

I’ll demonstrate these techniques with several examples of mixing OpenACC with CUDA C++, CUDA Fortran, Thrust, and GPU-accelerated libraries. If you’d like to follow along at home, grab the source code for the examples from GitHub and try them out with your OpenACC compiler and the CUDA Toolkit. Don’t have an OpenACC compiler? You can download a free 30-day trial of the PGI accelerator compiler.
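
As a quick taste of the first technique, the hedged sketch below (error checking omitted) uses an OpenACC data region plus host_data use_device to hand the device copies of two arrays directly to cuBLAS:

    // Hedged sketch: pass OpenACC-managed device arrays to cuBLAS via host_data.
    // Build with an OpenACC compiler (e.g. PGI) and link against cuBLAS.
    #include <cublas_v2.h>
    #include <cstdio>

    int main() {
        const int n = 1 << 20;
        static float x[1 << 20], y[1 << 20];
        float a = 2.0f;
        for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

        cublasHandle_t handle;
        cublasCreate(&handle);

        #pragma acc data copyin(x[0:n]) copy(y[0:n])
        {
            // Inside host_data, x and y refer to the device copies created by
            // the OpenACC runtime, so they can go straight to a CUDA library.
            #pragma acc host_data use_device(x, y)
            {
                cublasSaxpy(handle, n, &a, x, 1, y, 1);
            }
        }

        cublasDestroy(handle);
        printf("y[0] = %f (expect 4.0)\n", y[0]);
        return 0;
    }
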

You may already be thinking to yourself, “If OpenACC is so great, why would I need to use it with CUDA?” OpenACC interoperability features open the door to the GPU-computing ecosystem, allowing you to leverage more than 10 years of code development. Need to multiply two matrices together? Don’t write your own function, just call the cuBLAS library, which has been heavily optimized for GPUs. Does your colleague already have a CUDA routine that you could use in your code? Use it! Interoperability means that you can always use the best tool for the job in any situation. Accelerate your application using OpenACC, but call an optimized library. Expand an existing CUDA application by adding OpenACC to unaccelerated routines. Your choice isn’t OpenACC or CUDA, it’s OpenACC and CUDA. Continue reading