CUDA Pro Tip: Optimized Filtering with Warp-Aggregated Atomics

In this post, I’ll introduce warp-aggregated atomics, a useful technique to improve performance when many threads atomically add to a single counter. In warp aggregation, the threads of a warp first compute a total increment among themselves, and then elect a single thread to atomically add the increment to a global counter. This aggregation reduces the number of atomics performed by up to the number of threads in a warp (up to 32x on current GPUs), and can dramatically improve performance. Moreover, in many typical cases, you can implement warp aggregation as a drop-in replacement for standard atomic operations, so it is useful as a simple way to improve performance of complex applications.

Problem: Filtering by a Predicate

Let’s consider the following problem, which we call filtering: we have a source array, src, containing n elements, and a predicate, and we need to copy all elements of src satisfying the predicate into the destination array, dst. For the sake of simplicity, assume that dst has length of at least n and that the order of elements in the dst array does not matter. For our example, we assume that the array elements are integers, and the predicate is true if and only if the element is positive. Here is a sample CPU implementation of filtering.

int filter(int *dst, const int *src, int n) {
  int nres = 0;
  for (int i = 0; i < n; i++)
    if (src[i] > 0)
      dst[nres++] = src[i];
  // return the number of elements copied
  return nres;

Filtering, also known as stream compaction, is a common operation, and it is a part of the standard libraries of many programming languages, where it goes under a variety of names, including grep, copy_if, select, FindAll and so on. It is also very often implemented simply as a loop, as it may be very tightly integrated with the surrounding code. Continue reading


CUDA Pro Tip: Use cuFFT Callbacks for Custom Data Processing

Digital signal processing (DSP) applications commonly transform input data before performing an FFT, or transform output data afterwards. For example, if the input data is supplied as low-resolution samples from an 8-bit analog-to-digital (A/D) converter, the samples may first have to be expanded into 32-bit floating point numbers before the FFT and the rest of the processing pipeline can start.

The cuFFT library included with CUDA 6.5 introduces device callbacks to improve performance of this sort of transforms. Callback routines are user-supplied device functions that cuFFT calls when loading or storing data. You can use callbacks to implement many pre- or post-processing operations that required launching separate CUDA kernels before CUDA 6.5.

Example DSP Pipeline

In this blog post we will implement the first stages of a typical DSP pipeline as depicted in Figure 1. We will first discuss a solution without callbacks using multiple custom kernels which we then use as a stepping stone towards a solution based on cuFFT device callbacks. The source code for both versions is available on github.

Figure 1: The processing pipeline for our example before and with CUDA 6.5 callbacks.

Batches of 8-bit fixed-point samples are input to the DSP pipline from an A/D converter. Each sample consists of 1024 data points. For more efficient processing, we group samples into batches of 1000 samples each. Therefore, you can think of this input as a 1000×1024 matrix of 8-bit fixed-point values. Continue reading


Maxwell: The Most Advanced CUDA GPU Ever Made

Today NVIDIA introduced the new GM204 GPU, based on the Maxwell architecture. GM204 is the first GPU based on second-generation Maxwell, the full realization of the Maxwell architecture. The GeForce GTX 980 and 970 GPUs introduced today are the most advanced gaming and graphics GPUs ever made. But of course they also make fantastic CUDA development GPUs, with full support for CUDA 6.5 and all of the latest features of the CUDA platform, including Unified Memory and Dynamic Parallelism.

GM204′s 16 SMs make it over 3 times faster than the first-generation GM107 GPU that I introduced earlier this year on Parallel Forall, and additional architectural improvements help GM204 pack an even bigger punch.

SMM: The Maxwell Multiprocessor

Figure 1: Maxwell’s Multiprocessor, SMM.

As I discussed in my earlier Maxwell post, the heart of Maxwell’s power-efficient performance is it’s Streaming Multiprocessor, known as SMM. Maxwell’s new datapath organization and improved instruction scheduler provide more than 40% higher delivered performance per CUDA core, and overall twice the efficiency of Kepler GK104. The new SMM, shown in Figure 1, includes all of the architectural benefits of its first-generation Maxwell predecessor, including improvements to control logic partitioning, workload balancing, clock-gating granularity, instruction scheduling, number of instructions issued per clock cycle, and more.  

SMM uses a quadrant-based design with four 32-core processing blocks each with a dedicated warp scheduler capable of dispatching two instructions per clock. Each SMM provides eight texture units, one polymorph engine (geometry processing for graphics), and dedicated register file and shared memory.

Continue reading


CUDA Spotlight: Dr. Cris Cecka on GPU-Accelerated Computational Mathematics

Cris-Cecka-Harvard1Our Spotlight is on Dr. Cris Cecka, a research scientist and lecturer in the new Institute for Applied Computational Science (IACS) at Harvard University. Harvard has been a CUDA Center of Excellence since 2009, led by Dr. Hanspeter Pfister, IACS Director. Cris is currently also performing research with the Mathematics Department at the Massachusetts Institute of Technology. Previously, Cris was a graduate student in the Institute for Computational and Mathematical Engineering (ICME) at Stanford University with Prof. Eric Darve.

The following is an excerpt from our interview (read the complete Spotlight here).

NVIDIA: Cris, what are your primary research interests?
Cris: My research focuses on computational mathematics, particularly for interdisciplinary applications in science and engineering. In the past, I’ve used CUDA for non-linear PDEs (partial differential equations) and real-time computing with applications in simulation and virtual surgery.

More recently, I have become interested in mathematical and computational abstractions to produce efficient, library-quality scientific software. Specifically, I have focused on generalized n-body problems, including integral equation methods, particle methods, and structured dense matrices.

As part of my work, I’ve released several software libraries, including FMMTL to aid in the research, development, and use of kernel matrices and CrowdCL to aid in the use of GPU computing within a browser.

NVIDIA: Tell us more about FMMTL. Is it GPU-accelerated?

FMMTL Error Plot
FMMTL Error Plot

Cris: FMMTL is a research code that is exploring fast algorithms (like Treecode, FMM, H-matrix, and Butterfly) for kernel matrices and other structured dense matrices. Why structured? Well, plenty of algorithms exist for dense matrices, e.g. all of BLAS and LAPACK. These use values of the matrix to compute products, eigenvalues, factorizations, etc. But there are huge classes of problems where we never actually want to construct all of the elements of the matrix — generalized n-body problems — and can be accelerated either by compressing rows, columns, or blocks of the matrix or by avoiding computing elements of the matrix all-together.

By avoiding the computation of all of the elements or delaying the computation until the matrix element is requested, the amount of data required to define the matrix is reduced to O(N), which is great in terms of computational intensity! There is very little data to access and lots and lots of computation. Continue reading


Accelerate Machine Learning with the cuDNN Deep Neural Network Library

Machine Learning (ML) has its origins in the field of Artificial Intelligence, which started out decades ago with the lofty goals of creating a computer that could do any work a human can do.  While attaining that goal still appears to be in the distant future, many useful tools have been developed and successfully applied to a wide variety of problems.  In fact, ML has now become a pervasive technology, underlying many modern applications.  Today the world’s largest financial companies, internet firms and foremost research institutions are using ML in applications including internet search, fraud detection, gaming, face detection, image tagging, brain mapping, check processing and computer server health-monitoring, to name a few.  The US Postal Service uses machine learning techniques for hand-writing recognition, and leading applied-research government agencies such as IARPA and DARPA are funding work to develop the next generation of ML systems.

Figure 1: :  Schematic representation of a deep neural network, showing how more complex features are captured in deeper layers.
Figure 1: : Schematic representation of a deep neural network, showing how more complex features are captured in deeper layers.

There is a wide variety of algorithms and processes for implementing ML systems. The hottest area in ML today however, is the area of Deep Neural Networks (DNNs).  The success of DNNs has been greatly accelerated by using GPUs, which have become the platform of choice for training large, complex DNN-based ML systems. Pioneers in this area include luminaries like Geoffrey Hinton, Yann LeCun, Yoshua Bengio, and Andrew Ng.  Their success over the past 30 years has inspired a groundswell of research and development in academia, including universities such as Carnegie Mellon, NYU, Oxford, Stanford, University of California at Berkeley, University of Montreal, and the University of Toronto. More recently, many commercial enterprises have also started investing aggressively in this technology.  A few that have publicly acknowledged using GPUs with deep learning include Adobe, Baidu, Nuance, and Yandex.

Because of the increasing importance of DNNs in both industry and academia and the key role of GPUs, NVIDIA is introducing a library of primitives for deep neural networks called cuDNN.  The cuDNN library makes it easy to obtain state-of-the-art performance with DNNs, and provides other important benefits.

Machine Learning with DNNs

A ML system may be thought of as a system that learns to recognize things of interest to us, without being told explicitly what the things are ahead of time. Classic examples of such a system are the spam classifier, which scans your incoming messages and quarantines spam emails, and product recommender systems which suggest new products (books, movies, etc.) that you might like based on your prior purchases and ratings. Continue reading


CUDA Pro Tip: Always Set the Current Device to Avoid Multithreading Bugs

We often say that to reach high performance on GPUs you should expose as much parallelism in your code as possible, and we don’t mean just parallelism within one GPU, but also across multiple GPUs and CPUs. It’s common for high-performance software to parallelize across multiple GPUs by assigning one or more CPU threads to each GPU. In this post I’ll cover a common but subtle bug and a simple rule that will help you avoid it within your own software (spoiler alert: it’s in the title!).

Let’s review how to select which GPU to execute CUDA calls on. The CUDA runtime API is state-based, and threads execute cudaSetDevice() to set the current GPU.

cudaError_t cudaSetDevice(int device)

After this call all CUDA API commands go to the current set device until cudaSetDevice() is called again with a different device ID. The CUDA runtime API is thread-safe, which means it maintains per-thread state about the current device. This is very important as it allows threads to concurrently submit work to different devices, but forgetting to set the current device in each thread can lead to subtle and hard-to-find bugs like the following example.


#pragma omp parallel

While at first glance this code may seem bug free, it is incorrect. Continue reading


3 Versatile OpenACC Interoperability Techniques

OpenACC is a high-level programming model for accelerating applications with GPUs and other devices using compiler directives compiler directives to specify loops and regions of code in standard C, C++ and Fortran to offload from a host CPU to an attached accelerator. OpenACC simplifies accelerating applications with GPUs. An often-overlooked feature of OpenACC is its ability to interoperate with the broader parallel programming ecosystem. In this post I’ll teach you 3 powerful interoperability techniques for combining OpenACC and CUDA: the host_data construct, the deviceptr clause, and the acc_map_data() API function.

OpenACC InteropI’ll demonstrate these techniques with several examples of mixing OpenACC with CUDA C++, CUDA Fortran, Thrust, and GPU-accelerated libraries. If you’d like to follow along at home, grab the source code for the examples from Github and try them out with your OpenACC compiler and the CUDA Toolkit. Don’t have an OpenACC compiler? You can download a free 30-day trial of the PGI accelerator compiler.

You may already be thinking to yourself, “If OpenACC is so great, why would I need to use it with CUDA?” OpenACC interoperability features open the door to the GPU-computing ecosystem, allowing you to leverage more than 10 years of code development. Need to multiply two matrices together? Don’t write your own function, just call the cuBLAS library, which has been heavily optimized for GPUs. Does your colleague already have a CUDA routine that you could use in your code? Use it! Interoperability means that you can always use the best tool for the job in any situation. Accelerate your application using OpenACC, but call an optimized library. Expand an existing CUDA application by adding OpenACC to unaccelerated routines. Your choice isn’t OpenACC or CUDA, it’s OpenACC and CUDA. Continue reading


CUDACasts Episode 20: Getting started with Jetson TK1 and OpenCV

TK1_Dev_Kit-6350-GREEN-V4_280The Jetson TK1 development kit has fast become a must-have for mobile and embedded parallel computing due the amazing level of performance packed into such a low-power board. In this and the following CUDACast, you’ll learn how to get started building computer vision applications on your Jetson TK1 using CUDA and the OpenCV library.

CUDACasts are short how-to screencast videos about new features and techniques for GPU programming. Click here for all CUDACasts.

Continue reading


Remote application development using NVIDIA® Nsight™ Eclipse Edition

NVIDIA® Nsight™ Eclipse Edition (NSEE) is a full-featured unified CPU+GPU integrated development environment(IDE) that lets you easily develop CUDA applications for either your local (x86_64) system or a remote (x86_64 or ARM) target system. In my last post on remote development of CUDA applications, I covered NSEE’s cross compilation mode. In this post I will focus on the using NSEE’s synchronized project mode.

For remote development of CUDA applications using synchronized-project mode, you can edit code on the host system and synchronize it with the target system. In this scenario, the code is compiled natively on the target system as Figure 1 shows.

CUDA application development usage scenarios with Nsight Eclipse Edition
Figure 1: CUDA application development usage scenarios with Nsight Eclipse Edition

In synchronized project mode the host system does not need an ARM cross-compilation tool chain, so you have the flexibility to use Mac OS X or any of the CUDA supported x86_64 Linux platforms as the host system. The remote target system can be a CUDA-supported x86_64 Linux target or an ARM-based platform like the Jetson TK1 system. I am using Mac OS X 10.8.5 on my host system (with Xcode 5.1.1 installed) and 64-bit Ubuntu 12.04 on my target system. Continue reading


CUDA Spotlight: Michela Taufer on GPU-Accelerated Scientific Computing

TauferMichela_112112Our Spotlight is on Dr. Michela Taufer, Associate Professor at the University of Delaware.

Michela heads the Global Computing Lab (GCLab), which focuses on high performance computing (HPC) and its application to the sciences.

Her research interests include software applications and their advanced programmability in heterogeneous computing (i.e., multi-core platforms and GPUs); cloud computing and volunteer computing; and performance analysis, modeling and optimization of multi-scale applications.

The following is an excerpt from our interview (read the complete Spotlight here).

NVIDIA: Michela, what is the mission of the Global Computing Lab at the University of Delaware?
Michela: We are engaged in the design and testing of efficient computational algorithms and adaptive scheduling policies for scientific computing on GPUs, the Cloud, and Volunteer Computing.

Interdisciplinary research with scientists and engineers in fields such as chemistry and chemical engineering, pharmaceutical sciences, seismology, and mathematics is at the core of our activities and philosophy.

NVIDIA: Tell us about your work with GPUs.
Michela: My team’s work is all about rethinking application algorithms to fit on the GPU architecture in order to get the most out of its computing power, while preserving the scientific accuracy of the simulations. This has resulted in many exciting achievements!

NVIDIA: Can you provide an example?
Michela: My group and I were the first to propose a completely-on-GPU PME (Particle Mesh Ewald) code for MD (molecular dynamics) simulations. We achieved that goal by changing the traditional way researchers algorithmically look at charges in long-range electrostatics and their interactions.

With our code empowered with the PME components, we could move the traditional scale for studying membranes like DMPC lipid bilayers from membranes on the order of 72 lipid molecules (17,004 atoms) to 16-times larger membranes of 1,152 lipid molecules (27,3936 atoms) in explicit solvent [see Figure 1].

Figure 1: Visual representations of the lipid-bilayer systems. The DMPC 1x1 system describes the small system of 72 lipid molecules (36 lipids/leaflet) traditionally used for simulations on high-end clusters. DMPC 2x2 and 4x4 describe systems with 288 and 1152 lipid molecules, respectively, that we were able to study on a single GPU. Presented in Structural, Dynamic, and Electrostatic Properties of Fully Hydrated DMPC Bilayers from Molecular Dynamics Simulations Accelerated with GPUs.
Figure 1: Visual representations of the lipid-bilayer systems.
The DMPC 1×1 system describes the small system of 72 lipid molecules (36 lipids/leaflet) traditionally used for simulations on high-end clusters. DMPC 2×2 and 4×4 describe systems with 288 and 1152 lipid molecules, respectively, that we were able to study on a single GPU. Presented in Structural, Dynamic, and Electrostatic Properties of Fully Hydrated DMPC Bilayers from Molecular Dynamics Simulations Accelerated with GPUs.

Continue reading