cuDNN v2: Higher Performance for Deep Learning on GPUs

The cuDNN library team is excited to announce the second version of cuDNN, NVIDIA’s library of GPU-accelerated primitives for deep neural networks (DNNs). We are proud that the cuDNN library has seen broad adoption by the deep learning research community and is now integrated into major deep learning toolkits such as CAFFE, Theano and Torch. While cuDNN was conceived with developers of deep learning toolkits and systems in mind, this release is all about features and performance for the deep learning practitioner. Before we get into those details though, let’s provide some context.

Deep Learning for Big Data

Data science and machine learning have been growing rapidly in importance in recent years, along with the volume of “big data”. Machine learning provides techniques for developing systems that can automatically recognize, categorize, locate or filter the torrent of big data that flows endlessly into corporate servers (and our email inboxes). Deep neural networks (DNNs) have become an especially successful and popular technique, because DNNs are relatively straightforward to implement and scale well—the more data you throw at them the better they perform. Most importantly, DNNs are now established as the most accurate technique across a range of problems, including image classification, object detection, and text and speech recognition. In fact, research teams from Microsoft, Google and Baidu have recently shown DNNs that perform better on an image recognition task than a trained human observer!

Deep learning and machine learning have been popular topics on Parallel Forall recently, so here are some pointers to excellent recent posts for more information. The original cuDNN announcement post provides an introduction to machine learning, deep learning and cuDNN. There are excellent posts on using cuDNN with Caffe for computer vision, with Torch for natural language understanding, on how Baidu uses cuDNN for speech recognition, and on embedded deep learning on Jetson TK1. There is also a recent post about BIDMach, an accelerated framework for machine learning techniques that are not neural network-based (SVMs, K-means, linear regression and so on).

cuDNN v2: Performance for Deep Learning Practioners

The primary goal of cuDNN v2 is to improve performance and provide the fastest possible routines for training (and deploying) deep neural networks for practitioners. This release significantly improves the performance of many routines, especially convolutions. In Figure 1, you can see that cuDNN v2 is nearly 20 times faster than a modern CPU at training large deep neural networks!  Figure 1 compares speedup (relative to Caffe running on a 16-core Intel Haswell CPU) on three well-known neural network architectures: Alexnet, Caffenet and GoogLeNet. The grey bar shows the speedup of the native (legacy) Caffe GPU implementation, and the green bar shows the speedup obtained with cuDNN v2.  Note that the speedup obtained with cuDNN v2 is now 80% higher than with the legacy Caffe GPU implementation.

Figure 1: cuDNN performance comparison with CAFFE, using several well known networks. CPU is 16-core Intel Haswell E5-2698 2.3 GHz with 3.6 GHz Turbo. GPU is NVIDIA GeForce GTX TITAN X.

cuDNN v2 now allows precise control over the balance between performance and memory footprint.  Specifically, cuDNN allows an application to explicitly select one of four algorithms for forward convolution, or to specify a strategy by which the library should automatically select the best algorithm. Available strategies include “prefer fastest” and “use no additional working space”. The four forward convolution algorithms are IMPLICIT_GEMM, IMPLICIT_PRECOMP_GEMM, GEMM and DIRECT.

IMPLICIT_GEMM is the algorithm used in cuDNN v1. It is an in-place computation, and the only algorithm that supports all input sizes and configurations while using no extra working space. If your goal is to fit the largest possible neural network model into the memory of your GPU this is the recommended option.

The IMPLICIT_PRECOMP_GEMM algorithm is a modification of the IMPLICIT_GEMM approach, which uses a small amount of working space (see the Release Notes for details on how much) to achieve significantly higher performance than the original IMPLICIT_GEMM for many use cases.

The GEMM algorithm is an “im2col” approach, which explicitly expands the input data in memory and then uses a pure matrix multiplication. This algorithm requires significant working space, but in some cases it is the fastest approach. If you tell cuDNN to “prefer fastest”, it will sometimes choose this approach. You can use the SPECIFY_WORKSPACE_LIMIT instead of PREFER_FASTEST to ensure that the algorithm cuDNN chooses will not require more than a given amount of working space.

The DIRECT option is currently not implemented, so it is really just a placeholder. In a future version of cuDNN this will specify the usage of a direct convolution implementation. We will have guidelines on how this approach compares to the others when it is made available.

More Features and Capabilities for Users

Besides performance, there are other new features and capabilities in cuDNN v2 aimed at helping deep learning practitioners get the most out of their systems as easily as possible.

The cuDNN interface has been generalized to support data sets with other than two spatial dimensions (for example, 1D and 3D data). In fact, cuDNN now allows arbitrary N-dimensional tensors. This is a forward-looking change; most routines remain limited to two spatial dimensions. As a beta feature in this release, there is now support for 3D datasets (see the Release Notes for details). The cuDNN team is looking for community feedback on the importance of higher dimensional support.

Other new features include OS X support, zero-padding of borders in pooling routines (similar to what was already provided for convolutions), parameter scaling and improved support for arbitrary strides. A number of issues identified in cuDNN v1 have been resolved. cuDNN v2 will support the forthcoming Tegra X1 processor via PTX JIT compilation as well.  Please see the cuDNN Release Notes for full details on all of these important developments!

Important API Changes

Several of the improvements described above required changes to the cuDNN API. Therefore, cuDNN v2 is not a drop-in version upgrade. Applications previously using cuDNN v1 are likely to need minor changes for API compatibility with cuDNN v2. Note that the Im2Col function is exposed as a public function in cuDNN v2, but it is intended for internal use only, and it will likely be removed from the public API in the next version.

cuDNN is still less than one year old. We expect cuDNN to mature rapidly, making API changes rare in the future. The cuDNN library team genuinely appreciates all feedback from the deep learning community, and carefully considers any API change.

Try cuDNN yourself!

cuDNN is free for anyone to use for any purpose: academic, research or commercial. Just sign up for a registered CUDA developer account.  Once your account is activated, log in and you will see a link to the cuDNN download page.  You will likely want to start by reading the included User Guide. Get started with cuDNN today!


C++11 in CUDA: Variadic Templates

CUDA 7 adds C++11 feature support to nvcc, the CUDA C++ compiler. This means that you can use C++11 features not only in your host code compiled with nvcc, but also in device code. In my post “The Power of C++11 in CUDA 7” I covered some of the major new features of C++11, such as lambda functions, range-based for loops, and automatic type deduction (auto). In this post, I’ll cover variadic templates.

There are times when you need to write functions that take a variable number of arguments: variadic functions. To do this in a typesafe manner for polymorphic functions, you really need to take a variable number of types in a template. Before C++11, the only way to write variadic functions was with the ellipsis (...) syntax and the va_* facilities. These facilities did not enable type safety and can be difficult to use.

As an example, let’s say we want to abstract the launching of GPU kernels. In my case, I want to provide simpler launch semantics in the Hemi library. There are many cases where you don’t care to specify the number and size of thread blocks—you just want to run a kernel with “enough” threads to fully utilize the GPU, or to cover your data size. In that case we can let the library decide how to launch the kernel, simplifying our code. But to launch arbitrary kernels, we have to support arbitrary type signatures. Well, we can do that like this:

template <typename... Arguments>
void cudaLaunch(const ExecutionPolicy &p, 
                Arguments... args);

Here, Arguments... is a “type template parameter pack”. We can use it to refer to the type signature of our kernel function pointer f, and to the arguments of cudaLaunch. To do the same thing before C++11 (and CUDA 7) required providing multiple implementations of cudaLaunch, one for each number of arguments we wanted to support. That meant you had to limit the maximum number of arguments allowed, as well as the amount of code you had to maintain. In my experience this was prone to bugs. Here’s the implementation of cudaLaunch. Continue reading


The Power of C++11 in CUDA 7

Today I’m excited to announce the official release of CUDA 7, the latest release of the popular CUDA Toolkit. Download the CUDA Toolkit version 7 now from CUDA Zone!

LambdaCUDA 7 has a huge number of improvements and new features, including C++11 support, the new cuSOLVER library, and support for Runtime Compilation. In a previous post I told you about the features of CUDA 7, so I won’t repeat myself here. Instead, I wanted to take a deeper look at C++11 support in device code.

CUDA 7 adds C++11 feature support to nvcc, the CUDA C++ compiler. This means that you can use C++11 features not only in your host code compiled with nvcc, but also in device code. New C++ language features include auto, lambda functions, variadic templates, static_assert, rvalue references, range-based for loops, and more. To enable C++11 support, pass the flag --std=c++11 to nvcc (this option is not required for Microsoft Visual Studio).

In my earlier CUDA 7 feature overview post, I presented a small example to show some C++11 features. Let’s dive into a somewhat expanded example to show the power of C++11 for CUDA programmers. This example will proceed top-down, covering a couple of layers of abstraction that allow us to write concise, reusable C++ code for the GPU, all enabled by C++11. The complete example is available on Github.

Let’s say we have a very specific (albeit contrived) goal: count the number of characters from a certain set within a text. (In parallel, of course!) Here’s a simple CUDA C++11 kernel that abstracts the mechanics of this a bit.

void xyzw_frequency(int *count, char *text, int n)
    const char letters[] { 'x','y','z','w' };

    count_if(count, text, n, [&](char c) {
        for (const auto x : letters) 
            if (c == x) return true;
        return false;

Continue reading


DIGITS: Deep Learning GPU Training System

The hottest area in machine learning today is Deep Learning, which uses Deep Neural Networks (DNNs) to teach computers to detect recognizable concepts in data. Researchers and industry practitioners are using DNNs in image and video classification, computer vision, speech recognition, natural language processing, and audio recognition, among other applications.

The success of DNNs has been greatly accelerated by using GPUs, which have become the platform of choice for training these large, complex DNNs, reducing training time from months to only a few days. The major deep learning software frameworks have incorporated GPU acceleration, including Caffe, Torch7, Theano, and CUDA-Convnet2. Because of the increasing importance of DNNs in both industry and academia and the key role of GPUs, last year NVIDIA introduced cuDNN, a library of primitives for deep neural networks.

Today at the GPU Technology Conference, NVIDIA CEO and co-founder Jen-Hsun Huang introduced DIGITS, the first interactive Deep Learning GPU Training System. DIGITS is a new system for developing, training and visualizing deep neural networks. It puts the power of deep learning into an intuitive browser-based interface, so that data scientists and researchers can quickly design the best DNN for their data using real-time network behavior visualization. DIGITS is open-source software, available on GitHub, so developers can extend or customize it or contribute to the project.

Figure 1: DIGITS console
Figure 1: DIGITS console

Deep Learning is an approach to training and employing multi-layered artificial neural networks to assist in or complete a task without human intervention. DNNs for image classification typically use a combination of convolutional neural network (CNN) layers and fully connected layers made up of artificial neurons tiled so that they respond to overlapping regions of the visual field. Continue reading


GPU Pro Tip: Fast Histograms Using Shared Atomics on Maxwell

Histograms are an important data representation with many applications in computer vision, data analytics and medical imaging. A histogram is a graphical representation of the data distribution across predefined bins. The input data set and the number of bins can vary greatly depending on the domain, so let’s focus on one of the most common use cases: an image histogram using 256 bins for each color channel. Even though we’ll use a specific problem setup the same algorithms can benefit other computational domains as well.

A basic serial image histogram computation is relatively simple. For each pixel of the image and for each RGB color channel we find a corresponding integer bin from 0 to 255 and increment its value. Atomic operations are a natural way of implementing histograms on parallel architectures. Depending on the input distribution, some bins will be used much more than others, so it is necessary to support efficient accumulation of the values across the full memory hierarchy. This is similar to reduction and scan operations, but the main challenge with histograms is that the output location for each element is not known prior to reading its value. Therefore, it is impossible to create a generic parallel accumulation scheme that completely avoids collisions. Histograms are now much easier to handle on GPU architectures thanks to the improved atomics performance in Kepler and native support of shared memory atomics in Maxwell.

histogram algorithm
Figure 1: The two-phase parallel histogram algorithm.

Our histogram implementation has two phases and two corresponding CUDA C++ kernels, as Figure 1 shows. In the first phase each CUDA thread block processes a region of the image and accumulates a corresponding local histogram, storing the local histogram in global memory at the end of the phase. The second kernel accumulates all per-block histograms into the final histogram stored in global memory. The work separation between blocks in the first phase reduces contention when accumulating values into the same bin. Continue reading


HPC Visualization on NVIDIA Tesla GPUs

HPC looks very different today than it did when I was a graduate student in the mid-90s. Today’s supercomputers are many orders of magnitude faster than the machines of the 90s, and GPUs have helped push arithmetic performance on several leading systems to stratospheric levels. Unfortunately, the arithmetic performance wrought by two decades of supercomputer design has created tremendous I/O and visualization challenges that must be overcome, reflected by the famous statement:

“A supercomputer is a device for turning compute-bound problems into
I/O-bound problems.” — Ken Batcher

Molecular visualization with VMD

Since 1998, I’ve been leading the development of VMD, a popular molecular visualization and analysis application that is used by scientists all over the world. Among similar programs, VMD is particularly focused on capabilities that support large-scale molecular dynamics simulations and cellular modeling.

The movies in this article are examples of the kind of visualizations we regularly produce with parallel VMD visualization runs that use OptiX and/or OpenGL running on the Tesla GPUs in the Blue Waters and Titan supercomputers. These example movies highlight the science done by my colleagues in the Theoretical and Computational Biophysics Group, led by Prof. Klaus Schulten at U. Illinois.

VMD chromatophore light harvesting visualization produced using GPU-accelerated molecular surface determination (CUDA) and parallel ray tracing (OptiX, CUDA, and MPI), running on Tesla K20X GPUs in the NCSA Blue Waters petascale supercomputer. This visualization was shown in Klaus Schulten’s National Lecture at the 59th Biophysical Society Meeting. A longer version of this movie won the SC’14 visualization and data analytics showcase, and is described in an accompanying paper.

Some key areas of our ongoing VMD development involve the continued adaptation of the program for petascale and exascale supercomputers, advancing the molecular visualization state-of-the-art with parallel and interactive ray tracing techniques, exploiting massively parallel GPU accelerators for both visualization and analysis tasks, and supporting remote visualization and collaboration on HPC platforms. The combination of these VMD development tracks and current technological progress in HPC, GPUs, and visualization algorithms is leading in a very exciting direction. Continue reading

Python Logo

GPU-Accelerated Graph Analytics in Python with Numba

Numba is an open-source just-in-time (JIT) Python compiler that generates native machine code for X86 CPU and CUDA GPU from annotated Python Code. (Mark Harris introduced Numba in the post “NumbaPro: High-Performance Python with CUDA Acceleration”.) Numba specializes in Python code that makes heavy use of NumPy arrays and loops. In addition to JIT compiling NumPy array code for the CPU or GPU, Numba exposes “CUDA Python”: the CUDA programming model for NVIDIA GPUs in Python syntax.

By speeding up Python, we extend its ability from a glue language to a complete programming environment that can execute numeric code efficiently.

From Prototype to Full Dataset with @cuda.jit

When doing exploratory programming, the interactivity of IPython Notebook and a comprehensive collection of scientific libraries (e.g. SciPy, Scikit-Learn, Theano, etc.) allow data scientists to process and visualize their data quickly. There are times when a fast implementation of what you need isn’t in a library, and you have to implement something new. Numba helps by letting you write pure Python code and run it with speed comparable to a compiled language, like C++. Your development cycle shortens when your prototype Python code can scale to process the full dataset in a reasonable amount of time.

Figure 1: The DkS result of the 2012 Web Data Commons pay-level domain hyperlink graph.

Working with Dr. Alex Dimakis and his team at UT Austin, we implemented their densest-k-subgraph (DkS) algorithm [1]. Our goal was to extract the densest domain from the 2012 WebDataCommon pay-level-domain hyperlink graph using one NVIDIA Tesla K20 GPU accelerator. We developed the entire application using NumPy for array operations, Numba to JIT compile Python to CUDA, NumbaPro for GPU sorting and cuBLAS routines, and Bokeh for plotting the results. Continue reading


12 GTC 2015 Sessions Not to Miss

With one week to go until we all descend on GTC 2015, I’ve scoured through the list of Accelerated Computing sessions and put together 12 diverse “not to miss” talks you should add to your planner. This year, the conference is highlighting the revolution in Deep Learning that will affect every aspect of computing. GTC 2015 includes over 40 session categories, including deep learning and machine learning, scientific visualization, cloud computing, and HPC.

This is the place where scientists, programmers, researchers, and a
myriad of creative professionals convene to tap into the power of a GPU
for more than gaming. –Forbes

Tuesday, March 17

An Introduction to CUDA Programming (S5661)


This is the introductory tutorial intended for those new to CUDA and you will leave with the essential knowledge to start programming in CUDA – no experience is needed! For those that have prior CUDA experience, this is a great session to brush up on key concepts required for subsequent tutorials on CUDA optimization. The other tutorials in this session are: An Introduction to the GPU Memory ModelAsynchronous Operations and Dynamic Parallelism in CUDA and Essential CUDA Optimization Techniques.

GTC attendees learn from the brightest minds in accelerated computing with hundreds of talks and hands-on tutorials.
GTC attendees learn from the brightest minds in accelerated computing with hundreds of talks and hands-on tutorials.

SMTool: A GPU based Satellite Image Analysis Tool (S5201)


Dilip Patlolla, R&D Engineer in the Geographic Information Science and Technology (GIST) Group at the Oak Ridge National Laboratory, will demonstrate their advanced satellite image analytic tool referred as SMTool built on the CUDA platform to process city-scale sub-meter resolution satellite imagery to detect and discriminate man-made structures. Continue reading


Understanding Natural Language with Deep Neural Networks Using Torch

This post was co-written by Soumith Chintala and Wojciech Zaremba of Facebook AI Research.

Language is the medium of human communication. Giving machines the ability to learn and understand language enables products and possibilities that are not imaginable today.

One can understand language at varying granularities. When you learn a new language, you start with words: understanding their meaning, identifying similar and dissimilar words, and developing a sense of contextual appropriateness of a word. You start with a small dictionary of words, building up your dictionary over time, mentally mapping each newly learned word close to similar words in your dictionary. Once you get familiar with your dictionary of words, you put them together into small sentences, learning grammar and structure. You eventually combine sentences in a sensible way, to write paragraphs and pages. Once you get to this stage, you are comfortable with expressing complicated thoughts in language, letting others understand your thoughts and expression.

As an example, language understanding gives one the ability to understand that the sentences “I’m on my way home.” and “I’m driving back home.” both convey that the speaker is going home.

Word Maps and Language Models

For a machine to understand language, it first has to develop a mental map of words, their meanings and interactions with other words. It needs to build a dictionary of words, and understand where they stand semantically and contextually, compared to other words in their dictionary. To achieve this, each word is mapped to a set of numbers in a high-dimensional space, which are called “word embeddings”. Similar words are close to each other in this number space, and dissimilar words are far apart. Some word embeddings encode mathematical properties such as addition and subtraction (For some examples, see Table 1).

Word embeddings can either be learned in a general-purpose fashion before-hand by reading large amounts of text (like Wikipedia), or specially learned for a particular task (like sentiment analysis). We go into a little more detail on learning word embeddings in a later section.

Table 1: Mikolov et. al. [3] showcase simple additive properties of their word embeddings.
Expression Nearest Token
Paris – France + Italy Rome
bigger – big + cold colder
sushi – Japan + Germany bratwurst
Cu – copper + gold Au
Windows – Microsoft + Google Android
Montral Canadiens – Montreal + Toronto Toronto Maple Leafs

After the machine has learned word embeddings, the next problem to tackle is the ability to string words together appropriately in small, grammatically correct sentences which make sense. This is called language modeling. Language modeling is one part of quantifying how well the machine understands language. Continue reading


Deep Speech: Accurate Speech Recognition with GPU-Accelerated Deep Learning

Speech recognition is an established technology, but it tends to fail when we need it the most, such as in noisy or crowded environments, or when the speaker is far away from the microphone. At Baidu we are working to enable truly ubiquitous, natural speech interfaces. In order to achieve this, we must improve the accuracy of speech recognition, especially in these challenging environments. We set out to make progress towards this goal by applying Deep Learning in a new way to speech recognition.

Figure 1: The structure of our deep neural network, showing the layers (top to bottom) and how we parallelize training across GPUs (left to right). The fourth layer is a bidirectional recurrent layer. Blue and red arrows indicate the forward and backward direction and the communication required between GPUs in these layers.
Figure 1: The structure of our deep neural network, showing the layers (top to bottom) and how we parallelize training across GPUs (left to right). The fourth layer is a bidirectional recurrent layer. Blue and red arrows indicate the forward and backward direction and the communication required between GPUs in these layers.

Deep Learning has transformed many important tasks; it has been successful because it scales well: it can absorb large amounts of data to create highly accurate models. Indeed, most industrial speech recognition systems rely on Deep Neural Networks as a component, usually combined with other algorithms. Many researchers have long believed that Deep Neural Networks (DNNs) could provide even better accuracy for speech recognition if they were used for the entire system, rather than just as the acoustic modeling component. However, it has proven difficult to find an end-to-end speech recognition system based on Deep Learning that improves on the state of the art.

Model and Data Co-design

One of the reasons this has been difficult is that training these networks on large datasets is computationally very intensive. The process of training DNNs is iterative: we instantiate ideas about models in computer code that trains a model, then we train the model on a training set and test it, which gives us new ideas about how to improve the model or training set. The latency of this loop is the rate limiting step that gates progress. Our models are relatively large, containing billions of connections, and we train them on thousands of hours of data, which means that training our models takes a lot of computation. Continue reading