CUDA 7

The Power of C++11 in CUDA 7

Today I’m excited to announce the official release of CUDA 7, the latest release of the popular CUDA Toolkit. Download the CUDA Toolkit version 7 now from CUDA Zone!

CUDA 7 has a huge number of improvements and new features, including C++11 support, the new cuSOLVER library, and support for Runtime Compilation. In a previous post I told you about the features of CUDA 7, so I won’t repeat myself here. Instead, I wanted to take a deeper look at C++11 support in device code.

CUDA 7 adds C++11 feature support to nvcc, the CUDA C++ compiler. This means that you can use C++11 features not only in your host code compiled with nvcc, but also in device code. New C++ language features include auto, lambda functions, variadic templates, static_assert, rvalue references, range-based for loops, and more. To enable C++11 support, pass the flag --std=c++11 to nvcc (this option is not required for Microsoft Visual Studio).
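As a quick taste before the fuller example below, here is a minimal sketch (my own, not code from the toolkit or the post) that uses brace initialization, static_assert, auto, and a range-based for loop entirely in device code; compile it with nvcc --std=c++11.

#include <cstdio>

// A minimal C++11-in-device-code sketch: brace initialization, static_assert,
// auto type deduction, and a range-based for loop inside a __global__ kernel.
__global__ void cpp11_demo()
{
    const int values[] { 1, 2, 3, 4 };
    static_assert(sizeof(values) == 4 * sizeof(int), "unexpected array size");

    auto total = 0;               // deduced as int
    for (const auto v : values)   // range-based for over a local array
        total += v;

    printf("thread %d: sum = %d\n", threadIdx.x, total);
}

int main()
{
    cpp11_demo<<<1, 4>>>();
    cudaDeviceSynchronize();
    return 0;
}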

In my earlier CUDA 7 feature overview post, I presented a small example to show some C++11 features. Let’s dive into a somewhat expanded example to show the power of C++11 for CUDA programmers. This example will proceed top-down, covering a couple of layers of abstraction that allow us to write concise, reusable C++ code for the GPU, all enabled by C++11. The complete example is available on GitHub.

Let’s say we have a very specific (albeit contrived) goal: count the number of characters from a certain set within a text. (In parallel, of course!) Here’s a simple CUDA C++11 kernel that abstracts the mechanics of this a bit.

__global__
void xyzw_frequency(int *count, char *text, int n)
{
    const char letters[] { 'x','y','z','w' };

    count_if(count, text, n, [&](char c) {
        for (const auto x : letters) 
            if (c == x) return true;
        return false;
    });
}
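The count_if function called here is defined in the full example; as a rough sketch (my own, not necessarily identical to the version on GitHub), it could be a grid-stride loop that applies the predicate and atomically increments the counter:

// A sketch of count_if as a grid-stride loop: each thread applies the
// predicate p to its elements and atomically increments the shared counter.
// Assumes *count has been zeroed before the kernel launch.
template <typename T, typename Predicate>
__device__ void count_if(int *count, T *data, int n, Predicate p)
{
    for (int i = blockDim.x * blockIdx.x + threadIdx.x;
         i < n;
         i += gridDim.x * blockDim.x)
    {
        if (p(data[i])) atomicAdd(count, 1);
    }
}

Because the predicate is a template parameter, the lambda passed from xyzw_frequency can be inlined by the compiler, so the abstraction costs nothing at run time.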

Continue reading


DIGITS: Deep Learning GPU Training System

The hottest area in machine learning today is Deep Learning, which uses Deep Neural Networks (DNNs) to teach computers to detect recognizable concepts in data. Researchers and industry practitioners are using DNNs in image and video classification, computer vision, speech recognition, natural language processing, and audio recognition, among other applications.

The success of DNNs has been greatly accelerated by using GPUs, which have become the platform of choice for training these large, complex DNNs, reducing training time from months to only a few days. The major deep learning software frameworks have incorporated GPU acceleration, including Caffe, Torch7, Theano, and CUDA-Convnet2. Because of the increasing importance of DNNs in both industry and academia and the key role of GPUs, last year NVIDIA introduced cuDNN, a library of primitives for deep neural networks.

Today at the GPU Technology Conference, NVIDIA CEO and co-founder Jen-Hsun Huang introduced DIGITS, the first interactive Deep Learning GPU Training System. DIGITS is a new system for developing, training and visualizing deep neural networks. It puts the power of deep learning into an intuitive browser-based interface, so that data scientists and researchers can quickly design the best DNN for their data using real-time network behavior visualization. DIGITS is open-source software, available on GitHub, so developers can extend or customize it or contribute to the project.

Figure 1: DIGITS console

Deep Learning is an approach to training and employing multi-layered artificial neural networks to assist in or complete a task without human intervention. DNNs for image classification typically use a combination of convolutional neural network (CNN) layers and fully connected layers made up of artificial neurons tiled so that they respond to overlapping regions of the visual field. Continue reading


GPU Pro Tip: Fast Histograms Using Shared Atomics on Maxwell

Histograms are an important data representation with many applications in computer vision, data analytics and medical imaging. A histogram is a graphical representation of the data distribution across predefined bins. The input data set and the number of bins can vary greatly depending on the domain, so let’s focus on one of the most common use cases: an image histogram using 256 bins for each color channel. Even though we’ll use a specific problem setup, the same algorithms can benefit other computational domains as well.

A basic serial image histogram computation is relatively simple. For each pixel of the image and for each RGB color channel we find a corresponding integer bin from 0 to 255 and increment its value. Atomic operations are a natural way of implementing histograms on parallel architectures. Depending on the input distribution, some bins will be used much more than others, so it is necessary to support efficient accumulation of the values across the full memory hierarchy. This is similar to reduction and scan operations, but the main challenge with histograms is that the output location for each element is not known prior to reading its value. Therefore, it is impossible to create a generic parallel accumulation scheme that completely avoids collisions. Histograms are now much easier to handle on GPU architectures thanks to the improved atomics performance in Kepler and native support of shared memory atomics in Maxwell.
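Before looking at the optimized version, a minimal single-kernel formulation (a sketch of the general idea, not the article's code) simply has every thread push its pixels' bin updates through global memory atomics:

// A naive 256-bin histogram for one 8-bit channel using global atomics.
// Sketch only: assumes bins[] holds 256 counters zeroed before launch.
__global__ void histogram_global_atomics(const unsigned char *in, int n,
                                         unsigned int *bins)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    for (; i < n; i += stride)
        atomicAdd(&bins[in[i]], 1u);
}

Its performance depends heavily on how often threads collide on the same bin, which is exactly the contention that the two-phase approach described below reduces.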

Figure 1: The two-phase parallel histogram algorithm.

Our histogram implementation has two phases and two corresponding CUDA C++ kernels, as Figure 1 shows. In the first phase each CUDA thread block processes a region of the image and accumulates a corresponding local histogram, storing the local histogram in global memory at the end of the phase. The second kernel accumulates all per-block histograms into the final histogram stored in global memory. The work separation between blocks in the first phase reduces contention when accumulating values into the same bin. Continue reading
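In that spirit, the two phases might look roughly like the following sketch (my own, using a fixed 256-bin single channel for brevity; the article's actual kernels differ in details such as multi-channel handling and tuning):

#define NUM_BINS 256

// Phase 1 (sketch): each block accumulates a private histogram in shared
// memory using shared memory atomics, then writes it to its own slice of a
// per-block array in global memory.
__global__ void histogram_per_block(const unsigned char *in, int n,
                                    unsigned int *per_block)
{
    __shared__ unsigned int smem[NUM_BINS];
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        smem[b] = 0;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    for (; i < n; i += stride)
        atomicAdd(&smem[in[i]], 1u);   // fast native shared atomics on Maxwell
    __syncthreads();

    unsigned int *out = per_block + blockIdx.x * NUM_BINS;
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        out[b] = smem[b];
}

// Phase 2 (sketch): reduce the per-block histograms into the final result,
// one thread per bin.
__global__ void histogram_accumulate(const unsigned int *per_block,
                                     int num_blocks, unsigned int *bins)
{
    int b = blockIdx.x * blockDim.x + threadIdx.x;
    if (b < NUM_BINS) {
        unsigned int total = 0;
        for (int blk = 0; blk < num_blocks; ++blk)
            total += per_block[blk * NUM_BINS + b];
        bins[b] = total;
    }
}

The first kernel relies on the fast shared memory atomics that Maxwell supports natively; the second is a simple column-wise reduction over the per-block histograms.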


HPC Visualization on NVIDIA Tesla GPUs

HPC looks very different today than it did when I was a graduate student in the mid-90s. Today’s supercomputers are many orders of magnitude faster than the machines of the 90s, and GPUs have helped push arithmetic performance on several leading systems to stratospheric levels. Unfortunately, the arithmetic performance wrought by two decades of supercomputer design has created tremendous I/O and visualization challenges that must be overcome, reflected by the famous statement:

“A supercomputer is a device for turning compute-bound problems into I/O-bound problems.” — Ken Batcher

Molecular visualization with VMD

Since 1998, I’ve been leading the development of VMD, a popular molecular visualization and analysis application that is used by scientists all over the world. Among similar programs, VMD is particularly focused on capabilities that support large-scale molecular dynamics simulations and cellular modeling.

The movies in this article are examples of the kind of visualizations we regularly produce with parallel VMD visualization runs that use OptiX and/or OpenGL running on the Tesla GPUs in the Blue Waters and Titan supercomputers. These example movies highlight the science done by my colleagues in the Theoretical and Computational Biophysics Group, led by Prof. Klaus Schulten at U. Illinois.

VMD chromatophore light harvesting visualization produced using GPU-accelerated molecular surface determination (CUDA) and parallel ray tracing (OptiX, CUDA, and MPI), running on Tesla K20X GPUs in the NCSA Blue Waters petascale supercomputer. This visualization was shown in Klaus Schulten’s National Lecture at the 59th Biophysical Society Meeting. A longer version of this movie won the SC’14 visualization and data analytics showcase, and is described in an accompanying paper.

Some key areas of our ongoing VMD development involve the continued adaptation of the program for petascale and exascale supercomputers, advancing the molecular visualization state-of-the-art with parallel and interactive ray tracing techniques, exploiting massively parallel GPU accelerators for both visualization and analysis tasks, and supporting remote visualization and collaboration on HPC platforms. The combination of these VMD development tracks and current technological progress in HPC, GPUs, and visualization algorithms is leading in a very exciting direction. Continue reading


GPU-Accelerated Graph Analytics in Python with Numba

Numba is an open-source just-in-time (JIT) Python compiler that generates native machine code for x86 CPUs and CUDA GPUs from annotated Python code. (Mark Harris introduced Numba in the post “NumbaPro: High-Performance Python with CUDA Acceleration”.) Numba specializes in Python code that makes heavy use of NumPy arrays and loops. In addition to JIT compiling NumPy array code for the CPU or GPU, Numba exposes “CUDA Python”: the CUDA programming model for NVIDIA GPUs in Python syntax.

By speeding up Python, we extend its ability from a glue language to a complete programming environment that can execute numeric code efficiently.

From Prototype to Full Dataset with @cuda.jit

When doing exploratory programming, the interactivity of IPython Notebook and a comprehensive collection of scientific libraries (e.g. SciPy, Scikit-Learn, Theano, etc.) allow data scientists to process and visualize their data quickly. There are times when a fast implementation of what you need isn’t in a library, and you have to implement something new. Numba helps by letting you write pure Python code and run it with speed comparable to a compiled language, like C++. Your development cycle shortens when your prototype Python code can scale to process the full dataset in a reasonable amount of time.

Figure 1: The DkS result of the 2012 Web Data Commons pay-level domain hyperlink graph.

Working with Dr. Alex Dimakis and his team at UT Austin, we implemented their densest-k-subgraph (DkS) algorithm [1]. Our goal was to extract the densest domain from the 2012 Web Data Commons pay-level-domain hyperlink graph using one NVIDIA Tesla K20 GPU accelerator. We developed the entire application using NumPy for array operations, Numba to JIT compile Python to CUDA, NumbaPro for GPU sorting and cuBLAS routines, and Bokeh for plotting the results. Continue reading


12 GTC 2015 Sessions Not to Miss

With one week to go until we all descend on GTC 2015, I’ve scoured through the list of Accelerated Computing sessions and put together 12 diverse “not to miss” talks you should add to your planner. This year, the conference is highlighting the revolution in Deep Learning that will affect every aspect of computing. GTC 2015 includes over 40 session categories, including deep learning and machine learning, scientific visualization, cloud computing, and HPC.

This is the place where scientists, programmers, researchers, and a myriad of creative professionals convene to tap into the power of a GPU for more than gaming. –Forbes

Tuesday, March 17

An Introduction to CUDA Programming (S5661)

1:00-2:20pm

This introductory tutorial is intended for those new to CUDA, and you will leave with the essential knowledge to start programming in CUDA – no experience is needed! For those who have prior CUDA experience, this is a great session to brush up on key concepts required for subsequent tutorials on CUDA optimization. The other tutorials in this series are: An Introduction to the GPU Memory Model, Asynchronous Operations and Dynamic Parallelism in CUDA, and Essential CUDA Optimization Techniques.

GTC attendees learn from the brightest minds in accelerated computing with hundreds of talks and hands-on tutorials.

SMTool: A GPU based Satellite Image Analysis Tool (S5201)

2:00-2:25pm

Dilip Patlolla, R&D Engineer in the Geographic Information Science and Technology (GIST) Group at the Oak Ridge National Laboratory, will demonstrate SMTool, their advanced satellite image analysis tool built on the CUDA platform, which processes city-scale, sub-meter-resolution satellite imagery to detect and discriminate man-made structures. Continue reading


Understanding Natural Language with Deep Neural Networks Using Torch

This post was co-written by Soumith Chintala and Wojciech Zaremba of Facebook AI Research.

Language is the medium of human communication. Giving machines the ability to learn and understand language enables products and possibilities that are not imaginable today.

One can understand language at varying granularities. When you learn a new language, you start with words: understanding their meaning, identifying similar and dissimilar words, and developing a sense of contextual appropriateness of a word. You start with a small dictionary of words, building up your dictionary over time, mentally mapping each newly learned word close to similar words in your dictionary. Once you get familiar with your dictionary of words, you put them together into small sentences, learning grammar and structure. You eventually combine sentences in a sensible way, to write paragraphs and pages. Once you get to this stage, you are comfortable with expressing complicated thoughts in language, letting others understand your thoughts and expression.

As an example, language understanding gives one the ability to understand that the sentences “I’m on my way home.” and “I’m driving back home.” both convey that the speaker is going home.

Word Maps and Language Models

For a machine to understand language, it first has to develop a mental map of words, their meanings, and interactions with other words. It needs to build a dictionary of words, and understand where they stand semantically and contextually, compared to other words in its dictionary. To achieve this, each word is mapped to a set of numbers in a high-dimensional space, which are called “word embeddings”. Similar words are close to each other in this number space, and dissimilar words are far apart. Some word embeddings encode mathematical properties such as addition and subtraction (for some examples, see Table 1).

Word embeddings can either be learned in a general-purpose fashion beforehand by reading large amounts of text (like Wikipedia), or learned specially for a particular task (like sentiment analysis). We go into a little more detail on learning word embeddings in a later section.

Table 1: Mikolov et al. [3] showcase simple additive properties of their word embeddings.

Expression                                  Nearest Token
Paris – France + Italy                      Rome
bigger – big + cold                         colder
sushi – Japan + Germany                     bratwurst
Cu – copper + gold                          Au
Windows – Microsoft + Google                Android
Montreal Canadiens – Montreal + Toronto     Toronto Maple Leafs
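To make the additive property concrete, here is a toy host-side sketch (hypothetical 3-dimensional vectors invented for illustration; real embeddings have hundreds of dimensions and are learned from text) that forms Paris – France + Italy and finds the nearest remaining token by cosine similarity:

#include <cmath>
#include <cstddef>
#include <cstdio>
#include <map>
#include <string>
#include <vector>

// Cosine similarity between two vectors of equal length.
float cosine(const std::vector<float> &a, const std::vector<float> &b)
{
    float dot = 0.f, na = 0.f, nb = 0.f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i];
        na  += a[i] * a[i];
        nb  += b[i] * b[i];
    }
    return dot / (std::sqrt(na) * std::sqrt(nb));
}

int main()
{
    // Toy "embeddings" chosen by hand, not learned from any corpus.
    std::map<std::string, std::vector<float>> emb = {
        {"Paris",  {0.9f, 0.1f, 0.2f}},
        {"France", {0.8f, 0.0f, 0.1f}},
        {"Italy",  {0.7f, 0.1f, 0.6f}},
        {"Rome",   {0.8f, 0.2f, 0.7f}},
        {"sushi",  {0.0f, 0.9f, 0.3f}}};

    // query = Paris - France + Italy
    std::vector<float> q(3);
    for (int i = 0; i < 3; ++i)
        q[i] = emb["Paris"][i] - emb["France"][i] + emb["Italy"][i];

    // Nearest token by cosine similarity, excluding the three inputs.
    std::string best;
    float best_sim = -2.f;
    for (const auto &kv : emb) {
        if (kv.first == "Paris" || kv.first == "France" || kv.first == "Italy")
            continue;
        float s = cosine(q, kv.second);
        if (s > best_sim) { best_sim = s; best = kv.first; }
    }
    printf("nearest token: %s\n", best.c_str());  // expected: Rome
    return 0;
}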

After the machine has learned word embeddings, the next problem to tackle is the ability to string words together appropriately in small, grammatically correct sentences which make sense. This is called language modeling. Language modeling is one part of quantifying how well the machine understands language. Continue reading


Deep Speech: Accurate Speech Recognition with GPU-Accelerated Deep Learning

Speech recognition is an established technology, but it tends to fail when we need it the most, such as in noisy or crowded environments, or when the speaker is far away from the microphone. At Baidu we are working to enable truly ubiquitous, natural speech interfaces. In order to achieve this, we must improve the accuracy of speech recognition, especially in these challenging environments. We set out to make progress towards this goal by applying Deep Learning in a new way to speech recognition.

Figure 1: The structure of our deep neural network, showing the layers (top to bottom) and how we parallelize training across GPUs (left to right). The fourth layer is a bidirectional recurrent layer. Blue and red arrows indicate the forward and backward direction and the communication required between GPUs in these layers.

Deep Learning has transformed many important tasks; it has been successful because it scales well: it can absorb large amounts of data to create highly accurate models. Indeed, most industrial speech recognition systems rely on Deep Neural Networks as a component, usually combined with other algorithms. Many researchers have long believed that Deep Neural Networks (DNNs) could provide even better accuracy for speech recognition if they were used for the entire system, rather than just as the acoustic modeling component. However, it has proven difficult to find an end-to-end speech recognition system based on Deep Learning that improves on the state of the art.

Model and Data Co-design

One of the reasons this has been difficult is that training these networks on large datasets is computationally very intensive. The process of training DNNs is iterative: we instantiate ideas about models in computer code that trains a model, then we train the model on a training set and test it, which gives us new ideas about how to improve the model or training set. The latency of this loop is the rate limiting step that gates progress. Our models are relatively large, containing billions of connections, and we train them on thousands of hours of data, which means that training our models takes a lot of computation. Continue reading


Learn GPU Computing with Hands-On Labs at GTC 2015

Every year NVIDIA’s GPU Technology Conference (GTC) gets bigger and better. One of the aims of GTC is to give developers, scientists, and practitioners opportunities to learn with hands-on labs how to use accelerated computing in their work. This year we are nearly doubling the amount of hands-on training provided from last year, with almost 2,400 lab hours available to GTC attendees!

We have two types of training this year at GTC: instructor-led labs and self-paced labs. And to help you keep up with one of the hottest trends in computing, this year we’re featuring a Deep Learning training track. Keep reading for details. If you haven’t registered for GTC yet this year, keep reading for a discount code.

Deep Learning Track

There is an explosion of Deep Learning topics at GTC, and it’s not limited to the keynotes, talks and tutorial sessions. We’ll feature at least six hands-on labs related to accelerating facets of Deep Learning on GPUs. From an introduction to Deep Learning on GPUs to cutting-edge techniques and tools, there will be something for everyone. Be sure to get to these labs early to get yourself a seat! Here are a few of the labs available in this track:

  • Introduction to Machine Learning with GPUs: Handwritten digit classification (S5674)
  • DIY Deep Learning for Vision with Caffe (S5647)
  • Applied Deep Learning for Vision, Natural Language and Audio with Torch7 (S5574)
  • Deep Learning with the Theano Python Library (S5732)
  • Deep Belief Networks Using ArrayFire (S5722)
  • Accelerate a Machine Learning C++ example with Thrust (S5822)

Instructor-led Labs

Just like GTC last year, there will be twenty hands-on instructor-led labs. These are 80-minute labs led by an expert on the topic. Continue reading

BIDMach: Machine Learning at the Limit with GPUs

Deep learning has made enormous leaps forward thanks to GPU hardware. But much Big Data analysis is still done with classical methods on sparse data. Tasks like click prediction, personalization, recommendation, search ranking, etc. still account for most of the revenue from commercial data analysis. The role of GPUs in that realm has been less clear. In the BIDMach project (part of the BID Data Project at UC Berkeley), we have been exploring general machine learning with GPUs. The results are remarkable: not only do we see order-of-magnitude speedups for most problems, but our system also outperforms today’s cluster computing systems running up to several hundred nodes on typical workloads. In addition to the incentives to adopt GPU technology for deep learning tasks, there is now a strong incentive for organizations to migrate to GPUs for the remainder of their analytics workloads.

Roofline Design

To build the fastest system, we borrowed the approach of roofline design from computer architecture. Roofline design involves designing to fundamental limits (e.g. ALU throughput, memory speed, network speed, I/O speed, etc.). A rooflined system is fast, and no other system can be much faster, since both have to respect the same hardware limits. A roofline diagram for matrix multiply is shown below:

Figure 1: CPU and GPU roofline limits

The y-axis shows the potential throughput in arithmetic operations/second. The x-axis is “operational intensity”, which is the number of operations applied to each data value (in units of operations per byte). The intensity is much lower for sparse operations – e.g. sparse matrix multiply typically involves only a multiply and add for each input datum, while dense matrix multiply uses each datum many times. The horizontal lines reflect the maximum ALU throughput for each type of processor (the graph is drawn for Intel i7 and NVIDIA GeForce GTX 680 processors). GPUs have much higher ALU throughput since the GPU chip area is almost entirely ALU. For dense matrix multiply, GPUs are 10x faster thanks to this higher compute area.

The diagonal lines reflect memory bandwidth. Since bandwidth is a flow in bytes/second, it defines a linear relationship between the x-axis (flops/byte) and the y-axis (flops/sec). A less well-known feature of GPUs is their higher main-memory bandwidth. This leads to a (potential) 10x gap in sparse matrix operations, which are the most important for many machine learning tasks. cuSPARSE achieves this ceiling for typical scientific data, but we found it was less well suited to very sparse data (text, web logs, etc.). We wrote our own sparse kernels and were able to get them close to the roofline limits over a full range of sparseness. These kernels form the basis for the high throughput in most of BIDMach’s algorithms. Continue reading
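The roofline bound itself is just a minimum of two ceilings: attainable throughput = min(peak ALU throughput, memory bandwidth × operational intensity). The following back-of-the-envelope sketch (with made-up device numbers, not measurements from BIDMach or the figure) shows why sparse operations end up bandwidth-bound while dense matrix multiply can reach the ALU ceiling:

#include <algorithm>
#include <cstdio>

// Roofline model: attainable throughput is the lesser of the compute ceiling
// and the bandwidth ceiling at the given operational intensity.
double roofline_gflops(double peak_gflops, double bandwidth_gbs,
                       double intensity_flops_per_byte)
{
    return std::min(peak_gflops, bandwidth_gbs * intensity_flops_per_byte);
}

int main()
{
    // Hypothetical device: 3000 GFLOP/s peak ALU, 200 GB/s memory bandwidth.
    double peak = 3000.0, bw = 200.0;

    // Sparse matrix-vector multiply: on the order of 0.2 flops per byte.
    printf("sparse: %.0f GFLOP/s (bandwidth bound)\n",
           roofline_gflops(peak, bw, 0.2));

    // Well-blocked dense matrix multiply: tens of flops per byte.
    printf("dense:  %.0f GFLOP/s (compute bound)\n",
           roofline_gflops(peak, bw, 50.0));
    return 0;
}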