Inference: The Next Step in GPU-Accelerated Deep Learning

At 45 images/s/W, Jetson TX1 is super efficient at deep learning inference. Read the whitepaper.
Deep learning is revolutionizing many areas of machine perception, with the potential to impact the everyday experience of people everywhere. At a high level, working with deep neural networks is a two-stage process: First, a neural network is trained: its parameters are determined using labeled examples of inputs and desired outputs. Then, the network is deployed to run inference, using its previously trained parameters to classify, recognize and process unknown inputs.

Figure 1: Deep learning training compared to inference. In training, many inputs, often in large batches, are used to train a deep neural network. In inference, the trained network is used to discover information within new inputs that are fed through the network in smaller batches.

It is widely recognized within academia and industry that GPUs are the state of the art in training deep neural networks, due to both speed and energy efficiency advantages compared to more traditional CPU-based platforms. A new whitepaper from NVIDIA takes the next step and investigates GPU performance and energy efficiency for deep learning inference.

The results show that GPUs provide state-of-the-art inference performance and energy efficiency, making them the platform of choice for anyone wanting to deploy a trained neural network in the field. In particular, the NVIDIA GeForce GTX Titan X delivers between 5.3 and 6.7 times higher performance than the 16-core Intel Xeon E5 CPU while achieving 3.6 to 4.4 times higher energy efficiency. The NVIDIA Tegra X1 SoC also delivers impressive results, achieving higher performance (258 vs. 242 images/second) and much higher energy efficiency (45 vs. 3.9 images/second/Watt) than the state-of-the-art Intel Core i7 6700K. Continue reading


Deep Learning in a Nutshell: Core Concepts

This post is the first in a series I’ll be writing for Parallel Forall that aims to provide an intuitive and gentle introduction to deep learning. It covers the most important deep learning concepts and aims to convey an understanding of each concept rather than its mathematical and theoretical details. While mathematical terminology is sometimes necessary and can further understanding, these posts use analogies and images whenever possible to provide easily digestible bits that add up to an intuitive overview of the field of deep learning.

I wrote this series in a glossary style so it can also be used as a reference for deep learning concepts.

Part 1 focuses on introducing the main concepts of deep learning. Future posts will provide historical background and delve into the training procedures, algorithms and practical tricks that are used in training for deep learning.

Core Concepts

Machine Learning

In machine learning we (1) take some data, (2) train a model on that data, and (3) use the trained model to make predictions on new data. The process of training a model can be seen as a learning process where the model is exposed to new, unfamiliar data step by step. At each step, the model makes predictions and gets feedback about how accurate its generated predictions were. This feedback, which is provided in terms of an error according to some measure (for example distance from the correct solution), is used to correct the errors made in prediction.

The learning process is often a game of back-and-forth in the parameter space: if you tweak a parameter of the model to get one prediction right, the model may change in such a way that it gets a previously correct prediction wrong. It may take many iterations to train a model with good predictive performance. This iterative predict-and-adjust process continues until the predictions of the model no longer improve.
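To make this predict-and-adjust loop concrete, here is a minimal sketch (not taken from the post) that fits a single-parameter linear model with plain NumPy; the data and learning rate are invented purely for illustration.

```python
# A minimal sketch of the predict-and-adjust loop: fit y = w * x by repeatedly
# predicting, measuring the error, and nudging the parameter to reduce it.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])   # inputs
y = np.array([2.1, 3.9, 6.2, 7.8])   # labeled targets (roughly y = 2x)

w = 0.0               # model parameter, initially wrong
learning_rate = 0.02

for step in range(200):
    predictions = w * x                   # 1) make predictions
    error = predictions - y               # 2) feedback: distance from the correct answers
    gradient = 2.0 * np.mean(error * x)   # 3) direction in which the error grows
    w -= learning_rate * gradient         # 4) adjust the parameter to reduce the error

print(round(w, 2))   # converges to roughly 2.0
```

Each pass through the loop is one predict-and-adjust step; training stops once further steps no longer improve the predictions.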

Feature Engineering

Feature engineering is the art of extracting useful patterns from data that make it easier for machine learning models to distinguish between classes. For example, you might take the number of greenish vs. bluish pixels as an indicator of whether a picture shows a land animal or a water animal. This feature is helpful for a machine learning model because it limits the number of classes that need to be considered for a good classification. Continue reading
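As a toy illustration of this kind of hand-crafted feature, the sketch below counts greenish versus bluish pixels in an RGB image and compares them; the thresholding rule and the synthetic example image are arbitrary choices for the sketch, not part of the original post.

```python
# Hand-crafted "greenish vs. bluish" feature for a single RGB image.
import numpy as np

def green_blue_feature(image):
    """image: H x W x 3 array of RGB values in [0, 255]."""
    r, g, b = image[..., 0], image[..., 1], image[..., 2]
    greenish = np.sum((g > r) & (g > b))   # pixels dominated by green
    bluish = np.sum((b > r) & (b > g))     # pixels dominated by blue
    return greenish / max(bluish, 1)       # >1 hints at "land", <1 hints at "water"

# Synthetic, mostly-green image standing in for a real photo.
img = np.zeros((64, 64, 3), dtype=np.uint8)
img[..., 1] = 200                          # strong green channel everywhere
print(green_blue_feature(img))             # large ratio -> likely a land scene
```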


Deep Learning for Computer Vision with MATLAB and cuDNN

Deep learning is becoming ubiquitous. With recent advancements in deep learning algorithms and GPU technology, we are able to solve problems once considered impossible in fields such as computer vision, natural language processing, and robotics.

Figure 1: Pet detection and recognition system.

Deep learning uses deep neural networks, which have been around for a few decades; what’s changed in recent years is the availability of large labeled datasets and powerful GPUs. Neural networks are inherently parallel algorithms, and GPUs with thousands of cores can exploit this parallelism to dramatically reduce the computation time needed to train deep learning networks. In this post, I will discuss how you can use MATLAB to develop an object recognition system using deep convolutional neural networks and GPUs.

Why Deep Learning for Computer Vision?

Machine learning techniques use data (images, signals, text) to train a machine (or model) to perform a task such as image classification, object detection, or language translation. Classical machine learning techniques are still being used to solve challenging image classification problems. However, they don’t work well when applied directly to images, because they ignore the structure and compositional nature of images. Until recently, state-of-the-art techniques made use of feature extraction algorithms that extract interesting parts of an image as compact low-dimensional feature vectors. These were then used along with traditional machine learning algorithms.

Enter deep learning. Deep convolutional neural networks (CNNs), a specific type of deep learning algorithm, address the gaps in traditional machine learning techniques, changing the way we solve these problems. CNNs not only perform classification, but they can also learn to extract features directly from raw images, eliminating the need for manual feature extraction. For computer vision applications you often need more than just image classification; you need state-of-the-art computer vision techniques for object detection, a bit of domain expertise, and the know-how to set up and use GPUs efficiently. Through the rest of this post, I will use an object recognition example to illustrate how easy it is to use MATLAB for deep learning, even if you don’t have extensive knowledge of computer vision or GPU programming. Continue reading
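The post itself walks through this workflow in MATLAB. Purely as an illustration of the general idea (pretrained CNN features feeding a simple classifier, rather than the post's actual MATLAB code), here is a rough Python sketch; the random stand-in data and the choice of AlexNet are assumptions made for the sketch.

```python
# Illustrative sketch: reuse a pretrained CNN as a fixed feature extractor and
# train a conventional classifier on top of the extracted features.
import torch
import torchvision.models as models
from sklearn.svm import LinearSVC

cnn = models.alexnet(weights="DEFAULT").eval()   # pretrained network, used as-is

def cnn_features(batch):
    """batch: N x 3 x 224 x 224 tensor of preprocessed images."""
    with torch.no_grad():
        feats = cnn.features(batch)              # convolutional feature maps
    return torch.flatten(feats, 1).numpy()       # one flat feature vector per image

# Random stand-ins for a real labeled image set (assumption for the sketch).
train_images = torch.randn(8, 3, 224, 224)
train_labels = [0, 1, 0, 1, 0, 1, 0, 1]

classifier = LinearSVC().fit(cnn_features(train_images), train_labels)
print(classifier.predict(cnn_features(torch.randn(2, 3, 224, 224))))
```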


Mocha.jl: Deep Learning for Julia

Deep learning is becoming extremely popular due to several breakthroughs in various well-known tasks in artificial intelligence. For example, at the ImageNet Large Scale Visual Recognition Challenge, the introduction of deep learning algorithms reduced the top-5 error by 10% in 2012. Every year since then, deep learning models have dominated the challenge, significantly reducing the top-5 error rate each year (see Figure 1). In 2015, researchers trained very deep networks (for example, the Google “inception” model has 27 layers) that surpass human performance.

Figure 1: The top-5 error rate in the ImageNet Large Scale Visual Recognition Challenge has fallen rapidly since the introduction of deep neural networks in 2012.

Moreover, at this year’s Computer Vision and Pattern Recognition (CVPR) conference, deep neural networks (DNNs) were adapted to increasingly complicated tasks. For example, in semantic segmentation, instead of predicting a single category for a whole image, a DNN is trained to classify each pixel in the image, essentially producing a semantic map indicating every object and its shape and location in the given image (see Figure 2).
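To show what "classifying each pixel" produces, here is a tiny sketch (with invented shapes and class names, not drawn from the post) of how per-pixel class scores from a network become a semantic map by taking the arg-max over the class dimension.

```python
# From per-pixel class scores to a semantic map.
import numpy as np

num_classes, height, width = 3, 4, 4                   # e.g. background, person, dog
scores = np.random.randn(num_classes, height, width)   # stand-in for the DNN's output

semantic_map = scores.argmax(axis=0)                   # most likely class for every pixel
print(semantic_map.shape)                              # (4, 4): one label per pixel
print(semantic_map)                                    # grid of class indices
```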

Figure 2: An example of generating a semantic map by classifying each pixel in a source image. Source: Shuai Zheng et al. Conditional Random Fields as Recurrent Neural Networks.

Continue reading


Harnessing the Caffe Framework for Deep Visualization

Jeff Clune, Lab Director, Evolving Artificial Intelligence Laboratory at the University of Wyoming.

The need to train their deep neural networks as fast as possible led the Evolving Artificial Intelligence Laboratory at the University of Wyoming to harness the power of NVIDIA Tesla GPUs starting in 2012 to accelerate their research.

“The speedups GPUs provide for training deep neural networks are well-documented and allow us to train models in a week that would otherwise take months,” said Jeff Clune, Assistant Professor, Computer Science Department and Director of the Evolving Artificial Intelligence Laboratory. “And algorithms continuously improve. Recently, NVIDIA’s cuDNN technology allowed us to speed up our training time by an extra 20% or so.”

Clune’s lab, which focuses on evolving artificial intelligence with a major emphasis on large-scale, structurally organized neural networks, has garnered press from some of the largest media outlets, including BBC, National Geographic, NBC News, and The Atlantic, and its work was featured on the cover of Nature in May 2015.

[The following video shows off work from the Evolving AI Lab on visualizing deep neural networks. Keep reading to learn more about this work!]

For this Spotlight interview, I had the opportunity to talk with Jeff Clune and two of his collaborators, Anh Nguyen, a Ph.D. student at the Evolving AI Lab and Jason Yosinski, a Ph.D. candidate at Cornell University.

Brad: How are you using deep neural networks (DNNs)?

We have many research projects involving deep neural networks. Our deep learning publications to date involve better understanding DNNs. Our lab’s research covers: Continue reading


NVIDIA and IBM Cloud Support ImageNet Large Scale Visual Recognition Challenge

This year’s ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is about to begin. Every year, organizers from the University of North Carolina at Chapel Hill, Stanford University, and the University of Michigan host the ILSVRC, an object detection and image classification competition, to advance the fields of machine learning and pattern recognition. Competitors are given more than 1.2 million images from 1,000 different object categories, with the objective of developing the most accurate object detection and classification approach. After the data set is released, teams have roughly three months to create, refine and test their approaches. This year, the data will be released on August 14, and the final result submission deadline is November 13. Read on to learn how competing teams can get free access to the latest GPU-accelerated cloud servers from NVIDIA and IBM Cloud.

Figure 1. Examples of ImageNet images demonstrating classification with localization.

About the Challenge

Teams compete annually to develop the most accurate recognition systems, and each year the sub-tasks become more complex and challenging. Since none of the algorithms developed to date are able to classify all images correctly 100% of the time, classification accuracy is measured in terms of error rates. The winning submission is therefore the one with the lowest overall percentage of incorrectly classified images. When ILSVRC started in 2010, teams had a single assignment: classify the contents of an image (identify objects) and provide a list of the top five classifications with their respective probabilities. Since then, the complexity of assignments has grown as the organizers have added sub-tasks for classification with localization and with detection (identifying objects and specifying their locations with bounding boxes). Specifying bounding boxes is important because most pictures contain multiple objects. To illustrate the requirements of this type of task, Figure 1 shows two sample ImageNet training images annotated with their object categories and bounding boxes. Continue reading
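For reference, the top-5 error metric can be sketched in a few lines: a prediction counts as correct if the true label appears among the model's five most probable classes. The toy probabilities below are random and purely illustrative, not actual challenge results.

```python
# Top-5 error: fraction of images whose true label is missing from the five best guesses.
import numpy as np

def top5_error(probabilities, true_labels):
    """probabilities: N x num_classes array; true_labels: length-N array of class indices."""
    top5 = np.argsort(probabilities, axis=1)[:, -5:]        # five highest-scoring classes per image
    hits = [label in row for label, row in zip(true_labels, top5)]
    return 1.0 - np.mean(hits)                              # miss rate

probs = np.random.rand(10, 1000)              # 10 images, 1000 ImageNet classes
labels = np.random.randint(0, 1000, size=10)
print(top5_error(probs, labels))              # close to 0.995 for random guessing
```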


Labellio: Scalable Cloud Architecture for Efficient Multi-GPU Deep Learning

Labellio is the world’s easiest deep learning web service for computer vision. It aims to provide a deep learning environment for image data where non-experts in deep learning can experiment with their ideas for image classification applications. Watch our video embedded here to see how easy it is.

The challenges in deep learning today are not just in configuring hyperparameters or designing a suitable neural network structure; they are also in knowing how to prepare good training data to fine-tune a well-known working classification model, and in knowing how to set up a computing environment without hassle. Labellio aims to address these problems by supporting the deep learning workflow from beginning to end.

The Labellio training data collector lets users upload their own images or download images from the internet via specified URL lists or keyword searches on external services like Flickr and Bing. While some users already have domain-specific training data in hand, others don’t. Labellio’s data ingestion capabilities help both types of users start building their own deep learning models immediately. It is extremely important to assign the correct class to each training image to build an accurate classification model, so Labellio’s labelling capability helps users cleanse input training data.

Labellio helps users build and manage multiple classification models. Creating a classification model is typically not just a one-time operation; you need trial and error to develop the most accurate model by replacing some of the training data. Continue reading


Introduction to Neural Machine Translation with GPUs (part 3)

Note: This is the final part of a detailed three-part series on machine translation with neural networks by Kyunghyun Cho. You may enjoy part 1 and part 2.

In the previous post in this series, I introduced a simple encoder-decoder model for machine translation. This simple encoder-decoder model is excellent at English-French translation. However, in this post I will briefly discuss the weakness of this simple approach, and describe a recently proposed way of incorporating a soft attention mechanism to overcome it and significantly improve translation quality.

Furthermore, I will present some more recent works that utilize this neural machine translation approach to go beyond machine translation of text, such as image caption generation and video description generation. I’ll finish the blog series with a brief discussion of future research directions and a pointer to the open source code implementing these neural machine translation models.

The Trouble with Simple Encoder-Decoder Architectures

In the encoder-decoder architecture, the encoder compresses the input sequence into a fixed-size vector from which the decoder must generate a full translation. In other words, this fixed-size vector, which I’ll call the context vector, must contain every single detail of the source sentence. Intuitively, this means that the true function approximated by the encoder has to be extremely nonlinear and complicated. Furthermore, the dimensionality of the context vector must be large enough that a sentence of any length can be compressed into it.
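To see the bottleneck concretely, here is a deliberately toy-sized sketch (not from the post) of a recurrent encoder: however long the source sentence is, everything gets folded into the same fixed-size context vector.

```python
# Toy recurrent encoder: any sentence is squeezed into one fixed-size context vector.
import numpy as np

hidden_size, vocab_size = 8, 20
rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(hidden_size, vocab_size))   # input-to-hidden weights
W_h = rng.normal(scale=0.1, size=(hidden_size, hidden_size))   # hidden-to-hidden weights

def encode(source_token_ids):
    h = np.zeros(hidden_size)              # the context vector: always 8 numbers here
    for token in source_token_ids:
        x = np.zeros(vocab_size)
        x[token] = 1.0                     # one-hot encoding of the current word
        h = np.tanh(W_in @ x + W_h @ h)    # fold the word into the fixed-size state
    return h

print(encode([3, 7, 1]).shape)             # (8,) for a 3-word sentence
print(encode(list(range(20))).shape)       # still (8,) for a much longer sentence
```

The decoder only ever sees those eight numbers, which is exactly why long sentences strain a small model.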

In my paper “On the Properties of Neural Machine Translation: Encoder-Decoder Approaches”, presented at SSST-8, my coauthors and I empirically confirmed that translation quality degrades dramatically as the length of the source sentence increases when the encoder-decoder model is small. Together with the much better result of Sutskever et al. (2014), obtained using the same type of encoder-decoder architecture, this suggests that the representational power of the encoder, and hence the size of the model, must be large in order to cope with long sentences (see Figure 1).

Figure 1: Dramatic drop in performance with respect to source sentence length for a small encoder-decoder model.

Of course, a larger model implies higher computation and memory requirements. The use of advanced GPUs, such as the NVIDIA Titan X, indeed helps with computation, but not with memory (at least not yet). Onboard memory is often limited to several gigabytes, and this imposes a serious limitation on the size of the model. (Note: it is possible to overcome this issue by distributing a single model across multiple GPUs, as shown by Sutskever et al. (2014). However, let’s assume for now that we have access to a single machine with a single GPU due to space, power and other physical constraints.)

Then, the question is “can we do better than the simple encoder-decoder based model?” Continue reading


Using GPUs to Accelerate Epidemic Forecasting

Originally trained as a veterinary surgeon, Chris Jewell, a Senior Lecturer in Epidemiology at Lancaster Medical School in the UK, became interested in epidemics through his experience working on the foot and mouth disease outbreak in the UK in 2001. His work so far has been on livestock epidemics such as foot and mouth disease, theileriosis, and avian influenza with government organizations in the UK, New Zealand, Australia, and the US. Recently, he has refocused his efforts on the human field, where populations and epidemics tend to be larger and therefore need more computing grunt.

Epidemic forecasting centers around Bayesian inference on dynamical models, using Markov Chain Monte Carlo (MCMC) as the model fitting algorithm. As part of this algorithm, Chris has to calculate a statistical likelihood function which itself involves a large sum over pairs of infected and susceptible individuals. He is currently using CUDA technology to accelerate this calculation and enable real-time inference, leading to timely forecasts for informing control decisions.
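As a rough illustration of why such a likelihood is expensive, the sketch below computes a made-up pairwise "infectious pressure" term over all infected/susceptible pairs with NumPy. The distance kernel and parameters are assumptions for the sketch, not Chris Jewell's actual model, but the all-pairs structure is what makes the calculation both large and highly parallelizable on a GPU.

```python
# Illustrative pairwise term: total infectious pressure on each susceptible individual,
# summed over every infected individual, with an arbitrary distance kernel.
import numpy as np

rng = np.random.default_rng(1)
infected = rng.uniform(0, 100, size=(500, 2))       # (x, y) locations of infected units
susceptible = rng.uniform(0, 100, size=(2000, 2))   # locations of susceptible units
beta, phi = 0.01, 0.1                               # assumed transmission parameters

# Distance between every susceptible/infected pair: a 2000 x 500 matrix of terms.
d = np.linalg.norm(susceptible[:, None, :] - infected[None, :, :], axis=-1)
pressure = beta * np.exp(-phi * d).sum(axis=1)      # infectious pressure per susceptible unit

log_likelihood_term = -pressure.sum()               # one ingredient of the full likelihood
print(log_likelihood_term)
```

Because every MCMC iteration re-evaluates sums like this, offloading the pairwise work to a GPU is what makes real-time inference feasible.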

“Without CUDA technology, the MCMC is simply too slow to be of practical use during a disease outbreak,” he says. “With the 380x speedup over a single core non-vector CPU code, real-time forecasting is now a reality!”

Figure 1: Predicting the risk of infection by the tick-borne cattle parasite Theileria orientalis (Ikeda) for uninfected farms. Source: Jewell, CP and Brown RG (2015) Bayesian data assimilation provides rapid decision support for vector-borne diseases. J. Roy. Soc. Interface 12:20150367.

Continue reading


Easy Multi-GPU Deep Learning with DIGITS 2

DIGITS is an interactive deep learning development tool for data scientists and researchers, designed for rapid development and deployment of an optimized deep neural network. NVIDIA introduced DIGITS in March 2015, and today we are excited to announce the release of DIGITS 2, which includes automatic multi-GPU scaling. Whether you are developing an optimized neural network for a single data set or training multiple networks on many data sets, DIGITS 2 makes it easier and faster to develop optimized networks in parallel with multiple GPUs.

Deep learning uses deep neural networks (DNNs) and large datasets to teach computers to detect recognizable concepts in data, translate or understand natural languages, interpret information from input data, and more. Deep learning is being used in the research community and in industry to help solve many big data problems such as similarity searching, object detection, and localization. Practical examples include vehicle, pedestrian and landmark identification for driver assistance; image recognition; speech recognition; natural language processing; neural machine translation; and mitosis detection.

This is a short sample clip promoting a 7-minute introduction to the DIGITS 2 deep learning training system. Watch the full-length video.

DNN Development and Deployment with DIGITS

Developing an optimized DNN is an iterative process. A data scientist may start from a popular network configuration such as “AlexNet” or create a custom network, and then iteratively modify it into a network that is well-suited for the training data. Once they have developed an effective network, data scientists can deploy it and use it on a variety of platforms, including servers or desktop computers as well as mobile and embedded devices such as Jetson TK1 or Drive PX. Figure 1 shows the overall process, broken down into two main phases: development and deployment.

Figure 1: Deep Learning Neural Network Development and Deployment Workflow Process
Continue reading