Inference: The Next Step in GPU-Accelerated Deep Learning

At 45 images/s/W, Jetson TX1 is super efficient at deep learning inference. Read the whitepaper.
Deep learning is revolutionizing many areas of machine perception, with the potential to impact the everyday experience of people everywhere. At a high level, working with deep neural networks is a two-stage process: first, a neural network is trained, meaning its parameters are determined using labeled examples of inputs and desired outputs. Then, the network is deployed to run inference, using its previously trained parameters to classify, recognize and process unknown inputs.

Figure 1: Deep learning training compared to inference. In training, many inputs, often in large batches, are used to train a deep neural network. In inference, the trained network is used to discover information within new inputs that are fed through the network in smaller batches.

It is widely recognized within academia and industry that GPUs are the state of the art in training deep neural networks, due to both speed and energy efficiency advantages compared to more traditional CPU-based platforms. A new whitepaper from NVIDIA takes the next step and investigates GPU performance and energy efficiency for deep learning inference.

The results show that GPUs provide state-of-the-art inference performance and energy efficiency, making them the platform of choice for anyone wanting to deploy a trained neural network in the field. In particular, the NVIDIA GeForce GTX Titan X delivers between 5.3 and 6.7 times higher performance than the 16-core Intel Xeon E5 CPU while achieving 3.6 to 4.4 times higher energy efficiency. The NVIDIA Tegra X1 SoC also posts impressive results, achieving higher performance (258 vs. 242 images/second) and much higher energy efficiency (45 vs. 3.9 images/second/Watt) than the state-of-the-art Intel Core i7 6700K.

Figure 4. Jetson TX1 Developer Kit, including module, reference carrier and camera board.

NVIDIA® Jetson™ TX1 Supercomputer-on-Module Drives Next Wave of Autonomous Machines

Figure 1. The 50x87mm embedded Jetson TX1 module and thermal plate, featuring integrated Maxwell GPU, ARMv8 CPU, and H.265 video processor.

Today NVIDIA introduced Jetson TX1, a small form-factor Linux system-on-module destined for demanding embedded applications in visual computing. Designed for developers and makers everywhere, the miniature Jetson TX1 (Figure 1) deploys teraflop-level supercomputing performance onboard platforms in the field. Backed by the Jetson TX1 Developer Kit, a premier developer community, and a software ecosystem including JetPack, Linux for Tegra R23.1, CUDA Toolkit 7, cuDNN, and VisionWorks, Jetson gives machines everywhere the proverbial brains required to achieve advanced levels of autonomy in today’s world.

Aimed at developers interested in computer vision and on-the-fly sensing, Jetson TX1’s credit-card footprint and low power consumption make it well suited for deployment onboard embedded systems with constrained size, weight, and power (SWaP). Jetson TX1 exceeds the performance of Intel’s high-end Core i7-6700K Skylake in deep learning classification with Caffe while drawing only a fraction of the power, achieving more than ten times the performance per watt.

Jetson provides superior efficiency while maintaining a developer-friendly environment for agile prototyping and product development, removing extra legwork typically associated with deploying power-limited embedded systems. Jetson TX1’s small form-factor module enables developers everywhere to deploy Tegra into embedded applications ranging from autonomous navigation to deep learning-driven inference and analytics.


Accelerating Hyperscale Datacenter Applications with Tesla GPUs

The internet has changed how people consume media. Rather than just watching television and movies, the combination of ubiquitous mobile devices, massive computation, and available internet bandwidth has led to an explosion in user-created content: users are recreating the internet, producing exabytes of content every day.


Periscope, a mobile application that lets users broadcast video to followers, has 10 million users who broadcast over 40 years of video per day. Twitch, a popular game broadcasting service, revealed last month that 1.7 million users have live-streamed 7.5 billion minutes of content. China’s biggest search engine, Baidu, processes 6 billion queries per day, and 10% of those queries use speech. About 300 hours of video are uploaded to YouTube every minute. And just last week, Mark Zuckerberg shared that Facebook users view 8 billion videos every day—a number that has grown by a factor of 8 in about a year.

This massive scale of content requires massive amounts of processing, and because of the volume of media involved, datacenter workloads are changing. Increasing resources are spent on video and image processing: resizing, transcoding, filtering and enhancement. Likewise, large-scale machine learning and deep learning systems increasingly perform what’s known as “inference”: applying trained models to tasks such as image classification, object detection, machine translation, and speech recognition.



Deep Learning in a Nutshell: Core Concepts

This post is the first in a series I’ll be writing for Parallel Forall that aims to provide an intuitive and gentle introduction to deep learning. It covers the most important deep learning concepts, aiming to convey an understanding of each concept rather than its mathematical and theoretical details. While the mathematical terminology is sometimes necessary and can further understanding, these posts use analogies and images whenever possible to provide easily digestible bits comprising an intuitive overview of the field of deep learning.

I wrote this series in a glossary style so it can also be used as a reference for deep learning concepts.

Part 1 focuses on introducing the main concepts of deep learning. Future posts will provide historical background and delve into the training procedures, algorithms and practical tricks that are used in training for deep learning.

Core Concepts

Machine Learning

In machine learning we (1) take some data, (2) train a model on that data, and (3) use the trained model to make predictions on new data. The process of training a model can be seen as a learning process where the model is exposed to new, unfamiliar data step by step. At each step, the model makes predictions and gets feedback about how accurate its generated predictions were. This feedback, which is provided in terms of an error according to some measure (for example distance from the correct solution), is used to correct the errors made in prediction.

The learning process is often a game of back-and-forth in the parameter space: if you tweak a parameter of the model to get one prediction right, the model may get a previously correct prediction wrong as a result. It may take many iterations to train a model with good predictive performance. This iterative predict-and-adjust process continues until the predictions of the model no longer improve.
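To make this predict-and-adjust loop concrete, here is a minimal, hypothetical sketch (not taken from the series): it fits the single parameter of a linear model y = w·x by repeatedly predicting, measuring the error against a handful of labeled examples, and nudging the parameter in the direction that reduces that error. The data and learning rate are invented purely for illustration.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    // Toy labeled data: inputs x and desired outputs y (here y = 2x, so the
    // "right answer" for the parameter w is 2).
    std::vector<double> x = {1.0, 2.0, 3.0, 4.0};
    std::vector<double> y = {2.0, 4.0, 6.0, 8.0};

    double w = 0.0;          // model parameter, initially wrong
    const double lr = 0.01;  // learning rate (how far to adjust per step)

    for (int step = 0; step < 200; ++step) {
        double grad = 0.0;
        // Predict with the current parameter and accumulate the error feedback.
        for (std::size_t i = 0; i < x.size(); ++i) {
            double pred  = w * x[i];     // model prediction
            double error = pred - y[i];  // distance from the correct solution
            grad += 2.0 * error * x[i];  // gradient of the squared error w.r.t. w
        }
        // Adjust the parameter slightly in the direction that reduces the error.
        w -= lr * grad / x.size();
    }
    std::printf("learned w = %f\n", w);  // converges toward 2.0
    return 0;
}
```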

Feature Engineering

Feature engineering is the art of extracting useful patterns from data that will make it easier for machine learning models to distinguish between classes. For example, you might take the number of greenish vs. bluish pixels as an indicator of whether a land or water animal is in some picture. This feature is helpful for a machine learning model because it limits the number of classes that need to be considered for a good classification.
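As a toy illustration of that greenish-vs-bluish idea (again a hypothetical sketch, not code from the post), a hand-engineered feature can be as simple as the fraction of pixels whose green channel exceeds their blue channel; a classifier then works with this single number instead of the raw pixels.

```cpp
#include <cstdio>
#include <vector>

// One RGB pixel; channel values in [0, 255].
struct Pixel { unsigned char r, g, b; };

// Hand-engineered feature: fraction of pixels that look more green than blue.
// A high value hints at a land animal on grass; a low value hints at water.
double greenish_fraction(const std::vector<Pixel>& image) {
    if (image.empty()) return 0.0;
    int greenish = 0;
    for (const Pixel& p : image)
        if (p.g > p.b) ++greenish;
    return static_cast<double>(greenish) / image.size();
}

int main() {
    // Tiny made-up "images": a green meadow scene and a blue ocean scene.
    std::vector<Pixel> meadow = {{30, 200, 40}, {25, 180, 60}, {40, 210, 50}};
    std::vector<Pixel> ocean  = {{20, 60, 200}, {30, 80, 220}, {10, 70, 190}};
    std::printf("meadow feature: %.2f\n", greenish_fraction(meadow));  // 1.00
    std::printf("ocean  feature: %.2f\n", greenish_fraction(ocean));   // 0.00
    return 0;
}
```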


Performance Portability from GPUs to CPUs with OpenACC

OpenACC gives scientists and researchers a simple and powerful way to accelerate scientific computing applications incrementally. The OpenACC API describes a collection of compiler directives to specify loops and regions of code in standard C, C++, and Fortran to be offloaded from a host CPU to an attached accelerator. OpenACC is designed for portability across operating systems, host CPUs, and a wide range of accelerators, including APUs, GPUs, and many-core coprocessors.

Register for a free online OpenACC training course!

And starting today, with the PGI Compiler 15.10 release, OpenACC enables performance portability between accelerators and multicore CPUs. The new PGI Fortran, C and C++ compilers for the first time allow OpenACC-enabled source code to be compiled for parallel execution on either a multicore CPU or a GPU accelerator. This capability provides tremendous flexibility for programmers, enabling applications to take advantage of multiple system architectures with a single version of the source code.
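To illustrate what this looks like in practice, here is a minimal OpenACC-annotated SAXPY loop in C++ (a sketch, not code from the release). The directive is standard OpenACC; the build lines in the comments assume a PGI 15.10-era compiler and its tesla/multicore targets, so treat the exact flags as an assumption that may vary by release.

```cpp
// saxpy_acc.cpp -- the same source can target a GPU or a multicore CPU.
// Assumed build lines (flags vary by compiler release):
//   GPU:       pgc++ -acc -ta=tesla     saxpy_acc.cpp
//   Multicore: pgc++ -acc -ta=multicore saxpy_acc.cpp
#include <cstdio>
#include <vector>

void saxpy(int n, float a, const float* x, float* y) {
    // OpenACC: parallelize this loop on the selected target and manage the
    // data movement needed when the target is an attached accelerator.
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    std::vector<float> x(n, 1.0f), y(n, 2.0f);
    saxpy(n, 3.0f, x.data(), y.data());
    std::printf("y[0] = %f\n", y[0]);  // expect 5.0
    return 0;
}
```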

“Our goal is to enable HPC developers to easily port applications across all major CPU and accelerator platforms with uniformly high performance using a common source code base,” said Douglas Miles, director of PGI Compilers & Tools at NVIDIA. “This capability will be particularly important in the race towards exascale computing in which there will be a variety of system architectures requiring a more flexible application programming approach.”

OpenACC Portable Performance
As the chart above shows, performance on multicore CPUs for HPC apps using MPI + OpenACC is equivalent to MPI + OpenMP code. Compiling and running the same code on a Tesla K80 GPU can provide large speedups.


Deep Learning for Computer Vision with MATLAB and cuDNN

Deep learning is becoming ubiquitous. With recent advancements in deep learning algorithms and GPU technology, we are able to solve problems once considered impossible in fields such as computer vision, natural language processing, and robotics.

Figure 1: Pet detection and recognition system.

Deep learning uses deep neural networks which have been around for a few decades; what’s changed in recent years is the availability of large labeled datasets and powerful GPUs. Neural networks are inherently parallel algorithms and GPUs with thousands of cores can take advantage of this parallelism to dramatically reduce computation time needed for training deep learning networks. In this post, I will discuss how you can use MATLAB to develop an object recognition system using deep convolutional neural networks and GPUs.

Why Deep Learning for Computer Vision?

Machine learning techniques use data (images, signals, text) to train a machine (or model) to perform a task such as image classification, object detection, or language translation. Classical machine learning techniques are still being used to solve challenging image classification problems. However, they don’t work well when applied directly to images, because they ignore the structure and compositional nature of images. Until recently, state-of-the-art techniques made use of feature extraction algorithms that extract interesting parts of an image as compact low-dimensional feature vectors. These were then used along with traditional machine learning algorithms.

Enter deep learning. Deep convolutional neural networks (CNNs), a specific type of deep learning algorithm, address the gaps in traditional machine learning techniques, changing the way we solve these problems. CNNs not only perform classification, but they can also learn to extract features directly from raw images, eliminating the need for manual feature extraction. For computer vision applications you often need more than just image classification; you need state-of-the-art computer vision techniques for object detection, a bit of domain expertise, and the know-how to set up and use GPUs efficiently. Through the rest of this post, I will use an object recognition example to illustrate how easy it is to use MATLAB for deep learning, even if you don’t have extensive knowledge of computer vision or GPU programming.


Cutting Edge Parallel Algorithms Research with CUDA

Leyuan Wang, a Ph.D. student in the UC Davis Department of Computer Science, presented one of only two “Distinguished Papers” of the 51 accepted at Euro-Par 2015. Euro-Par is a European conference devoted to all aspects of parallel and distributed processing, held August 24-28 at Austria’s Vienna University of Technology.

Leyuan’s paper, “Fast Parallel Suffix Array on the GPU,” co-authored by her advisor John Owens and Sean Baxter, a research scientist at New York’s D. E. Shaw Research, details their efforts to implement a linear-time suffix array construction algorithm on NVIDIA GPUs, resulting in algorithmic improvements and significant speedups over the existing state of the art.

Wang completed her master’s degree in electrical and computer engineering at UC Davis in October 2014, after having earned her undergraduate degree in electronics science and technology at China’s Zhejiang University.

Brad: Can you talk a bit about your current research?

Leyuan Wang: I work on high-performance string processing and graph processing algorithms, mostly in string and graph queries. My current research focus is on GPGPU (general-purpose computing on graphics processing units), and the benchmark I care about most is speed. I’ve been working on designing and improving parallel suffix array construction algorithms (SACAs) and incorporating the implementations into a Burrows-Wheeler transform-based lossless data compression application (bzip2) and a parallel FM index for pattern searching. The suffix array (SA) of a string is the sorted set of all suffixes of the string. The inverse suffix array (ISA) gives the lexicographic rank of each suffix.

The Burrows-Wheeler transform (BWT) of a string is generated by lexicographically sorting the cyclic shifts of the string to form a string matrix and taking the last column of the matrix. The BWT groups repeated characters together by permuting the string; it is also reversible, which means the original string can be recovered. These two characteristics make the BWT a popular choice for a compression pipeline stage (for instance, bzip2). It is directly related to the suffix array: the sorted rows of the matrix are essentially the sorted suffixes of the string, and the first column of the matrix reflects the suffix array. Table 1 shows an example of the SA, ISA and BWT of the input string “banana$”.

Table 1: SA, ISA and BWT for the example string “banana$”.
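To make these definitions concrete, the following small CPU-only sketch reproduces the SA, ISA and BWT of “banana$” with a naive sort of all suffixes. It is for illustration only and is unrelated to the paper’s parallel GPU construction, which is what makes the problem interesting at scale.

```cpp
#include <algorithm>
#include <cstdio>
#include <numeric>
#include <string>
#include <vector>

int main() {
    const std::string s = "banana$";  // '$' is a unique, smallest terminator
    const int n = static_cast<int>(s.size());

    // Suffix array: starting positions of all suffixes, sorted lexicographically.
    std::vector<int> sa(n);
    std::iota(sa.begin(), sa.end(), 0);
    std::sort(sa.begin(), sa.end(), [&](int a, int b) {
        return s.compare(a, std::string::npos, s, b, std::string::npos) < 0;
    });

    // Inverse suffix array: lexicographic rank of the suffix at each position.
    std::vector<int> isa(n);
    for (int rank = 0; rank < n; ++rank)
        isa[sa[rank]] = rank;

    // BWT: last column of the sorted cyclic-shift matrix, i.e. the character
    // preceding each sorted suffix (wrapping around at position 0).
    std::string bwt;
    for (int rank = 0; rank < n; ++rank)
        bwt += s[(sa[rank] + n - 1) % n];

    for (int i = 0; i < n; ++i) std::printf("SA[%d]=%d ", i, sa[i]);
    std::printf("\n");
    for (int i = 0; i < n; ++i) std::printf("ISA[%d]=%d ", i, isa[i]);
    std::printf("\nBWT = %s\n", bwt.c_str());
    return 0;
}
```

For “banana$” this prints SA = 6 5 3 1 0 4 2, ISA = 4 3 6 2 5 1 0, and BWT = “annb$aa”.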

The suffix array data structure is a building block in a spectrum of applications, including data compression, bioinformatics, and text indexing. I’ve studied the taxonomy of all classes of SACAs and compared them in order to find the best candidate for the GPU. I revisited the previous conclusion that skew SACAs are best suited to the GPU by demonstrating that prefix-doubling SACAs are actually better in both theoretical analysis and experimental benchmarks. Our hybrid skew/prefix-doubling suffix array implementation (with our amazing research collaborator Sean Baxter, formerly of NVIDIA Research) achieves a 7.9x speedup on a Tesla K20 over the previous state-of-the-art skew implementation. Our optimized skew SACA implementation has been added as a primitive to CUDPP 2.2 (the CUDA Data Parallel Primitives Library) and incorporated into the BWT and bzip2 data compression application, resulting in great speedups compared with bzip2 in CUDPP 2.1. Figure 1 shows pseudocode for our two approaches.


Accelerating Materials Discovery with CUDA

In this post, we discuss how CUDA has facilitated materials research in the Department of Chemical and Biomolecular Engineering at UC Berkeley and Lawrence Berkeley National Laboratory. This post is a collaboration between Cory Simon, Jihan Kim, Richard L. Martin, Maciej Haranczyk, and Berend Smit.

Engineering Applications of Nanoporous Materials

Figure 1: The repeating crystal structure of metal-organic framework IRMOF-1. Atom color dictionary = {carbon: gray, oxygen: red, zinc: blue, hydrogen: white}.

Nanoporous materials have nano-sized pores such that only a few molecules can fit inside. Figure 1 shows the chemical structure of metal-organic framework IRMOF-1, just one of the many thousands of nanoporous materials that have been synthesized.

Nanoporous materials have many potential engineering applications based on gas adsorption: the process by which gas molecules adhere to a surface. In this case, the walls of the material’s pores form the surface to which gas molecules stick. Figure 2 shows the unit cell of the IRMOF-1 crystal structure and the corresponding depiction of IRMOF-1 as a raveled-up surface.

If we could unravel and flatten out the surface of IRMOF-1 in Figure 2, the surface area contained in a single gram of it could cover more than a soccer field! This provides a lot of surface area on which gas molecules can adsorb. These high surface areas are part of the reason that nanoporous materials are so promising for many engineering applications.

Figure 2: A nanoporous material can be abstracted as a raveled-up surface. On the left is the unit cell of the IRMOF-1 crystal structure. On the right is a depiction of the surface that IRMOF-1 forms.



Open, Reproducible Computational Chemistry with Python and CUDA

Increasingly, computational chemistry researchers use GPUs to push the boundaries of discovery. This motivated Christopher Cooper, an Instructor at Universidad Técnica Federico Santa María in Chile, to move to a Python-based software stack.

Cooper’s paper, “Probing protein orientation near charged nanosurfaces for simulation-assisted biosensor design,” was recently accepted in J. Chemical Physics.

Brad: Can you talk a bit about your current research?

Christopher: I am interested in developing fast and accurate algorithms to study the effect of electrostatics in protein systems. We use continuum models to represent the solvent around the protein (water with salt) via the Poisson-Boltzmann equation, and solve it with an accelerated boundary element method. We call the resulting code PyGBe, which is open-source software with an MIT license, and is available to download via the GitHub account of the research group where I did my Ph.D. at Boston University.

Figure 1: Electrostatic potential around a peptide derived from an HIV-1 capsid.



Customize CUDA Fortran Profiling with NVTX

The NVIDIA Tools Extension (NVTX) library lets developers annotate custom events and ranges within the profiling timelines generated by tools such as the NVIDIA Visual Profiler (NVVP) and Nsight. In my own optimization work, I rely heavily on NVTX to better understand internal as well as customer codes and to spot opportunities for better interaction between the CPU and the GPU.

Two previous Pro Tip posts on Parallel Forall showed how to use NVTX in CUDA C++ and MPI codes. In this post, I’ll show how to use NVTX to annotate the profiles of Fortran codes (with either CUDA Fortran or OpenACC).

NVTX has a lot of features, but here I’ll focus on using it to annotate the profiler output with timeline markers using nvtxRangePush() and nvtxRangePop(). I’ll show you how to insert markers with custom labels and colors.
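The post builds Fortran interfaces to these calls; for orientation, here is a minimal sketch of the underlying NVTX C API that those markers map to, written in C++ (the build line is an assumption; you link against the NVTX library shipped with the CUDA Toolkit).

```cpp
// nvtx_markers.cpp -- minimal sketch of NVTX range annotations via the C API.
// Assumed build line (paths vary): g++ nvtx_markers.cpp -I/usr/local/cuda/include -lnvToolsExt
#include <nvToolsExt.h>
#include <cmath>
#include <vector>

void initialize(std::vector<double>& v) {
    nvtxRangePushA("initialize");  // open a labeled range on the profiler timeline
    double val = 0.0;
    for (double& x : v)
        x = val++;
    nvtxRangePop();                // close the most recently opened range
}

void compute(std::vector<double>& v) {
    nvtxRangePushA("compute");
    for (double& x : v)
        x = std::sqrt(x);
    nvtxRangePop();
}

int main() {
    std::vector<double> v(1 << 20);
    initialize(v);  // shows up as an "initialize" range in NVVP/Nsight
    compute(v);     // shows up as a "compute" range
    return 0;
}
```

Custom colors go through nvtxRangePushEx() with an nvtxEventAttributes_t structure rather than the plain string call shown here.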