Production Deep Learning with NVIDIA GPU Inference Engine

Figure 1. NVIDIA GPU Inference Engine (GIE) provides even higher efficiency and performance for neural network inference. Tests performed using GoogLenet. CPU-only: Single-socket Intel Xeon (Haswell) E5-2698 v3@2.3GHz with HT. GPU: NVIDIA Tesla M4 + cuDNN 5 RC. GPU + GIE: NVIDIA Tesla M4 + GIE.
Figure 1. NVIDIA GPU Inference Engine (GIE) provides even higher efficiency and performance for neural network inference. Tests performed using GoogLenet. CPU-only: Single-socket Intel Xeon (Haswell) E5-2698 v3@2.3GHz with HT.
GPU: NVIDIA Tesla M4 + cuDNN 5 RC.

Today at ICML 2016, NVIDIA announced its latest Deep Learning SDK updates, including DIGITS 4, cuDNN 5.1 (CUDA Deep Neural Network Library) and the new GPU Inference Engine.

NVIDIA GPU Inference Engine (GIE) is a high-performance deep learning inference solution for production environments. Power efficiency and speed of response are two key metrics for deployed deep learning applications, because they directly affect the user experience and the cost of the service provided. GIE automatically optimizes trained neural networks for run-time performance, delivering up to 16x higher performance per watt on a Tesla M4 GPU compared to the CPU-only systems commonly used for inference  today.

Figure 1 shows GIE inference performance per watt of the relatively complex GoogLeNet running on a Tesla M4. GIE can deliver 20 Images/s/Watt on the simpler AlexNet benchmark.

In this post, we will discuss how you can use GIE to get the best efficiency and performance out of your trained deep neural network on a GPU-based deployment platform.

Deep Learning Training and Deployment

Solving a supervised machine learning problem with deep neural networks involves a two-step process.

  1. The first step is to train a deep neural network on massive amounts of labeled data using GPUs. During this step, the neural network learns millions of weights or parameters that enable it to map input data examples to correct responses. Training requires iterative forward and backward passes through the network as the objective function is minimized with respect to the network weights. Often several models are trained and accuracy is validated against data not seen during training in order to estimate real-world performance.
  2. The next step–inference–uses the trained model to make predictions from new data. During this step, the best trained model is used in an application running in a production environment such as a data center, an automobile, or an embedded platform. For some applications, such as autonomous driving, inference is done in real time and therefore high throughput is critical.

To learn more about the differences between training and inference, see Michael Andersch’s post on inference with GPUs.

The target deployment environment introduces various challenges that are typically not present in the training environment. For example, if the target is an embedded device using the trained neural network to perceive its surroundings, then the forward inference pass through the model has a direct impact on the overall response time and the power consumed by the device.  The key metric to optimize is power efficiency: the inference performance per watt.

Performance per watt is also the critical metric in maximizing data center operational efficiency.  In this scenario, the need to minimize latency and energy used on large volumes of geographically and temporally disparate requests limits the ability to form large batches.

Introducing GPU Inference Engine

GIE is a high-performance inference engine designed to deliver maximum inference throughput and efficiency for common deep learning applications such as image classification, segmentation, and object detection. GIE optimizes your trained neural networks for run-time performance and delivers GPU-accelerated inference for web/mobile, embedded and automotive applications.

Figure 2. The GPU Inference Engine Workflow. Step 1: Optimize a trained neural network. Step 2: Perform real-time inference with the GIE runtime.
Figure 2. The GPU Inference Engine Workflow.

Continue reading


Inside Pascal: NVIDIA’s Newest Computing Platform

At the 2016 GPU Technology Conference in San Jose, NVIDIA CEO Jen-Hsun Huang announced the new NVIDIA Tesla P100, the most advanced accelerator ever built. Based on the new NVIDIA Pascal GP100 GPU and powered by ground-breaking technologies, Tesla P100 delivers the highest absolute performance for HPC, technical computing, deep learning, and many computationally intensive datacenter workloads. Today at ISC 2016 in Frankfurt, NVIDIA expanded the Tesla P100 product family with the new Tesla P100 for PCIe-based servers.


In this blog post I’ll provide an overview of the Pascal architecture and its benefits to you as a developer.

At GTC, Lars Nyland and I gave a talk about details of the Tesla P100 and the Pascal GP100 architecture. The slides and recording from this talk are now available (GTC on-demand site registration required). To learn more, read the Tesla P100 white paper.

Tesla P100: Extreme Performance and Features for GPU Computing

The GP100 GPU used in Tesla P100 incorporates multiple revolutionary new features and unprecedented performance. Key features of Tesla P100 include:

  • Extreme performance—powering HPC, deep learning, and many more GPU Computing areas;
  • NVLink™—NVIDIA’s new high speed, high bandwidth interconnect for maximum application scalability;
  • HBM2—Fastest, high capacity, extremely efficient stacked GPU memory architecture;
  • Unified Memory and Compute Preemption—significantly improved programming model;
  • 16nm FinFET—enables more features, higher performance, and improved power efficiency.

Tesla P100 Available for NVLink and PCIe Servers

Tesla P100 PCIe GPU Accelerator Board
Tesla P100 PCIe GPU Accelerator Board

Tesla P100 accelerators will be available in two forms: A traditional GPU accelerator board for PCIe-based servers, and an SXM2 module for NVLink-optimized servers. P100 for PCIe-based servers allows HPC data centers to deploy the most advanced GPUs within PCIe-based nodes to support a mix of CPU and GPU workloads. P100 for NVLink-optimized servers provides the best performance and strong-scaling  for hyperscale and HPC data centers running applications that scale to multiple GPUs, such as deep learning. The table below provides the complete specifications of both Tesla P100 accelerators.

The Pascal GP100 Architecture: Faster in Every Way

With every new GPU architecture, NVIDIA introduces major improvements to performance and power efficiency. The heart of the computation in Tesla GPUs is the SM, or streaming multiprocessor. The streaming multiprocessor creates, manages, schedules and executes instructions from many threads in parallel.

Like previous Tesla GPUs, GP100 is composed of an array of Graphics Processing Clusters (GPCs), Streaming Multiprocessors (SMs), and memory controllers. GP100 achieves its colossal throughput by providing six GPCs, up to 60 SMs, and eight 512-bit memory controllers (4096 bits total). The Pascal architecture’s computational prowess is more than just brute force: it increases performance not only by adding more SMs than previous GPUs, but by making each SM more efficient. Each SM has 64 CUDA cores and four texture units, for a total of 3840 CUDA cores and 240 texture units.

Pascal GP100 Block Diagram
Pascal GP100 Block Diagram

Continue reading


Fast Spectral Graph Partitioning on GPUs

Graphs are mathematical structures used to model many types of relationships and processes in physical, biological, social and information systems. They are also used in the solution of various high-performance computing and data analytics problems. The computational requirements of large-scale graph processing for cyberanalytics, genomics, social network analysis and other fields demand powerful and efficient computing performance that only accelerators can provide. With CUDA 8, NVIDIA is introducing nvGRAPH, a new library of GPU-accelerated graph algorithms. Its first release, nvGRAPH 1.0, supports 3 key graph algorithms (PageRank, Single-Source Shortest Path, and Single-Source Widest Path), and our engineering and research teams are already developing new parallel algorithms for future releases. I’ll discuss one of them in detail in this blog post.

Many applications need to partition graphs into subgraphs, or to find clusters within them. For example, graph partitioning can be used in the numerical solution of partial differential equations (PDEs) to perform more efficient sparse matrix-vector multiplications, and graph clustering can be used to identify communities in social networks and for cybersecurity (see Figure 1).

Figure 1: Applications of Graph Partitioning
Figure 1: Applications of Graph Partitioning

The quality of graph partitioning or clustering can have a significant impact on the overall performance of an application. Therefore it is very important not only to find the splitting into subgraphs fast by taking advantage of GPUs (our GPU spectral graph partitioning scheme performs up to 7x faster than a CPU implementation), but also to find the best possible splitting, which requires development of new algorithms.  Continue reading


Accelerate Recommender Systems with GPUs

Wei TanWei Tan, Research Staff Member at IBM T. J. Watson Research Center shares how IBM is using NVIDIA GPUs to accelerate recommender systems, which use ratings or user behavior to recommend new products, items or content to users. Recommender systems are important in applications such as recommending products on retail sites, recommending movies or music on streaming media services, and recommending news items or posts on social media and networking services. Wei Tan’s team developed cuMF, a highly optimized matrix factorization system using CUDA to accelerate recommendations used in applications like these and more.

Brad: Can you talk a bit about your current research?

Wei: Matrix factorization (MF) is at the core of many popular algorithms, such as collaborative-filtering-based recommendation, word embedding, and topic modeling. Matrix factorization factors a sparse ratings matrix  (m-by-n, with  non-zero ratings) into a m-by-f matrix (X) and a f-by-n matrix (ΘT), as Figure 1 shows.

Figure 1. Matrix factorization factors a sparse ratings matrix R (m-by-n, with N_z non-zero ratings) into a m-by-f matrix (R) and a f-by-n matrix (Θ^T).
Figure 1. Matrix factorization factors a sparse ratings matrix R (m-by-n, with Nz non-zero ratings) into a m-by-f matrix (R) and a f-by-n matrix (ΘT ).

Suppose we obtained m users’ ratings on  items (say, movies). If user u rated item v, we use r_{uv} as the non-zero element of R at position (u, v). We want to minimize the following cost function JContinue reading


Train Your Reinforcement Learning Agents at the OpenAI Gym

Today OpenAI, a non-profit artificial intelligence research company, launched OpenAI Gym, a toolkit for developing and comparing reinforcement learning algorithms. It supports teaching agents everything from walking to playing games like Pong or Go.

John Schulman is a researcher at OpenAI.
John Schulman is a researcher at OpenAI.

OpenAI researcher John Schulman shared some details about his organization, and how OpenAI Gym will make it easier for AI researchers to design, iterate and improve their next generation applications. John studied physics at Caltech, and went to UC Berkeley for graduate school. There, after a brief stint in neuroscience, he studied machine learning and robotics under Pieter Abbeel, eventually honing in on reinforcement learning as his primary topic of interest. John lives in Berkeley, California, where he enjoys running in the hills and occasionally going to the gym.

What is OpenAI? What is your mission?

OpenAI is a non-profit artificial intelligence research company. Day to day, we are working on research projects in unsupervised learning and reinforcement learning. Our mission and long-term goal is to advance artificial intelligence in the ways that will maximally benefit humanity as a whole.

What is reinforcement learning? How is it different from supervised and unsupervised learning?


Reinforcement learning (RL) is the branch of machine learning that is concerned with making sequences of decisions. It assumes that there is an agent that is situated in an environment. At each step, the agent takes an action, and it receives an observation and reward from the environment. An RL algorithm seeks to maximize the agent’s total reward, given a previously unknown environment, through a learning process that usually involves lots of trial and error.

The reinforcement learning problem sketched above, involving a reward-maximizing agent, is extremely general, and RL algorithms have been applied in a variety of different fields. They have been applied in business management problems such as deciding how much inventory a store should hold or how it should set prices. They have also been applied to robotic control problems, and rapid development is currently occurring in this area. The following video shows Hopper: a two-dimensional one-legged robot trained to hop forward as fast as possible with OpenAI Gym.

Continue reading

Figure 5: Ring order of GPUs in PCIe tree.

Fast Multi-GPU collectives with NCCL

Today many servers contain 8 or more GPUs. In principle then, scaling an application from one to many GPUs should provide a tremendous performance boost. But in practice, this benefit can be difficult to obtain. There are two common culprits behind poor multi-GPU scaling. The first is that enough parallelism has not been exposed to efficiently saturate the processors. The second reason for poor scaling is that processors exchange too much data and spend more time communicating than computing. To avoid such communication bottlenecks, it is important to make the most of the available inter-GPU bandwidth, and this is what NCCL is all about.

NCCL (pronounced “Nickel”) is a library of multi-GPU collective communication primitives that are topology-aware and can be easily integrated into your application. Initially developed as an open-source research project, NCCL is designed to be light-weight, depending only on the usual C++ and CUDA libraries. NCCL can be deployed in single-process or multi-process applications, handling required inter-process communication transparently. Finally, the API will be very familiar to anyone with experience using MPI’s collectives.

Figure 1: Illustration of the All-Reduce collective.
Figure 1: Illustration of the All-Reduce collective.

Collective Communication

Collective communication routines are common patterns of data transfer among many processors. If you have experience with MPI, then you are probably already familiar with several collective operations. For example, all-reduce starts with independent arrays $latex V_k$ of N values on each of K processors and ends with identical arrays S of N values, where $S[k] = V_0[k]+V_1[k]+…+V_k[k]$, for each processor k. See Figure 1. Continue reading


Optimizing Recurrent Neural Networks in cuDNN 5

Figure 1: cuDNN 5 + Torch speedup vs. Torch GPU implementation, M40, Intel® Xeon® Processor E5-2698 Network A: RNN size 2560, Wordvec size 2560, num layers 1, Seq length 200, max epochs 1 Network B: RNN size 256, num layers 3, max epochs 50, batch size 64 Network C: RNN size 256, Wordvec size 256, num layers 1, Seq length 1000, max epochs 1
Figure 1: cuDNN 5 + Torch speedup vs. Torch-rnn implementation, M40, Intel® Xeon® Processor E5-2698 Network A: RNN size 2560, input size 2560, 1 layer, Seq length 200, batch size 64. Network B: RNN size 256, input size 64, 3 layers, batch size 64. Network C: RNN size 256, input size 256, 1 layer, batch size 32, Seq length 1000

This week at GTC 2016, we announced the latest update to NVIDIA Deep Learning SDK, which now includes cuDNN 5. Version 5 offers new features, improved performance and support for the latest generation NVIDIA Tesla P100 GPU. New features in cuDNN 5 include:

  • Faster forward and backward convolutions using the Winograd convolution algorithm;
  • 3D FFT Tiling;
  • Spatial Transformer Networks;
  • Improved performance and reduced memory usage with FP16 routines on Pascal GPUs;
  • Support for LSTM recurrent neural networks for sequence learning that deliver up to 6x speedup.

One of the new features we’ve added in cuDNN 5 is support for Recurrent Neural Networks (RNN). RNNs are a powerful tool used for sequence learning in a number of fields, from speech recognition to image captioning. For a brief high-level introduction to RNNs, LSTM and sequence learning, I recommend you check out Tim Dettmers recent post Deep Learning in a Nutshell: Sequence Learning, and for more depth, Soumith Chintala’s post Understanding Natural Language with Deep Neural Networks Using Torch.

I’m excited about the RNN capabilities in cuDNN 5; we’ve put a lot of effort into optimizing their performance on NVIDIA GPUs, and I’ll go into some of the details of these optimizations in this blog post.

cuDNN 5 supports four RNN modes:  ReLU activation function, tanh activation function, Gated Recurrent Units (GRU), and Long Short-Term Memory (LSTM). In this case study I’ll look at the performance of an LSTM network, but most of the optimizations can be applied to any RNN. Continue reading


CUDA 8 Features Revealed

This week at the GPU Technology Conference developers are getting a preview of some powerful new features coming in CUDA 8 later this year. In this post I wanted to give a quick overview of the major new features of CUDA 8:

  • Support for the Pascal GPU architecture, including the new Tesla P100 accelerator;
  • New Unified Memory capabilities;
  • The new nvGRAPH GPU-Accelerated Graph Analytics library;
  • Powerful new profiling capabilities; and
  • Improved compiler performance and heterogeneous lambda support.

To learn more at GTC 2016, you can check out my talk from GTC 2016, “CUDA 8 and Beyond” (GTC on-demand site registration required).

CUDA 8 Supports the new NVIDIA Pascal Architecture

A crucial goal for CUDA 8 is to provide support for the powerful new Pascal architecture, the first incarnation of which was launched at GTC today: Tesla P100. For full details on P100 and the Pascal GP100 GPU architecture, check out the blog post “Inside Pascal“.

pascal_key_imageIn a nutshell, Tesla P100 provides massive double-, single- and half-precision computational performance, 3x the memory bandwidth of Maxwell GPUs via HBM2 stacked memory, and with its support for NVLink, up to 5x the GPU-GPU communication performance of PCI Express. Pascal also improves support for Unified Memory thanks to a larger virtual address space and a new page migration engine.

CUDA 8 will enable CUDA applications to get high performance on Tesla P100 out of the box. Moreover, improvements in CUDA 8 enable developing efficient code for new Tesla P100 features such as NVLink and improved Unified Memory. Continue reading


What to Do with All That Bandwidth? GPUs for Graph and Predictive Analytics

Figure 1: Graph algorithms exhibit non-locality and data-dependent parallelism. Large graphs, such as this map of the internet, represent billion-edge challenges that push the bandwidth limits of existing hardware architectures.
Figure 1: Graph algorithms exhibit non-locality and data-dependent parallelism. Large graphs, such as this map of the internet, represent billion-edge challenges to existing hardware architectures.

Did you see the White House’s recent initiative on Precision Medicine and how it is transforming the ways we can treat cancer? Have you avoided clicking on a malicious website based on OpenDNS’s SecureRank predictive analytics? Are you using the Wikidata Query Service to gather data to use in your machine learning or deep learning application? If so, you have seen the power of graph applications. Graphs are not just nodes and links, they are powerful data structures you can use to represent complex dependencies in your data.   Graph applications are everywhere, ranging from cancer research to large-scale cyber threat detection to collaborative filtering recommendation systems.   Social networks, protein networks, and communication networks are all examples that can benefit from graph-based techniques.  

See BlazeGraph talks at GTC 2016 or visit Booth #206. Register now to save 20% with code “S7B2XS”!
The trick is that you can’t use the same techniques for graphs as for other big data challenges. Many applications make great use of the processing power of GPUs. In the world of data-intensive analytics, however, memory bandwidth—not processor speed—is the primary performance limiter. Because graph algorithms exhibit non-locality and data-dependent parallelism, when you traverse a large graph you are constantly paging from main memory. For these problems, GPUs provide superior bandwidth to memory and can deliver significant speed-ups over CPUs.

Unless you’ve been heads down performance tuning your CUDA code for the last year, you’ve seen the buzz around GPU-accelerated big data analytics. We foresee the trend of increasing memory bandwidth and the arrival of NVLink and Pascal enabling faster and easier multi-GPU computing.  So how do you apply all that bandwidth for graph analytics?

At BlazeGraph, our answer is to build software that exploits the bandwidth advantages of GPUs to accelerate graph and predictive analytics. In this blog post, I will describe our journey so far and show you where we’re heading in the future.

Our Journey to Graphs on GPUs

In 2013, we started working on the DARPA XDATA program. This lead to the development of MapGraph, a high-level API for GPU-accelerated graph analytics, in 2014. We first started using libraries like moderngpu, cub, and others in our software, which we still use today. Building on prior success in scalable graph traversal on GPUs, which showed the potential for graphs on GPUs and with DARPA funding, we collaborated with the University of Utah SCI Institute to extend MapGraph to run on multi-GPU clusters. At the IEEE Bigdata conference later in 2014 we demonstrated a breadth-first search (BFS) on a Graph 500 scale 27 (4.3 billion edges) graph at a rate of 29.1 billion traversed edges per second (GTEPS) running on a cluster of 64 NVIDIA K40 on the PSG Cluster.

Figure 2: Blazegraph results on GPU clusters from 1 to 64 GPUs.
Figure 2: Blazegraph results on GPU clusters from 1 to 64 GPUs.

Continue reading


GPUs and DSLs for Life Insurance Modeling

The Solvency II EU Directive came into effect at the beginning of the year. It harmonizes insurance regulation in the European Union with an economic and risk based approach, which considers the full balance sheet of insurers and re-insurers. In the case of life insurers and pension funds, this requires the calculation of the economic value of the liabilities—the contractual commitments the company has to meet—for long term contracts.

Calculating the economic value of the liabilities and capturing the dependence of the liabilities to different scenarios such as movements of the interest rate or changes of mortality cannot be achieved without detailed models of the underlying contracts and requires a significant computational effort.

A Perfect GPU Use Case

QuantAlea LogoThe calculations have to be executed for millions of pension and life insurance contracts and have to be performed for thousands of interest rate and mortality scenarios. Because of the level of parallelism, this is an excellent case for the application of GPUs and GPU clusters. In addition, variations in the products have to be captured. While implementing a separate code for many products is possible, a lot can be gained from abstractions at a higher level.

To solve these problem, we use the following technologies.

  1. The Actulus Modeling Language (AML), a domain specific language for actuarial modeling;
  2. Alea GPU, QuantAlea’s high performance GPU compiler for .NET C# and F#. See this previous post on Alea GPU;
  3. The modern functional-first programming language F#.

Armed with these technologies, we can significantly improve the level of abstraction and increase generality. Our system will allow actuaries to be more productive and to harness the power of GPUs without any GPU coding. The performance gain of GPU computing makes it much more practical and attractive to use proper stochastic models and to experiment with a large and diverse set of risk scenarios.

The Actulus Modeling Language

The Actulus Modeling Language (AML) is a domain-specific language for rapid prototyping in which actuaries can describe life-based pension and life insurance products, and computations on them. The idea is to write declarative AML product descriptions and from these automatically generate high-performance calculation kernels to compute reserves and cash flows under given interest rate curves and mortality curves and shocks to these. Continue reading