Accelerate Machine Learning with the cuDNN Deep Neural Network Library

Machine Learning (ML) has its origins in the field of Artificial Intelligence, which started out decades ago with the lofty goals of creating a computer that could do any work a human can do.  While attaining that goal still appears to be in the distant future, many useful tools have been developed and successfully applied to a wide variety of problems.  In fact, ML has now become a pervasive technology, underlying many modern applications.  Today the world’s largest financial companies, internet firms and foremost research institutions are using ML in applications including internet search, fraud detection, gaming, face detection, image tagging, brain mapping, check processing and computer server health-monitoring, to name a few.  The US Postal Service uses machine learning techniques for hand-writing recognition, and leading applied-research government agencies such as IARPA and DARPA are funding work to develop the next generation of ML systems.

Figure 1: :  Schematic representation of a deep neural network, showing how more complex features are captured in deeper layers.
Figure 1: : Schematic representation of a deep neural network, showing how more complex features are captured in deeper layers.

There is a wide variety of algorithms and processes for implementing ML systems. The hottest area in ML today however, is the area of Deep Neural Networks (DNNs).  The success of DNNs has been greatly accelerated by using GPUs, which have become the platform of choice for training large, complex DNN-based ML systems. Pioneers in this area include luminaries like Geoffrey Hinton, Yann LeCun, Yoshua Bengio, and Andrew Ng.  Their success over the past 30 years has inspired a groundswell of research and development in academia, including universities such as Carnegie Mellon, NYU, Oxford, Stanford, University of California at Berkeley, University of Montreal, and the University of Toronto. More recently, many commercial enterprises have also started investing aggressively in this technology.  A few that have publicly acknowledged using GPUs with deep learning include Adobe, Baidu, Nuance, and Yandex.

Because of the increasing importance of DNNs in both industry and academia and the key role of GPUs, NVIDIA is introducing a library of primitives for deep neural networks called cuDNN.  The cuDNN library makes it easy to obtain state-of-the-art performance with DNNs, and provides other important benefits.

Machine Learning with DNNs

A ML system may be thought of as a system that learns to recognize things of interest to us, without being told explicitly what the things are ahead of time. Classic examples of such a system are the spam classifier, which scans your incoming messages and quarantines spam emails, and product recommender systems which suggest new products (books, movies, etc.) that you might like based on your prior purchases and ratings. Continue reading


3 Versatile OpenACC Interoperability Techniques

OpenACC is a high-level programming model for accelerating applications with GPUs and other devices using compiler directives compiler directives to specify loops and regions of code in standard C, C++ and Fortran to offload from a host CPU to an attached accelerator. OpenACC simplifies accelerating applications with GPUs. An often-overlooked feature of OpenACC is its ability to interoperate with the broader parallel programming ecosystem. In this post I’ll teach you 3 powerful interoperability techniques for combining OpenACC and CUDA: the host_data construct, the deviceptr clause, and the acc_map_data() API function.

OpenACC InteropI’ll demonstrate these techniques with several examples of mixing OpenACC with CUDA C++, CUDA Fortran, Thrust, and GPU-accelerated libraries. If you’d like to follow along at home, grab the source code for the examples from Github and try them out with your OpenACC compiler and the CUDA Toolkit. Don’t have an OpenACC compiler? You can download a free 30-day trial of the PGI accelerator compiler.

You may already be thinking to yourself, “If OpenACC is so great, why would I need to use it with CUDA?” OpenACC interoperability features open the door to the GPU-computing ecosystem, allowing you to leverage more than 10 years of code development. Need to multiply two matrices together? Don’t write your own function, just call the cuBLAS library, which has been heavily optimized for GPUs. Does your colleague already have a CUDA routine that you could use in your code? Use it! Interoperability means that you can always use the best tool for the job in any situation. Accelerate your application using OpenACC, but call an optimized library. Expand an existing CUDA application by adding OpenACC to unaccelerated routines. Your choice isn’t OpenACC or CUDA, it’s OpenACC and CUDA. Continue reading


Remote application development using NVIDIA® Nsight™ Eclipse Edition

NVIDIA® Nsight™ Eclipse Edition (NSEE) is a full-featured unified CPU+GPU integrated development environment(IDE) that lets you easily develop CUDA applications for either your local (x86_64) system or a remote (x86_64 or ARM) target system. In my last post on remote development of CUDA applications, I covered NSEE’s cross compilation mode. In this post I will focus on the using NSEE’s synchronized project mode.

For remote development of CUDA applications using synchronized-project mode, you can edit code on the host system and synchronize it with the target system. In this scenario, the code is compiled natively on the target system as Figure 1 shows.

CUDA application development usage scenarios with Nsight Eclipse Edition
Figure 1: CUDA application development usage scenarios with Nsight Eclipse Edition

In synchronized project mode the host system does not need an ARM cross-compilation tool chain, so you have the flexibility to use Mac OS X or any of the CUDA supported x86_64 Linux platforms as the host system. The remote target system can be a CUDA-supported x86_64 Linux target or an ARM-based platform like the Jetson TK1 system. I am using Mac OS X 10.8.5 on my host system (with Xcode 5.1.1 installed) and 64-bit Ubuntu 12.04 on my target system. Continue reading


10 Ways CUDA 6.5 Improves Performance and Productivity

Today we’re excited to announce the release of the CUDA Toolkit version 6.5. CUDA 6.5 adds a number of features and improvements to the CUDA platform, including support for CUDA Fortran in developer tools, user-defined callback functions in cuFFT, new occupancy calculator APIs, and more.


Last year we introduced CUDA on ARM, and in March we released the Jetson TK1 developer board, which enables development of CUDA on the NVIDIA Tegra K1 system-on-a-chip which includes a quad-core 32-bit ARM CPU and an NVIDIA Kepler GPU. There is a lot of excitement about developing mobile and embedded parallel computing applications on Jetson TK1. And this week at the Hot Chips conference, we provided more details about our upcoming 64-bit Denver ARM CPU architecture.

CUDA 6.5 takes the next step, enabling CUDA on 64-bit ARM platforms. The heritage of ARM64 is in low-power, scale-out data centers and microservers, while GPUs are built for ultra-fast compute performance. When we combine the two, we have a compelling solution for HPC. ARM64 provides power efficiency, system configurability, and a large, open ecosystem. GPUs bring to the table high-throughput, power-efficient compute performance, a large HPC ecosystem, and hundreds of CUDA-accelerated applications. For HPC applications, ARM64 CPUs can offload the heavy lifting of computational tasks to GPUs. CUDA and GPUs make ARM64 competitive in HPC from day one.

Development platforms available now for CUDA on ARM64 include the Cirrascale RM1905D HPC Development Platform and the E4 ARKA EK003Eurotech has announced a system available later this year. These platforms are built on Applied Micro X-Gene 8-core 2.4GHz ARM64 CPUs, Tesla K20 GPU Accelerators, and CUDA 6.5. As Figure 1 shows, performance of CUDA-accelerated applications on ARM64+GPU systems is competitive with x86+GPU systems.

Figure 1: CUDA-Accelerated applications provide high performance on ARM64+GPU systems.

Continue reading


Unified Memory: Now for CUDA Fortran Programmers

Unified Memory is a CUDA feature that we’ve talked a lot about on Parallel Forall. CUDA 6 introduced Unified Memory, which dramatically simplifies GPU programming by giving programmers a single pointer to data which is accessible from either the GPU or the CPU. But this enhanced memory model has only been available to CUDA C/C++ programmers, until now. The new PGI Compiler release 14.7 enables Unified Memory in CUDA Fortran.

In a PGInsider article called CUDA Fortran Managed Memory, PGI Applications and Services Manager Brent Leback writes “using managed memory simplifies many coding tasks, makes source code cleaner, and enables a unified view of complicated data structures across host and device memories.” PGI 14.7 adds the managed keyword to the language, which you can use in host code similarly to the device keyword. Here’s part of an example Brent included in his article, showing the allocation of managed arrays. Continue reading


Accelerate R Applications with CUDA

R is a free software environment for statistical computing and graphics that provides a programming language and built-in libraries of mathematics operations for statistics, data analysis, machine learning and much more. Many domain experts and researchers use the R platform and contribute R software, resulting in a large ecosystem of free software packages available through CRAN (the Comprehensive R Archive Network).

However, R, like many other high-level languages, is not performance competitive out of the box with lower-level languages like C++, especially for highly data- and computation-intensive applications. R programs tend to process large amounts of data, and often have significant independent data and task parallelism. Therefore, R applications stand to benefit from GPU acceleration. This way, R users can benefit from R’s high-level, user-friendly interface while achieving high performance.

In this article, I will introduce the computation model of R with GPU acceleration, focusing on three topics:

  • accelerating R computations using CUDA libraries;
  • calling your own parallel algorithms written in CUDA C/C++ or CUDA Fortran from R; and
  • profiling GPU-accelerated R applications using the CUDA Profiler.

The GPU-Accelerated R Software Stack

Figure 1 shows that there are two ways to apply the computational power of GPUs in R:

  1. use R GPU packages from CRAN; or
  2. access the GPU through CUDA libraries and/or CUDA-accelerated programming languages, including C, C++ and Fortran.
modelFigure 1: The R + GPU software stack.

The first approach is to use existing GPU-accelerated R packages listed under High-Performance and Parallel Computing with R on the CRAN site. Examples include gputools and cudaBayesreg. These packages are very easy to install and use. On the other hand, the number of GPU packages is currently limited, quality varies, and only a few domains are covered. This will improve with time.

The second approach is to use the GPU through CUDA directly. Continue reading


Calling CUDA-accelerated Libraries from MATLAB: A Computer Vision Example

In an earlier post we showed how MATLAB® can support CUDA kernel prototyping and development by providing an environment for quick evaluation and visualization using the CUDAKernel object. In this post I will show you how to integrate an existing library of both host and device code implemented in C++ or another CUDA-accelerated language using MEX. With MEX you can extend and customize MATLAB, or use MATLAB as a test environment for your production code.

The MATLAB MEX compiler allows you to expose your libraries to the MATLAB environment as functions. You write your entry point in C, C++ or Fortran as a modified main() function which MATLAB invokes. MEX provides a framework for compiling this code, as well as an API for interacting with MATLAB and MATLAB data in your source code.

MATLAB’s Parallel Computing Toolbox™ provides constructs for compiling CUDA C and C++ with nvcc, and new APIs for accessing and using the gpuArray datatype which represents data stored on the GPU as a numeric array in the MATLAB workspace.

Feature Detection Example

Figure 1: Color composite of frames from a video feature tracking example. (Frame A = red, frame B = cyan)
Figure 1: Color composite of frames from a video feature tracking example. (Frame A = red, frame B = cyan)

I am going to use a feature detection example from MATLAB’s documentation for Computer Vision System Toolbox™. This uses tracked features to remove camera shake from an in-car road video. You will need MATLAB®, Parallel Computing Toolbox™, Image Processing Toolbox™ and Computer Vision System Toolbox™ to run the code. You can request a trial of these products at This example also depends on the OpenCV Computer Vision library, compiled with CUDA support.

Features are an essential prerequisite for many Computer Vision tasks; in this case, for instance, they might also be used to determine the motion of the car or to track other cars on the road.

To set up the example environment, I am using the following MATLAB code to load the video and display the first two frames superimposed (Figure 1). Continue reading


Accelerating Graph Betweenness Centrality with CUDA

Graph analysis is a fundamental tool for domains as diverse as social networks, computational biology, and machine learning. Real-world applications of graph algorithms involve tremendously large networks that cannot be inspected manually. Betweenness Centrality (BC) is a popular analytic that determines vertex influence in a graph. It has many practical use cases, including finding the best locations for stores within cities, power grid contingency analysis, and community detection. Unfortunately, the fastest known algorithm for computing betweenness centrality has O(mn) time complexity for graphs with n vertices and m edges, making the analysis of large networks challenging.

This post describes how we used CUDA and NVIDIA GPUs to accelerate the BC computation, and how choosing efficient parallelization strategies results in an average speedup of 2.7x, and more than 10x speedup for road networks and meshes versus a naïve edge-parallel strategy.

Example Betweenness Centrality scores for a small graph
Fig. 1. Example Betweenness Centrality scores for a small graph

Betweenness Centrality determines the importance of vertices in a network by measuring the ratio of shortest paths passing through a particular vertex to the total number of shortest paths between all pairs of vertices. Intuitively, this ratio determines how well a vertex connects pairs of vertices in the network. Formally, the Betweenness Centrality of a vertex v is defined as:

BC(v) = \sum_{s \neq t \neq v} \frac{\sigma_{st}(v)}{\sigma_{st}}

where \sigma_{st} is the number of shortest paths between vertices s and t and \sigma_{st}(v) is the number of those shortest paths that pass through v. Consider Figure 1 above. Vertex 4 is the only vertex that lies on paths from its left (vertices 5 through 9) to its right (vertices 1 through 3). Hence vertex 4 lies on all the shortest paths between these pairs of vertices and has a high BC score. In contrast, vertex 9 does not belong on a path between any pair of the remaining vertices and thus it has a BC score of 0. Continue reading

Figure 4: MMTI and trainable HoG pedestrian/vehicle detectors extract dynamic obstacles from HD video at runtime

Low-Power Sensing and Autonomy With NVIDIA Jetson TK1

Figure 1: simple TK1 block diagram
Figure 1: simple TK1 block diagram

NVIDIA’s Tegra K1 (TK1) is the first ARM system-on-chip (SoC) with integrated CUDA.  With 192 Kepler GPU cores and four ARM Cortex-A15 cores delivering a total of 327 GFLOPS of compute performance, TK1 has the capacity to process lots of data with CUDA while typically drawing less than 6W of power (including the SoC and DRAM).  This brings game-changing performance to low-SWaP (Size, Weight and Power) and small form factor (SFF) applications in the sub-10W domain, all the while supporting a developer-friendly Ubuntu Linux software environment delivering an experience more like that of a desktop rather than an embedded SoC.

Tegra K1 is plug-and-play and can stream high-bandwidth peripherals, sensors, and network interfaces via built-in USB 3.0 and PCIe gen2 x4/x1 ports.  TK1 is geared for sensor processing and offers additional hardware-accelerated functionality asynchronous to CUDA, like H.264 encoding and decoding engines and dual MIPI CSI-2 camera interfaces and image service processors (ISP).  There are many exciting embedded applications for TK1 which leverage its natural ability as a media processor and low-power platform for quickly integrating devices and sensors.

As GPU acceleration is particularly well-suited for data-parallel tasks like imaging, signal processing, autonomy and machine learning, Tegra K1 extends these capabilities into the sub-10W domain.  Code portability is now maintained from NVIDIA’s high-end Tesla HPC accelerators and the GeForce and Quadro discrete GPUs, all the way down through the low-power TK1.   A full build of the CUDA 6 toolkit is available for TK1, including samples, math libraries such as cuFFT, cuBLAS, and NPP, and NVIDIA’s NVCC compiler.  Developers can compile CUDA code natively on TK1 or cross-compile from a Linux development machine.  Availability of the CUDA libraries and development tools ensures seamless and effortless scalability between deploying CUDA applications on discrete GPUs and on Tegra.  There’s also OpenCV4Tegra available as well as NVIDIA’s VisionWorks toolkit.  Additionally the Ubuntu 14.04 repository is rich in pre-built packages for the ARM architecture, minimizing time spent tracking down and building dependencies.  In many instances applications can be simply recompiled for ARM with little modification, as long as source is available and doesn’t explicitly call out x86-specific instructions like SSE, AVX, or x86-ASM. NEON is ARM’s version of SIMD extensions for Cortex-A series CPUs.
Continue reading


A CUDA Dynamic Parallelism Case Study: PANDA

This post concludes an introductory series on CUDA Dynamic Parallelism. In my first post, I introduced Dynamic Parallelism by using it to compute images of the Mandelbrot set using recursive subdivision, resulting in large increases in performance and efficiency. The second post is an in-depth tutorial on the ins and outs of programming with Dynamic Parallelism, including synchronization, streams, memory consistency, and limits. In this post, I finish the series with a case study on an online track reconstruction algorithm for the high-energy physics PANDA experiment part of the (Facility for Antiproton and Ion Research in Europe (FAIR)). The PANDA work was carried out in the scope of the NVIDIA Application Lab at Jülich.

The PANDA Experiment

PANDA (= anti-Proton ANnihilation at DArmstadt) is a state-of-the-art hadron particle physics experiment currently under construction at FAIR (Facility for Anti-proton and Ion Research) at Darmstadt. It is scheduled to start operation in 2019.

Figure 1: The PANDA experiment and its detectors. Image courtesy of PANDA Collaboration.

Inside the PANDA experiment, accelerated antiprotons will collide with protons, forming intermediate and unstable particles (mesons, baryons etc.), which will decay in cascades into stable particles, like electrons and photons. The unstable particles are of particular interest for PANDA, as they give insight into the processes governing this physics regime (QCD). Reconstructing all involved constituent particles of an event lets the physicists form a picture of the process, eventually confirming established physics theories, probing new ones and potentially finding exciting and unexpected results.
Continue reading