
Accelerate Machine Learning with the cuDNN Deep Neural Network Library

Machine Learning (ML) has its origins in the field of Artificial Intelligence, which started out decades ago with the lofty goal of creating a computer that could do any work a human can do. While attaining that goal still appears to be in the distant future, many useful tools have been developed and successfully applied to a wide variety of problems. In fact, ML has now become a pervasive technology, underlying many modern applications. Today the world’s largest financial companies, internet firms and foremost research institutions use ML in applications including internet search, fraud detection, gaming, face detection, image tagging, brain mapping, check processing and computer server health monitoring, to name a few. The US Postal Service uses machine learning techniques for handwriting recognition, and leading applied-research government agencies such as IARPA and DARPA are funding work to develop the next generation of ML systems.

Figure 1: Schematic representation of a deep neural network, showing how more complex features are captured in deeper layers.

There is a wide variety of algorithms and processes for implementing ML systems. The hottest area in ML today, however, is Deep Neural Networks (DNNs). The success of DNNs has been greatly accelerated by using GPUs, which have become the platform of choice for training large, complex DNN-based ML systems. Pioneers in this area include luminaries like Geoffrey Hinton, Yann LeCun, Yoshua Bengio, and Andrew Ng. Their success over the past 30 years has inspired a groundswell of research and development in academia, including universities such as Carnegie Mellon, NYU, Oxford, Stanford, the University of California at Berkeley, the University of Montreal, and the University of Toronto. More recently, many commercial enterprises have also started investing aggressively in this technology. A few that have publicly acknowledged using GPUs with deep learning include Adobe, Baidu, Nuance, and Yandex.

Because of the increasing importance of DNNs in both industry and academia and the key role of GPUs, NVIDIA is introducing a library of primitives for deep neural networks called cuDNN.  The cuDNN library makes it easy to obtain state-of-the-art performance with DNNs, and provides other important benefits.

Machine Learning with DNNs

An ML system may be thought of as a system that learns to recognize things of interest to us, without being told explicitly what those things are ahead of time. Classic examples of such systems are the spam classifier, which scans your incoming messages and quarantines spam emails, and product recommender systems, which suggest new products (books, movies, etc.) that you might like based on your prior purchases and ratings. Continue reading


CUDA Pro Tip: Always Set the Current Device to Avoid Multithreading Bugs

We often say that to reach high performance on GPUs you should expose as much parallelism in your code as possible, and we don’t mean just parallelism within one GPU, but also across multiple GPUs and CPUs. It’s common for high-performance software to parallelize across multiple GPUs by assigning one or more CPU threads to each GPU. In this post I’ll cover a common but subtle bug and a simple rule that will help you avoid it within your own software (spoiler alert: it’s in the title!).

Let’s review how to select which GPU to execute CUDA calls on. The CUDA runtime API is state-based, and threads execute cudaSetDevice() to set the current GPU.

cudaError_t cudaSetDevice(int device)

After this call, all CUDA API commands go to the currently set device until cudaSetDevice() is called again with a different device ID. The CUDA runtime API is thread-safe, which means it maintains per-thread state about the current device. This is very important because it allows threads to submit work to different devices concurrently, but forgetting to set the current device in each thread can lead to subtle and hard-to-find bugs like the following example.

cudaSetDevice(1);
cudaMalloc(&a,bytes);

#pragma omp parallel
{
  kernel<<<blocks,threads>>>(a);
}

While at first glance this code may seem bug free, it is incorrect. The problem is that we set device 1 current on the OpenMP master thread, but then use OpenMP to spawn more threads, which use the default device (device 0) because they never call cudaSetDevice(). This code would actually launch multiple kernels that run on device 0 but access memory allocated on device 1. The result is either invalid memory access errors or, if peer-to-peer access is enabled, performance limited by the low PCIe bandwidth to the array a.

Here is a correct implementation of the code, where every thread sets the correct device.

cudaSetDevice(1);
cudaMalloc(&a,bytes);

#pragma omp parallel
{
  cudaSetDevice(1);
  kernel<<<blocks,threads>>>(a);
}

If it’s not obvious from the title of this post, there’s a simple rule to follow to avoid bugs like this…

Always Set the Device in New Host Threads

Make it a habit to call cudaSetDevice() wherever your code could potentially spawn new host threads. The following example has a potential bug depending on whether the OpenMP library chooses to spawn new threads or reuse old ones.

cudaSetDevice(1);
cudaMalloc(&a,bytes);

#pragma omp parallel
{
  cudaSetDevice(1);
  kernel<<<blocks,threads>>>(a);
}

#pragma omp parallel
{
    kernel<<<blocks,threads>>>(a);
}

In this example, threads in the second omp parallel region don’t set the current device, so there is no guarantee that it is set for each thread. This problem is not restricted to OpenMP; it can easily happen with any threading library, and in any CUDA-accelerated language.

To save yourself from a variety of multithreading bugs, remember: always call cudaSetDevice() first when you spawn a new host thread.
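In multi-GPU codes this rule generalizes naturally: have each host thread select its own device as its first CUDA call. Here is a minimal sketch, assuming one OpenMP thread per GPU and per-device arrays allocated beforehand (the kernel and names are illustrative):

#include <omp.h>
#include <cuda_runtime.h>

__global__ void kernel(float *a, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) a[i] += 1.0f;
}

void launch_on_all_gpus(float **d_arrays, int n)
{
  int num_gpus = 0;
  cudaGetDeviceCount(&num_gpus);

  // One host thread per GPU; d_arrays[dev] is assumed to have been
  // allocated earlier on device dev (after a matching cudaSetDevice(dev)).
  #pragma omp parallel num_threads(num_gpus)
  {
    int dev = omp_get_thread_num();
    cudaSetDevice(dev);  // first CUDA call in each new host thread
    kernel<<<(n + 255) / 256, 256>>>(d_arrays[dev], n);
  }
}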


3 Versatile OpenACC Interoperability Techniques

OpenACC is a high-level programming model for accelerating applications with GPUs and other devices, using compiler directives to specify loops and regions of code in standard C, C++ and Fortran to offload from a host CPU to an attached accelerator. OpenACC simplifies accelerating applications with GPUs. An often-overlooked feature of OpenACC is its ability to interoperate with the broader parallel programming ecosystem. In this post I’ll teach you 3 powerful interoperability techniques for combining OpenACC and CUDA: the host_data construct, the deviceptr clause, and the acc_map_data() API function.
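For instance, here is a minimal sketch of the host_data construct handing OpenACC-managed device arrays to cuBLAS; the routine name and sizes are illustrative, and the cuBLAS call is the standard single-precision SAXPY (y = alpha*x + y):

#include <cublas_v2.h>

void saxpy_acc(int n, float alpha, float *x, float *y)
{
  cublasHandle_t handle;
  cublasCreate(&handle);

  // OpenACC manages the device copies of x and y...
  #pragma acc data copyin(x[0:n]) copy(y[0:n])
  {
    // ...and host_data exposes their device addresses to the CUDA library call.
    #pragma acc host_data use_device(x, y)
    {
      cublasSaxpy(handle, n, &alpha, x, 1, y, 1);
    }
  }
  cublasDestroy(handle);
}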

I’ll demonstrate these techniques with several examples of mixing OpenACC with CUDA C++, CUDA Fortran, Thrust, and GPU-accelerated libraries. If you’d like to follow along at home, grab the source code for the examples from GitHub and try them out with your OpenACC compiler and the CUDA Toolkit. Don’t have an OpenACC compiler? You can download a free 30-day trial of the PGI accelerator compiler.

You may already be thinking to yourself, “If OpenACC is so great, why would I need to use it with CUDA?” OpenACC interoperability features open the door to the GPU-computing ecosystem, allowing you to leverage more than 10 years of code development. Need to multiply two matrices together? Don’t write your own function, just call the cuBLAS library, which has been heavily optimized for GPUs. Does your colleague already have a CUDA routine that you could use in your code? Use it! Interoperability means that you can always use the best tool for the job in any situation. Accelerate your application using OpenACC, but call an optimized library. Expand an existing CUDA application by adding OpenACC to unaccelerated routines. Your choice isn’t OpenACC or CUDA, it’s OpenACC and CUDA. Continue reading


CUDACasts Episode 20: Getting started with Jetson TK1 and OpenCV

The Jetson TK1 development kit has fast become a must-have for mobile and embedded parallel computing due to the amazing level of performance packed into such a low-power board. In this and the following CUDACast, you’ll learn how to get started building computer vision applications on your Jetson TK1 using CUDA and the OpenCV library.

CUDACasts are short how-to screencast videos about new features and techniques for GPU programming. Click here for all CUDACasts.

Continue reading


Remote application development using NVIDIA® Nsight™ Eclipse Edition

NVIDIA® Nsight™ Eclipse Edition (NSEE) is a full-featured, unified CPU+GPU integrated development environment (IDE) that lets you easily develop CUDA applications for either your local (x86_64) system or a remote (x86_64 or ARM) target system. In my last post on remote development of CUDA applications, I covered NSEE’s cross-compilation mode. In this post I will focus on using NSEE’s synchronized project mode.

For remote development of CUDA applications using synchronized-project mode, you can edit code on the host system and synchronize it with the target system. In this scenario, the code is compiled natively on the target system as Figure 1 shows.

Figure 1: CUDA application development usage scenarios with Nsight Eclipse Edition

In synchronized project mode the host system does not need an ARM cross-compilation tool chain, so you have the flexibility to use Mac OS X or any of the CUDA supported x86_64 Linux platforms as the host system. The remote target system can be a CUDA-supported x86_64 Linux target or an ARM-based platform like the Jetson TK1 system. I am using Mac OS X 10.8.5 on my host system (with Xcode 5.1.1 installed) and 64-bit Ubuntu 12.04 on my target system. Continue reading


CUDA Spotlight: Michela Taufer on GPU-Accelerated Scientific Computing

Our Spotlight is on Dr. Michela Taufer, Associate Professor at the University of Delaware.

Michela heads the Global Computing Lab (GCLab), which focuses on high performance computing (HPC) and its application to the sciences.

Her research interests include software applications and their advanced programmability in heterogeneous computing (i.e., multi-core platforms and GPUs); cloud computing and volunteer computing; and performance analysis, modeling and optimization of multi-scale applications.

The following is an excerpt from our interview (read the complete Spotlight here).
_____________________________________________________

NVIDIA: Michela, what is the mission of the Global Computing Lab at the University of Delaware?
Michela: We are engaged in the design and testing of efficient computational algorithms and adaptive scheduling policies for scientific computing on GPUs, the Cloud, and Volunteer Computing.

Interdisciplinary research with scientists and engineers in fields such as chemistry and chemical engineering, pharmaceutical sciences, seismology, and mathematics is at the core of our activities and philosophy.

NVIDIA: Tell us about your work with GPUs.
Michela: My team’s work is all about rethinking application algorithms to fit on the GPU architecture in order to get the most out of its computing power, while preserving the scientific accuracy of the simulations. This has resulted in many exciting achievements!

NVIDIA: Can you provide an example?
Michela: My group and I were the first to propose a completely-on-GPU PME (Particle Mesh Ewald) code for MD (molecular dynamics) simulations. We achieved that goal by changing the traditional way researchers algorithmically look at charges in long-range electrostatics and their interactions.

With our code empowered with the PME components, we could move the traditional scale for studying membranes like DMPC lipid bilayers from membranes on the order of 72 lipid molecules (17,004 atoms) to 16-times-larger membranes of 1,152 lipid molecules (273,936 atoms) in explicit solvent [see Figure 1].

Figure 1: Visual representations of the lipid-bilayer systems. The DMPC 1×1 system describes the small system of 72 lipid molecules (36 lipids/leaflet) traditionally used for simulations on high-end clusters. DMPC 2×2 and 4×4 describe systems with 288 and 1,152 lipid molecules, respectively, that we were able to study on a single GPU. Presented in Structural, Dynamic, and Electrostatic Properties of Fully Hydrated DMPC Bilayers from Molecular Dynamics Simulations Accelerated with GPUs.

Continue reading


10 Ways CUDA 6.5 Improves Performance and Productivity

Today we’re excited to announce the release of the CUDA Toolkit version 6.5. CUDA 6.5 adds a number of features and improvements to the CUDA platform, including support for CUDA Fortran in developer tools, user-defined callback functions in cuFFT, new occupancy calculator APIs, and more.
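As a small taste of these additions, the new occupancy API can suggest a launch configuration that maximizes occupancy for a given kernel. Here is a minimal sketch (the kernel and sizes are illustrative):

#include <cuda_runtime.h>

__global__ void my_kernel(float *data, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= 2.0f;
}

void launch(float *d_data, int n)
{
  int minGridSize = 0, blockSize = 0;
  // Ask the runtime for a block size that maximizes occupancy for my_kernel.
  cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, my_kernel, 0, 0);

  int gridSize = (n + blockSize - 1) / blockSize;
  my_kernel<<<gridSize, blockSize>>>(d_data, n);
}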

CUDA on ARM64

Last year we introduced CUDA on ARM, and in March we released the Jetson TK1 developer board, which enables CUDA development on the NVIDIA Tegra K1 system-on-a-chip, combining a quad-core 32-bit ARM CPU and an NVIDIA Kepler GPU. There is a lot of excitement about developing mobile and embedded parallel computing applications on Jetson TK1. And this week at the Hot Chips conference, we provided more details about our upcoming 64-bit Denver ARM CPU architecture.

CUDA 6.5 takes the next step, enabling CUDA on 64-bit ARM platforms. The heritage of ARM64 is in low-power, scale-out data centers and microservers, while GPUs are built for ultra-fast compute performance. When we combine the two, we have a compelling solution for HPC. ARM64 provides power efficiency, system configurability, and a large, open ecosystem. GPUs bring to the table high-throughput, power-efficient compute performance, a large HPC ecosystem, and hundreds of CUDA-accelerated applications. For HPC applications, ARM64 CPUs can offload the heavy lifting of computational tasks to GPUs. CUDA and GPUs make ARM64 competitive in HPC from day one.

Development platforms available now for CUDA on ARM64 include the Cirrascale RM1905D HPC Development Platform and the E4 ARKA EK003; Eurotech has announced a system available later this year. These platforms are built on Applied Micro X-Gene 8-core 2.4GHz ARM64 CPUs, Tesla K20 GPU Accelerators, and CUDA 6.5. As Figure 1 shows, performance of CUDA-accelerated applications on ARM64+GPU systems is competitive with x86+GPU systems.

Figure 1: CUDA-Accelerated applications provide high performance on ARM64+GPU systems.

Continue reading


CUDA Pro Tip: Optimize for Pointer Aliasing

Often cited as the main reason that naïve C/C++ code cannot match FORTRAN performance, pointer aliasing is an important topic to understand when considering optimizations for your C/C++ code. In this tip I will describe what pointer aliasing is and a simple way to alter your code so that it does not harm your application performance.

What is pointer aliasing?

Two pointers alias if the memory to which they point overlaps. When a compiler can’t determine whether pointers alias, it has to assume that they do. The following simple function shows why this is potentially harmful to performance:

void example1(float *a, float *b, float *c, int i) {
  a[i] = a[i] + c[i];
  b[i] = b[i] + c[i];
}

At first glance it might seem that this function needs to perform three load operations from memory: one for a[i], one for b[i] and one for c[i]. This is incorrect because it assumes that c[i] can be reused once it is loaded. Consider the case where a and c point to the same address. In this case the first line modifies the value c[i] when writing to a[i]. Therefore the compiler must generate code to reload c[i] on the second line, in case it has been modified.

Because the compiler must conservatively assume the pointers alias, it will compile the above code inefficiently, even if the programmer knows that the pointers never alias.

What can I do about aliasing?

Fortunately almost all C/C++ compilers offer a way for the programmer to give the compiler information about pointer aliasing. Continue reading
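One such mechanism is the restrict qualifier from C99, spelled __restrict__ in CUDA C/C++ and most C++ compilers. As a sketch, applying it to the earlier example promises the compiler that the pointers never overlap, so c[i] can be loaded once and reused:

void example1(float *__restrict__ a, float *__restrict__ b,
              const float *__restrict__ c, int i)
{
  a[i] = a[i] + c[i];
  b[i] = b[i] + c[i];
}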


Accelerate R Applications with CUDA

R is a free software environment for statistical computing and graphics that provides a programming language and built-in libraries of mathematics operations for statistics, data analysis, machine learning and much more. Many domain experts and researchers use the R platform and contribute R software, resulting in a large ecosystem of free software packages available through CRAN (the Comprehensive R Archive Network).

However, R, like many other high-level languages, is not performance competitive out of the box with lower-level languages like C++, especially for highly data- and computation-intensive applications. R programs tend to process large amounts of data, and often have significant independent data and task parallelism, so R applications stand to benefit from GPU acceleration. This way, R users can keep R’s high-level, user-friendly interface while achieving high performance.

In this article, I will introduce the computation model of R with GPU acceleration, focusing on three topics:

  • accelerating R computations using CUDA libraries;
  • calling your own parallel algorithms written in CUDA C/C++ or CUDA Fortran from R; and
  • profiling GPU-accelerated R applications using the CUDA Profiler.

The GPU-Accelerated R Software Stack

Figure 1 shows that there are two ways to apply the computational power of GPUs in R:

  1. use R GPU packages from CRAN; or
  2. access the GPU through CUDA libraries and/or CUDA-accelerated programming languages, including C, C++ and Fortran.

Figure 1: The R + GPU software stack.

The first approach is to use existing GPU-accelerated R packages listed under High-Performance and Parallel Computing with R on the CRAN site. Examples include gputools and cudaBayesreg. These packages are very easy to install and use. On the other hand, the number of GPU packages is currently limited, quality varies, and only a few domains are covered. This will improve with time.

The second approach is to use the GPU through CUDA directly. Continue reading
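As an illustration of the second approach, here is a minimal sketch of the CUDA C side of a routine that R can call through .C(); the function name, kernel, and build and loading steps are illustrative assumptions, not part of any package:

#include <cuda_runtime.h>

__global__ void vec_scale(double *x, double s, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] *= s;
}

// .C() passes R numeric vectors as double* and integers as int*,
// so the wrapper copies data to the GPU, runs the kernel, and copies back.
extern "C" void gpu_scale(double *x, double *s, int *n)
{
  double *d_x;
  size_t bytes = *n * sizeof(double);
  cudaMalloc(&d_x, bytes);
  cudaMemcpy(d_x, x, bytes, cudaMemcpyHostToDevice);

  vec_scale<<<(*n + 255) / 256, 256>>>(d_x, *s, *n);

  cudaMemcpy(x, d_x, bytes, cudaMemcpyDeviceToHost);
  cudaFree(d_x);
}

Compile this to a shared library (for example with nvcc -shared -Xcompiler -fPIC), load it from R with dyn.load(), and invoke it with .C("gpu_scale", ...).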


Calling CUDA-accelerated Libraries from MATLAB: A Computer Vision Example

In an earlier post we showed how MATLAB® can support CUDA kernel prototyping and development by providing an environment for quick evaluation and visualization using the CUDAKernel object. In this post I will show you how to integrate an existing library of both host and device code implemented in C++ or another CUDA-accelerated language using MEX. With MEX you can extend and customize MATLAB, or use MATLAB as a test environment for your production code.

The MATLAB MEX compiler allows you to expose your libraries to the MATLAB environment as functions. You write your entry point in C, C++ or Fortran as a modified main() function which MATLAB invokes. MEX provides a framework for compiling this code, as well as an API for interacting with MATLAB and MATLAB data in your source code.
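To make that concrete, here is a minimal sketch of such a gateway that hands a MATLAB array to a CUDA-accelerated host routine; process_on_gpu is an assumed name for a function in your own library, not a MATLAB or CUDA API:

#include "mex.h"

// Assumed routine from your CUDA library: copies data to the GPU,
// runs kernels, and copies the result back into out.
void process_on_gpu(const double *in, double *out, size_t n);

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
  if (nrhs != 1 || !mxIsDouble(prhs[0]))
    mexErrMsgIdAndTxt("example:input", "Expected one double array.");

  size_t n = mxGetNumberOfElements(prhs[0]);
  plhs[0] = mxCreateDoubleMatrix(mxGetM(prhs[0]), mxGetN(prhs[0]), mxREAL);

  process_on_gpu(mxGetPr(prhs[0]), mxGetPr(plhs[0]), n);
}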

MATLAB’s Parallel Computing Toolbox™ provides constructs for compiling CUDA C and C++ with nvcc, and new APIs for accessing and using the gpuArray datatype which represents data stored on the GPU as a numeric array in the MATLAB workspace.

Feature Detection Example

Figure 1: Color composite of frames from a video feature tracking example. (Frame A = red, frame B = cyan)

I am going to use a feature detection example from MATLAB’s documentation for Computer Vision System Toolbox™. This uses tracked features to remove camera shake from an in-car road video. You will need MATLAB®, Parallel Computing Toolbox™, Image Processing Toolbox™ and Computer Vision System Toolbox™ to run the code. You can request a trial of these products at www.mathworks.com/trial. This example also depends on the OpenCV Computer Vision library, compiled with CUDA support.

Features are an essential prerequisite for many Computer Vision tasks; in this case, for instance, they might also be used to determine the motion of the car or to track other cars on the road.

To set up the example environment, I am using the following MATLAB code to load the video and display the first two frames superimposed (Figure 1). Continue reading