Deep Learning in a Nutshell: Sequence Learning

This series of blog posts aims to provide an intuitive and gentle introduction to deep learning that does not rely heavily on math or theoretical constructs.

Be sure to check out the other Deep Learning in a Nutshell posts: Part 1, part 2.
The first part of this series provided an overview of the field of deep learning, covering fundamental and core concepts. The second part of the series provided an overview of training neural networks efficiently and gave a background on the history of the field. In this  post, we’ll look at sequence learning with a focus on natural language processing.

Figure 2: A Long Short-Term Memory (LSTM) unit. The LSTM unit has four input weights (from the data to the input and three gates) and four recurrent weights (from the output to the input and the three gates). Peepholes are extra connections between the memory cell and the gates, but they do not increase the performance by much and are often omitted for simplicity.
Figure 1: A Long Short-Term Memory (LSTM) unit. The LSTM unit has four input weights (from the data to the input and three gates) and four recurrent weights (from the output to the input and the three gates). Peepholes are extra connections between the memory cell and the gates, but they do not increase the performance by much and are often omitted for simplicity. Image by Klaus Greff and colleagues as published in LSTM: A Search Space Odyssey. Image by Klaus Greff and colleagues as published in LSTM: A Search Space Odyssey.

Sequence Learning

Everything in life depends on time and therefore, represents a sequence. To perform machine learning with sequential data (text, speech, video, etc.) we could use a regular neural network and feed it the entire sequence, but the input size of our data would be fixed, which is quite limiting. Other problems with this approach occur if important events in a sequence lie just outside of the input window. What we need is (1) a network to which we can feed sequences of arbitrary length one element of the sequence per time step (for example a video is just a sequence of images; we feed the network one image at a time); and (2) a network which has some kind of memory to remember important events which happened many time steps in the past. These problems and requirements have led to a variety of different recurrent neural networks. Continue reading


CUDA Pro Tip: Always Set the Current Device to Avoid Multithreading Bugs

We often say that to reach high performance on GPUs you should expose as much parallelism in your code as possible, and we don’t mean just parallelism within one GPU, but also across multiple GPUs and CPUs. It’s common for high-performance software to parallelize across multiple GPUs by assigning one or more CPU threads to each GPU. In this post I’ll cover a common but subtle bug and a simple rule that will help you avoid it within your own software (spoiler alert: it’s in the title!).

Let’s review how to select which GPU to execute CUDA calls on. The CUDA runtime API is state-based, and threads execute cudaSetDevice() to set the current GPU.

cudaError_t cudaSetDevice(int device)

After this call all CUDA API commands go to the current set device until cudaSetDevice() is called again with a different device ID. The CUDA runtime API is thread-safe, which means it maintains per-thread state about the current device. This is very important as it allows threads to concurrently submit work to different devices, but forgetting to set the current device in each thread can lead to subtle and hard-to-find bugs like the following example.


#pragma omp parallel

While at first glance this code may seem bug free, it is incorrect. Continue reading


Remote application development using NVIDIA® Nsight™ Eclipse Edition

NVIDIA® Nsight™ Eclipse Edition (NSEE) is a full-featured unified CPU+GPU integrated development environment(IDE) that lets you easily develop CUDA applications for either your local (x86_64) system or a remote (x86_64 or ARM) target system. In my last post on remote development of CUDA applications, I covered NSEE’s cross compilation mode. In this post I will focus on the using NSEE’s synchronized project mode.

For remote development of CUDA applications using synchronized-project mode, you can edit code on the host system and synchronize it with the target system. In this scenario, the code is compiled natively on the target system as Figure 1 shows.

CUDA application development usage scenarios with Nsight Eclipse Edition
Figure 1: CUDA application development usage scenarios with Nsight Eclipse Edition

In synchronized project mode the host system does not need an ARM cross-compilation tool chain, so you have the flexibility to use Mac OS X or any of the CUDA supported x86_64 Linux platforms as the host system. The remote target system can be a CUDA-supported x86_64 Linux target or an ARM-based platform like the Jetson TK1 system. I am using Mac OS X 10.8.5 on my host system (with Xcode 5.1.1 installed) and 64-bit Ubuntu 12.04 on my target system. Continue reading


NVIDIA Nsight Eclipse Edition for Jetson TK1

NVIDIA® Nsight™ Eclipse Edition is a full-featured, integrated development environment that lets you easily develop CUDA® applications for either your local (x86) system or a remote (x86 or ARM) target. In this post, I will walk you through the process of remote-developing CUDA applications for the NVIDIA Jetson TK1, an ARM-based development kit.

Nsight supports two remote development modes: cross-compilation and “synchronize projects” mode. Cross-compiling for ARM on your x86 host system requires that all of the ARM libraries with which you will link your application be present on your host system. In synchronize-projects mode, on the other hand, your source code is synchronized between host and target systems and compiled and linked directly on the remote target, which has the advantage that all your libraries get resolved on the target system and need not be present on the host. Neither of these remote development modes requires an NVIDIA GPU to be present in your host system.

Note: CUDA cross-compilation tools for ARM are available only in the Ubuntu 12.04 DEB package of the CUDA 6 Toolkit.  If your host system is running a Linux distribution other than Ubuntu 12.04, I recommend the synchronize-projects remote development mode, which I will cover in detail in a later blog post.

CUDA toolkit setup

The first step involved in cross-compilation is installing the CUDA 6 Toolkit on your host system. To get started, let’s download the required Ubuntu 12.04 DEB package from the CUDA download page. Installation instructions can be found in the Getting Started Guide for Linux, but I will summarize them below for CUDA 6.
Continue reading


CUDACasts Episode 14: Racecheck Analysis with CUDA 5.5

The key to the power of GPUs is their 1000’s of parallel processors that execute threads. Anyone who has worked with even a handful of threads know how easy it can be to introduce race conditions, and how difficult it  can be to debug and fix these errors. Because a modern GPU can have thousands of simultaneously executing threads, NVIDIA engineers felt it was imperative to create an incredibly powerful tool for detecting and debugging race conditions.

This racecheck tool comes as part of the cuda-memcheck command-line utility. In CUDA 5.5 a new racecheck analysis mode presents much more human-readable analysis of your code, even reporting which source lines conflict with other lines. In this episode of CUDACasts we use a simple version of Conway’s Game of Life to show the new racecheck features cuda-memcheck. We’ll start with a few race condition bugs, and then use the analysis tool to find and fix them.

Continue reading


CUDACasts Episode #9: Explore GPU device memory with Nsight Eclipse Edition

Visual tools offer a very efficient method for developing and debugging applications. When working on massively parallel codes built on the CUDA Platform, this visual approach is even more important because you could be dealing with tens of thousands of parallel threads.

With the free NVIDIA Nsight Eclipse Edition IDE, you can quickly and easily examine the GPU memory state in a running CUDA C or C++ application. In today’s CUDACast, we continue our CUDA 5.5 series with a look at this new feature available to Eclipse users.

In the next few weeks, we’ll take a break from the CUDA 5.5 new feature series and explore some other topics, such as writing CUDA applications in pure Python. Stay tuned!

Continue reading


CUDACasts Episode #4: Single-GPU Debugging with CUDA 5.5

Even if you’ve already watched CUDACasts episode 3 on creating your first OpenACC program, you’ll want to go watch the new version which includes a clearer, animated introduction. So check it out!

In the next few CUDACasts we’ll be exploring some of the new features available in CUDA 5.5, which is available as a release candidate now and will be officially released very soon. Episode 4 kicks it off by demonstrating single-GPU debugging using Nsight Eclipse Edition on Linux. With this feature, it is now possible to debug a CUDA application on the same NVIDIA GPU that is driving your active display. In fact, you can debug multiple CUDA applications even while others are actively running.

Continue reading