Analysis of statistical algorithms can generate workloads that run for hours, if not days, tying up a single computer. Many statisticians and data scientists write complex simulations and statistical analyses using the R statistical computing environment, and these programs often have very long run times. Given the amount of time R programmers can spend waiting for results, it makes sense to take advantage of parallelism in the computation and the available hardware.

In a previous post on the Teraproc blog, I discussed the value of parallelism for long-running R models, and showed how multi-core and multi-node parallelism can reduce run times. In this post I’ll examine another way to leverage parallelism in R: harnessing the processing cores in a general-purpose graphics processing unit (GPU) to dramatically accelerate commonly used clustering algorithms in R. The most widely used GPUs for GPU computing are the NVIDIA Tesla series. A Tesla K40 GPU has 2,880 integrated cores and 12 GB of memory with 288 GB/sec of bandwidth, delivering up to 5 trillion single-precision floating-point operations per second.

The examples in this post build on the excellent work of Mr. Chi Yau available at r-tutor.com. Chi is the author of the open-source CRAN package `rpud` as well as `rpudplus`, R libraries that make it easy for developers to harness the power of GPUs without programming directly in CUDA C++. To learn more about R and parallel programming with GPUs you can download Chi’s e-book. For illustration purposes, I’ll focus on an example involving distance calculations and hierarchical clustering, but you can use the `rpud` package to accelerate a variety of applications.

Cluster analysis, or clustering, is the process of grouping objects so that objects in the same cluster are more similar (by a given metric) to each other than to objects in other clusters. Cluster analysis is a problem with significant parallelism. In a post on the Teraproc blog we showed an example that involved clustering analysis using *k*-means. In this post we’ll look at hierarchical clustering in R with `hclust`, a function that makes it simple to create a dendrogram (a tree diagram as in Figure 1) based on differences between observations. This type of analysis is useful in all kinds of applications, from taxonomy to cancer research to time-series analysis of financial data.

Similar to our *k*-means example, grouping observations in a hierarchical fashion depends on being able to quantify the differences (or distances) between observations. This means calculating the Euclidean distance between pairs of observations (think of this as the Pythagorean Theorem extended to more dimensions). Chi Yau explains this in his two posts Distance Matrix by GPU and Hierarchical Cluster Analysis, so we won’t attempt to cover all the details here.

R’s `hclust` function accepts a matrix of previously computed distances between observations. The `dist` function in R computes the distances between rows in a dataset, supporting multiple methods including Euclidean distance (the default). If I have a set of *M* observations (rows), each with *N* attributes (columns), each distance calculation computes the length of a vector in *N*-dimensional space between a pair of observations. There are *M*(*M*−1)/2 discrete distance calculations between all pairs of rows. Thus, computation scales as the square of the number of observations: for 10 observations I need 45 distance calculations, for 100 observations I need 4,950, and for 100,000 observations I need 4,999,950,000 (almost 5 billion) distance calculations. As you can see, `dist` can get expensive for large datasets.
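To see where the quadratic cost comes from, here is a small Python/NumPy sketch (illustrative only; the post’s actual code is R, and these helper names are invented) of what `dist` computes and how the number of pairwise distances grows:

```python
import numpy as np

def num_pairs(num_rows):
    # M observations require M * (M - 1) / 2 pairwise distance calculations
    return num_rows * (num_rows - 1) // 2

def pairwise_distances(m):
    """All pairwise Euclidean distances between rows of the matrix m,
    the same quantity R's dist() computes (lower triangle, row by row)."""
    out = []
    for i in range(m.shape[0]):
        for j in range(i + 1, m.shape[0]):
            out.append(np.sqrt(np.sum((m[i] - m[j]) ** 2)))
    return np.array(out)

# The counts quoted in the text:
assert num_pairs(10) == 45
assert num_pairs(100) == 4950
assert num_pairs(100000) == 4999950000
```

Each distance is independent of the others, which is exactly why the computation maps so well onto thousands of GPU cores.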

Before I can start running applications, first I need access to a system with a GPU. Fortunately, for a modest price I can rent a machine with GPUs for a couple of hours. In preparing this blog I used two hours of machine time on the Teraproc R Analytics Cluster-as-a-Service. The service leverages Amazon EC2, and the total cost for machine time was $1.30; quite a bit cheaper than setting up my own machine! The reason I was able to use so little time is that the process of installing the cluster is fully automated by the Teraproc service. Teraproc’s R Cluster-as-a-Service provides CUDA, R, R Studio and other required software components pre-installed and ready to use. OpenLava and NFS are also configured automatically, giving me the option to extend the cluster across many GPU-capable compute nodes and optionally use Amazon spot pricing to cut costs.

I deployed a one-node cluster on Teraproc.com using the Amazon g2.2xlarge machine type as shown below. I could have launched a g2.2xlarge instance myself from the Amazon EC2 console, but then I would have needed to install R and R Studio and configure the environment myself, spending more time and money in the process. You can learn how to set up an R cluster yourself on different node types (including free machines) at the Teraproc R Analytics Cluster-as-a-Service website. If you already have an Amazon EC2 account you can set up a cluster in as little as five minutes.

The g2.2xlarge machine instance is a Sandy Bridge based machine with 8 cores / vCPUs on a Xeon E5-2670 processor, 15 GB of memory, a solid-state disk drive and an NVIDIA GRID K520 GPU. The on-demand price for this machine is $0.65 per hour. The NVIDIA GRID K520 has two GK104 graphics processors, each with 1,536 cores, on a single card with 8 GB of RAM.

First we use the teraproc.com R-as-a-cluster service to provision the R environment, making sure that we select the correct machine type (g2.2xlarge) and install a one-node cluster, as Figure 1 shows. This automatically deploys a single-node cluster complete with R Studio and provides us with a URL to access the R Studio environment.

Using the shell function within R Studio (under the Tools menu), I can run an operating system command to make sure that the GPU is present on the machine.

```
gordsissons@ip-10-0-93-199:~$ lspci | grep -i nvidia
00:03.0 VGA compatible controller: NVIDIA Corporation GK104GL [GRID K520] (rev a1)
```

To use the `rpud` package to access GPU functions we need to install it first. Run the command below from inside R Studio to install `rpud` from CRAN.

```r
> install.packages("rpud")
```

Next, use the `library` command to access the `rpud` functions.

```
> library(rpud)
Rpudplus 0.5.0
http://www.r-tutor.com
Copyright (C) 2010-2015 Chi Yau. All Rights Reserved.
Rpudplus is free for academic use only. There is absolutely NO warranty.
```

If everything is working properly, we should be able to see the GPU on our Amazon instance from the R command prompt by calling `rpuGetDevice`.

```
> rpuGetDevice()
GRID K520 GPU
[1] 0
```

The following listing shows a sample R program that compares performance of a hierarchical clustering algorithm with and without GPU acceleration. The first step is to create a suitable dataset where we can control the number of observations (rows) as well as the number of dimensions (columns) for each observation. The `test.data` function returns a matrix of random values based on the number of rows and columns provided.

The `run_cpu` function calculates all the distances between the observations (rows) using R’s `dist` function, and then runs R’s native `hclust` function against the computed distances stored in `dcpu` to create a dendrogram. The `run_gpu` function performs exactly the same computations using the GPU-optimized versions of `dist` and `hclust` (`rpuDist` and `rpuHclust`) from the `rpud` package.

The R script creates a matrix *m* of a particular size by calling `test.data` and then measures and displays the time required to create a hierarchical cluster using both the CPU and GPU functions.

```r
library("rpud")

#
# function to populate a data matrix
#
test.data <- function(dim, num, seed=10) {
  set.seed(seed)
  matrix(rnorm(dim * num), nrow=num)
}

run_cpu <- function(matrix) {
  dcpu <- dist(matrix)
  hclust(dcpu)
}

run_gpu <- function(matrix) {
  dgpu <- rpuDist(matrix)
  rpuHclust(dgpu)
}

#
# create a matrix with 20,000 observations each with 100 data elements
#
m <- test.data(100, 20000)

#
# Run dist and hclust to calculate hierarchical clusters using CPU
#
print("Calculating hclust with Sandy Bridge CPU")
print(system.time(cpuhclust <- run_cpu(m)))

#
# Run dist and hclust to calculate hierarchical clusters using GPU
#
print("Calculating hclust with NVIDIA K520 GPU")
print(system.time(gpuhclust <- run_gpu(m)))
```

Running the script yields the following results:

```
> source('~/examples/rgpu_hclust.R')
[1] "Calculating hclust with Sandy Bridge CPU"
   user  system elapsed
294.760   0.746 295.314
[1] "Calculating hclust with NVIDIA K520 GPU"
   user  system elapsed
 19.285   3.160  22.431
```

To explore the GPU vs. CPU speedup, we ran the script on datasets with a varying number of rows and plotted the results. The distance calculation is highly parallel on the GPU, while much of the GPU-optimized `hclust` calculation still runs on the CPU. Even so, the speedup holds up well as the dataset grows, because the time required for the highly parallel distance calculations dominates.

| Number of rows | Number of dimensions | Total elements | # distance calculations | CPU time (seconds) | GPU time (seconds) | Speed-up |
|---|---|---|---|---|---|---|
| 1,000 | 100 | 100,000 | 1,998,000 | 0.50 | 0.04 | 11.8 |
| 2,000 | 100 | 200,000 | 7,996,000 | 2.06 | 0.17 | 12.1 |
| 5,000 | 100 | 500,000 | 49,990,000 | 13.42 | 1.17 | 11.5 |
| 10,000 | 100 | 1,000,000 | 199,980,000 | 59.83 | 5.03 | 11.9 |
| 15,000 | 100 | 1,500,000 | 449,970,000 | 141.15 | 11.61 | 12.2 |
| 20,000 | 100 | 2,000,000 | 799,960,000 | 295.31 | 22.43 | 13.2 |

Looking at the run times side by side, we see that running the multiple steps with the GPU is over ten times faster than running on the CPU alone.

The result of our analysis is a hierarchical cluster that we can display as a dendrogram like Figure 1 using R’s `plot` command.

```r
> plot(gpuhclust, hang = -1)
```

Our results clearly show that running this type of analysis on a GPU makes a lot of sense. Not only can we complete calculations ten times faster, but just as importantly we can reduce the cost of resources required to do our work. We can use these efficiencies to do more thorough analysis and explore more scenarios. By using the Teraproc service, we make GPU computing much more accessible to R programmers who may not otherwise have access to GPU-capable nodes.

In a future post we’ll show how you can tackle very large analysis problems with clusters of GPU-capable machines. Try out Teraproc R Analytics Cluster-as-a-Service today! To learn about other ways to accelerate your R code with GPUs, check out the post Accelerate R Applications with CUDA by NVIDIA’s Patric Zhao.

Neural machine translation is a recently proposed framework for machine translation based purely on neural networks. This post is the first of a series in which I will explain a simple encoder-decoder model for building a neural machine translation system [Cho et al., 2014; Sutskever et al., 2014; Kalchbrenner and Blunsom, 2013]. In a later post I will describe how an attention mechanism can be incorporated into the simple encoder-decoder model [Bahdanau et al., 2015], leading to the state-of-the-art machine translation model for a number of language pairs including En-Fr, En-De, En-Tr and En-Zh [Gulcehre et al., 2015; Jean et al., 2015]. Furthermore, I will introduce recent work which has applied this framework of neural machine translation to image and video description generation [Xu et al., 2015; Li et al., 2015].

First, let’s start with a brief overview of machine translation. In fact, the name, machine translation, says everything. We want a machine to translate text in one language, which we will call the source sentence, to corresponding text in another language, which we call the target sentence. (Although ideally the machine should be able to translate a whole document from one language to another, let us concentrate in this blog post on sentence-level machine translation.)

There are multiple ways to build such a machine that can translate languages. For instance, we can ask a bilingual speaker to give us a set of rules transforming a source sentence into a correct translation. This is not a great solution, as you can imagine, because we don’t even know the set of rules underlying a single language, not to mention the rules underlying a pair of languages. It is simply hopeless to write an exhaustive set of rules for translating a source sentence into a correct translation. Hence, in this blog post, we focus on a statistical approach where those rules, either implicitly or explicitly, are automatically extracted from a large corpus of text.

This statistical approach to machine translation is called statistical machine translation. The goal is the same (build a machine that translates a sentence from one language to another), but we let the machine learn from data how to translate rather than design a set of rules for the machine (See Fig. 1 for a graphical illustration.) Learning is based on statistical methods, which should sound familiar to anyone who has taken a basic course on machine learning. In fact, statistical machine translation is nothing but a particular application of machine learning, where the task is to find a function that maps from a source sentence to a corresponding target.

One important characteristic of machine translation is that the target (translation) function is neither one-to-one nor many-to-one as in many other applications of machine learning (such as classification, which is many-to-one), but one-to-many, in the sense that one source sentence can be translated into many possible translations. Because of this, we model the translation function not as a deterministic function but as a conditional probability $p(y \mid x)$ of a target sentence (translation) $y$ given a source sentence $x$. The conditional probability may assign equally high probability to more than one well-separated configuration/sentence, leading to a one-to-many relationship between source and target sentences.

Now let’s say you want to build a statistical machine translation system that translates a source sentence in English to a sentence in French. The first and probably most important job is to collect pairs of source sentences and their corresponding translations. I will use $x^n$ and $y^n$ to represent a source sentence and its corresponding translation, respectively. The superscript $n$ means that it’s the $n$-th pair in a set of many more pairs (often, we need tens to hundreds of thousands of pairs to train a good translation model). I’ll use $D = \{(x^1, y^1), \ldots, (x^N, y^N)\}$ to denote the data set with $N$ pairs.

Where can I get these training pairs? For widely used languages in machine translation, you probably want to check out the Workshop on Statistical Machine Translation or the International Workshop on Spoken Language Translation.

With the training data in hand, we can now score a model by looking at how well the model works on the training data $D$. The score, which I’ll call the log-likelihood of the model, is the average of the log-likelihood of the model on each pair $(x^n, y^n)$. With the probabilistic interpretation of the machine translation model, the log-likelihood of the model on each pair is simply how high a log-probability the model assigns to the pair: $\log p(y^n \mid x^n, \theta)$, where $\theta$ is the set of parameters that defines the model. Then, the overall score of the model on the training data is

$$\mathcal{L}(\theta) = \frac{1}{N} \sum_{n=1}^{N} \log p(y^n \mid x^n, \theta).$$

If the log-likelihood is low, the model is not giving enough probability mass to the correctly translated pairs, meaning that it’s wasting its probability mass on some wrong translations. Thus, we want to find a configuration of the model, or the values of the parameters $\theta$, that maximizes this log-likelihood, or score.

In machine learning, this is known as a maximum likelihood estimator. But we’re left with a perhaps more important question: how do we model $p(y \mid x)$?
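To make this scoring concrete, here is a toy Python sketch. Everything in it, the lookup-table "model" and its probabilities, is invented purely for illustration; a real translation model would of course not be a table:

```python
import math

# A hypothetical toy "model": a lookup table of conditional probabilities p(y | x)
model = {
    ("the cat", "le chat"): 0.6,
    ("the dog", "le chien"): 0.5,
    ("a house", "une maison"): 0.4,
}

def log_likelihood(model, pairs):
    """Average log p(y | x) over the training pairs -- the score to maximize."""
    return sum(math.log(model[(x, y)]) for x, y in pairs) / len(pairs)

score = log_likelihood(model, list(model.keys()))
# score is negative; it approaches 0 as the model puts more mass on correct pairs
```

Maximum likelihood estimation searches over the model's parameters for the configuration that makes this average as large as possible.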

This question of how to model the conditional distribution has been asked and answered for a long time, starting more than 20 years ago at the IBM T.J. Watson Research Center [Brown et al., 1993 and references therein]. The core of research on statistical machine translation (SMT) since then has been a log-linear model, where we approximate the logarithm of the true $p(y \mid x)$ with a linear combination of many features:

$$\log p(y \mid x) \approx \sum_{i} w_i f_i(x, y) - \log Z(x),$$

where $Z(x)$ is the normalization constant. In this case, a large part of the research comes down to finding a good set of feature functions $f_i$, and there is a very well-written textbook that covers all the details about it [Koehn, 2009].

In this approach to statistical machine translation, often the only thing left to machine learning is to find a good set of coefficients $w_i$ that balance the different features, or to filter/re-rank a set of potential translations decoded from the log-linear model [Schwenk, 2007]. More specifically, neural networks have been used both as part of the feature functions and to re-rank so-called $n$-best lists of possible translations, as in the middle and right panels of Fig. 2.

In this blog post, on the other hand, I focus on a recently proposed approach, called neural machine translation, where machine learning, and more specifically a neural network, has more or even full control, as in the left panel of Fig. 2. As is usual with general deep learning, neural machine translation (NMT) does not rely on pre-designed feature functions. (By pre-designed feature functions, I mean those that are not learned.) Rather, the goal of NMT is to design a fully trainable model of which every component is tuned based on training corpora to maximize its translation performance.

A fully trainable NMT model $\mathcal{M}$ starts from as raw a representation of a source sentence as possible and finishes by generating as raw a representation of a target sentence as possible. Here, let’s consider a sequence of words as the most raw representation of a sentence. (This is not true for most natural languages, but without loss of generality, I will consider a word the smallest unit.) Each word in a sequence is represented by its integer index in a vocabulary. For instance, in a vocabulary of English sorted according to frequency, “the” would be the first word, represented as the integer 1. Let me use $X = (x_1, x_2, \ldots, x_T)$ to denote a source sentence, and $Y = (y_1, y_2, \ldots, y_{T'})$ a target sentence.

Given a source sequence $X$ of word indices, the NMT model computes the conditional probability of the target sequence, $p(Y \mid X)$. Next I’ll discuss how we can build a neural network to approximate this conditional probability.

One important property of machine translation, or any task based on natural languages, is that we deal with variable-length input $X$ and output $Y$. In other words, the lengths $T$ and $T'$ are not fixed.

To deal with these types of variable-length input and output, we need to use a recurrent neural network (RNN). Widely used feed-forward neural networks, such as convolutional neural networks, do not maintain internal state other than the network’s own parameters. Whenever a single sample is fed into a feed-forward neural network, the network’s internal state, or the activations of the hidden units, is computed from scratch and is not influenced by the state computed from the previous sample. On the other hand, an RNN maintains its internal state while reading a sequence of inputs, which in our case will be a sequence of words, thereby being able to process an input of any length.

Let me explain this in more detail. The main idea behind RNNs is to compress a sequence of input symbols into a fixed-dimensional vector by using recursion. Assume that at step $t$ we have a vector $h_{t-1}$, which is the history of all the preceding symbols. The RNN will compute the new vector, or its internal state, $h_t$, which compresses all the preceding symbols $(x_1, x_2, \ldots, x_{t-1})$ as well as the new symbol $x_t$ by

$$h_t = f_\theta(x_t, h_{t-1}),$$

where $f_\theta$ is a function parametrized by $\theta$ that takes as input the new symbol $x_t$ and the history $h_{t-1}$ up to the $(t-1)$-th symbol. Initially, we can safely assume that $h_0$ is an all-zero vector.

The recurrent activation function $f$ is often implemented as, for instance, a simple affine transformation followed by an element-wise nonlinear function:

$$h_t = \tanh(W x_t + U h_{t-1} + b).$$

In this formulation, the parameters include the input weight matrix $W$, the recurrent weight matrix $U$ and the bias vector $b$. I must say that this is not the only possibility, and there is considerable room for designing novel recurrent activation functions. See Fig. 3 for some examples from [Pascanu et al., 2014].

This simple type of RNN can be implemented very easily using (for instance) Theano, which allows your RNN to be run on either the CPU or the GPU transparently. See Recurrent Neural Networks with Word Embeddings; note that the whole RNN code is written in less than 10 lines!
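In the same spirit, here is a minimal NumPy sketch of the recurrence itself (toy dimensions and random weights, chosen only for illustration; this is not the Theano code referenced above):

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 3

# Parameters: input weights W, recurrent weights U, bias b (randomly initialized)
W = rng.standard_normal((hidden_dim, input_dim))
U = rng.standard_normal((hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

def rnn_step(x_t, h_prev):
    # One step of the recurrence: affine transformation + element-wise tanh
    return np.tanh(W @ x_t + U @ h_prev + b)

# Compress a variable-length sequence into one fixed-dimensional state vector
sequence = rng.standard_normal((7, input_dim))  # a length-7 input sequence
h = np.zeros(hidden_dim)                        # h_0 is an all-zero vector
for x_t in sequence:
    h = rnn_step(x_t, h)
```

Note that the loop runs for however many symbols the sequence contains, yet `h` always has the same fixed dimensionality; that is exactly what lets an RNN handle variable-length input.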

Recently, it has been observed that it is better, or easier, to train a recurrent neural network with more sophisticated activation functions such as long short-term memory units [Hochreiter and Schmidhuber, 1997] and gated recurrent units [Cho et al., 2014].

As was the case with the simple recurrent activation function, the parameters here include the input weight matrices $W$, $W_z$ and $W_r$, the recurrent weight matrices $U$, $U_z$ and $U_r$, and the bias vectors $b$, $b_z$ and $b_r$.

Although these units look much more complicated than the simple RNN, the implementation with Theano, or any other deep learning framework, such as Torch, is just as simple. For instance, see LSTM Networks for Sentiment Analysis (example code).

I have explained a recurrent neural network (RNN) as a history compressor, but it can also be used to probabilistically model a sequence. Here, by probabilistically modeling a sequence I mean a machine learning model that computes the probability $p(x_1, x_2, \ldots, x_T)$ of any given sequence $(x_1, x_2, \ldots, x_T)$. How can we formulate this probability such that it can be written as a recurrence?

Let’s start by rewriting $p(x_1, x_2, \ldots, x_T)$ as

$$p(x_1, x_2, \ldots, x_T) = p(x_1)\, p(x_2 \mid x_1) \cdots p(x_T \mid x_1, \ldots, x_{T-1}),$$

which comes from the definition of conditional probability, $p(x \mid y) = p(x, y) / p(y)$. From this, we can make a recursive formula such that

$$p(x_1, x_2, \ldots, x_t) = p(x_1, x_2, \ldots, x_{t-1})\, p(x_t \mid x_{<t}).$$

Now, we let an RNN model $p(x_t \mid x_{<t})$ at each time $t$ by

$$h_{t-1} = f_\theta(x_{t-1}, h_{t-2}), \qquad p(x_t \mid x_{<t}) = g_\theta(h_{t-1}).$$

The RNN outputs a probability distribution over the next symbol, conditioned on the whole history up to the $(t-1)$-th symbol, via $g_\theta$. In other words, at each time step, the RNN tries to predict the next symbol given the history of the input symbols.
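Putting the two pieces together, here is a toy next-symbol model in NumPy that scores a sequence via the chain rule above. All parameters are random and the vocabulary size is invented; the point is only the shape of the computation:

```python
import numpy as np

rng = np.random.default_rng(1)
vocab_size, hidden_dim = 5, 8

# Toy parameters (random): symbol embeddings E, recurrence U, output projection V
E = rng.standard_normal((vocab_size, hidden_dim))
U = rng.standard_normal((hidden_dim, hidden_dim))
V = rng.standard_normal((vocab_size, hidden_dim))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sequence_log_prob(tokens):
    """log p(x_1, ..., x_T) accumulated term by term via the chain rule."""
    h = np.zeros(hidden_dim)
    total = 0.0
    for token in tokens:
        probs = softmax(V @ h)          # p(x_t | x_<t): a distribution over the vocabulary
        total += np.log(probs[token])
        h = np.tanh(E[token] + U @ h)   # fold the observed symbol into the state
    return total

lp = sequence_log_prob([0, 3, 1, 4])    # log-probability of a toy sequence
```

At each step the state summarizes everything seen so far, so the same few parameter matrices score sequences of any length.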

There are a lot of interesting properties and characteristics of recurrent neural networks that I would love to spend hours talking about, but I will have to stop here for this blog post, since what I have described so far is everything you need to start building a neural machine translation system. For those who are more interested in recurrent neural networks, I suggest you read the following papers; obviously, this list is not exhaustive. You can also check out my slides on how to use recurrent neural networks for language modeling.

- Graves, Alex. “Generating sequences with recurrent neural networks.” arXiv preprint arXiv:1308.0850 (2013).
- Pascanu, Razvan et al. “How to construct deep recurrent neural networks.” arXiv preprint arXiv:1312.6026 (2013).
- Boulanger-Lewandowski, Nicolas, Yoshua Bengio, and Pascal Vincent. “Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription.” arXiv preprint arXiv:1206.6392 (2012).
- Mikolov, Tomas et al. “Recurrent neural network based language model.” INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, Makuhari, Chiba, Japan, September 26-30, 2010 1 Jan. 2010: 1045-1048.
- Hochreiter, Sepp, and Jürgen Schmidhuber. “Long short-term memory.” Neural computation 9.8 (1997): 1735-1780.
- Cho, Kyunghyun et al. “Learning phrase representations using RNN Encoder-Decoder for statistical machine translation.” arXiv preprint arXiv:1406.1078 (2014).
- Bengio, Yoshua, Patrice Simard, and Paolo Frasconi. “Learning long-term dependencies with gradient descent is difficult.” Neural Networks, IEEE Transactions on 5.2 (1994): 157-166.

In this post, I introduced machine translation, and described how statistical machine translation approaches the problem of machine translation. In the framework of statistical machine translation, I have discussed how neural networks can be used to improve the overall translation performance.

The goal of this blog series is to introduce a novel paradigm for neural machine translation; this post laid the groundwork, concentrating on two key capabilities of recurrent neural networks: sequence summarization and probabilistic modeling of sequences.

Based on these two properties, in the next post, I will describe the actual neural machine translation system based on recurrent neural networks. I’ll also show you why GPUs are so important for Neural Machine Translation! Stay tuned.

Today software companies use frameworks such as .NET to target multiple platforms from desktops to mobile phones with a single code base to reduce costs by leveraging existing libraries and to cope with changing trends. While developers can easily write scalable parallel code for multi-core CPUs on .NET with libraries such as the task parallel library, they face a bigger challenge using GPUs to tackle compute intensive tasks. To accelerate .NET applications with GPUs, developers must write functions in CUDA C/C++ and write or generate code to interoperate between .NET and CUDA C/C++.

Alea GPU closes this gap by bringing GPU computing directly into the .NET ecosystem. With Alea GPU you can write GPU functions in any .NET language you like, compile them with your standard .NET build tool and accelerate them with a GPU. Alea GPU offers a full implementation of all CUDA features, and code compiled with Alea GPU performs as well as equivalent CUDA C/C++ code.

Alea GPU is a professional CUDA development stack for .NET and Mono built directly on top of the NVIDIA compiler toolchain. Alea GPU offers the following benefits:

- Easy to use
- Cross-platform
- Support for many existing GPU algorithms and libraries
- Debugging and profiling functionality
- JIT compilation and a compiler API for GPU scripting
- Future-oriented technology based on LLVM
- No compromise on performance

You can easily install Alea GPU as a Nuget package, as Figure 1 shows.

Alea GPU is easy to use for all kinds of parallel problems. Developers can write GPU code in any .NET language and use the full set of CUDA device functions provided by NVIDIA LibDevice, as well as CUDA device parallel intrinsic functions, such as thread synchronization, warp vote functions, warp shuffle functions, and atomic functions. Let’s consider a simple example which applies the same calculation to many data values. `SquareKernel` is a GPU kernel written in C# that accesses memory on the GPU.

```csharp
static void SquareKernel(deviceptr<double> outputs, deviceptr<double> inputs, int n)
{
    var start = blockIdx.x * blockDim.x + threadIdx.x;
    var stride = gridDim.x * blockDim.x;
    for (var i = start; i < n; i += stride)
    {
        outputs[i] = inputs[i] * inputs[i];
    }
}
```

Alea GPU kernels require no special attribution and have access to the full CUDA semantics. Invoking a CUDA kernel requires configuring the thread block and grid layout, transferring data to device memory, and launching the kernel. The above `SquareKernel` GPU function can be launched as shown in the following code.

```csharp
static double[] SquareGPU(double[] inputs)
{
    var worker = Worker.Default;
    using (var dInputs = worker.Malloc(inputs))
    using (var dOutputs = worker.Malloc<double>(inputs.Length))
    {
        const int blockSize = 256;
        var numSm = worker.Device.Attributes.MULTIPROCESSOR_COUNT;
        var gridSize = Math.Min(16 * numSm, Common.divup(inputs.Length, blockSize));
        var lp = new LaunchParam(gridSize, blockSize);
        worker.Launch(SquareKernel, lp, dOutputs.Ptr, dInputs.Ptr, inputs.Length);
        return dOutputs.Gather();
    }
}
```

When we call `worker.Launch`, Alea GPU Just-In-Time (JIT) compiles the kernel function `SquareKernel`, loads it into the `worker` and executes it on the GPU attached to the worker.
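The kernel’s indexing scheme is a standard CUDA grid-stride loop. A small Python simulation (purely illustrative, not Alea GPU code) shows that it covers every element exactly once, even when the array is larger than the total number of threads:

```python
def grid_stride_indices(block_size, grid_size, n):
    """Simulate which elements each CUDA thread of a grid-stride loop touches."""
    num_threads = block_size * grid_size
    covered = []
    for tid in range(num_threads):                  # tid = blockIdx.x * blockDim.x + threadIdx.x
        covered.extend(range(tid, n, num_threads))  # stride = gridDim.x * blockDim.x
    return sorted(covered)

# 8 threads, 21 elements: each element is processed exactly once
assert grid_stride_indices(block_size=4, grid_size=2, n=21) == list(range(21))
```

This is why `SquareGPU` can cap the grid size (16 blocks per multiprocessor above) without worrying about the input length: leftover elements are picked up on later iterations of the loop.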

The JIT compilation workflow is extremely flexible. It allows code generation and execution on the fly, enabling GPU scripting and rapid prototyping. JIT compilation is also very useful for application scenarios where the algorithms depend on runtime information. JIT compilation adds a small start-up time overhead and requires deployment of the Alea GPU compiler along with the application.

An alternative is Ahead-Of-Time (AOT) compilation. For kernel functions tagged with the `AOTCompile` attribute, the Alea GPU compiler generates PTX code at compile time and embeds it into the assembly as a binary resource.

```csharp
[AOTCompile]
static void SquareKernel(deviceptr<double> outputs, deviceptr<double> inputs, int n)
...
```

AOT compilation saves run-time compilation overhead and simplifies deployment because only the Alea GPU runtime components need to be installed. More details about JIT and AOT compilation can be found in the Alea GPU manual.

Another benefit of GPU development in .NET is that all GPU resources are managed, which simplifies development and leads to more robust code. For example, all memory objects allocated through a `Worker` instance are disposable. The `using` statement

```csharp
using (var dOutputs = worker.Malloc<double>(inputs.Length))
{
    ...
}
```

is a convenient syntax that ensures the correct use of `IDisposable` objects, providing a clean and safe mechanism for releasing unmanaged resources. You can find more details in the Alea GPU tutorial.

Alea GPU is fully cross-platform. The code is compiled on one platform and the resulting assembly is binary compatible with all other platforms. Alea GPU supports Windows, Linux and Mac OS X, and is also tested on ARM-based Tegra development kits.

In combination with other .NET libraries, impressive cross-platform GPU-accelerated applications with sophisticated user interfaces or graphics visualization can be developed. The n-body simulation (Figure 2) in the Alea GPU tutorial is an example which uses OpenGL through OpenTK to display the simulation results. Its code base is 100% cross-platform.

Developing high-performance generic GPU kernels for basic parallel primitives such as scan, reduce, sort or linear algebra codes for parallelized matrix multiplication or linear system solving is challenging and time-consuming.

Alea GPU offers productivity gains in the form of a range of GPU algorithms and integrated libraries such as cuBLAS and cuDNN. These library interfaces are fully type-safe, and library functions can be mixed seamlessly with custom GPU kernels developed in .NET as both rely on the same memory management and data types for GPU memory and GPU pointers.

Alea GPU provides a rich set of device-side functions and advanced CUDA features which are useful for creating sophisticated GPU algorithms, including

- All CUDA intrinsic functions, such as `__ballot`, `__atomic_add`, etc.;
- The complete set of LibDevice functions;
- Additional useful functions exposed under `LibDeviceEx`.

Alea GPU is flexible enough to handle complex CUDA code found in some advanced CUDA C++ libraries. A good example is the CUB library of generic GPU parallel algorithm primitives. We have ported a subset of the CUB primitives to .NET using Alea GPU and made them available in Alea Unbound. Here is an example of how to use the device level sum scan primitive in C#:

```csharp
public static void DeviceScanInclusive()
{
    const int numItems = 1000000;
    var rng = new Random(42);
    var inputs = Enumerable.Range(0, numItems).Select(i => rng.Next(-10, 10)).ToArray();
    var gpuScanModule = DeviceSumScanModuleI32.Default;
    using (var gpuScan = gpuScanModule.Create(numItems))
    using (var dInputs = gpuScanModule.GPUWorker.Malloc(inputs))
    using (var dOutputs = gpuScanModule.GPUWorker.Malloc<int>(inputs.Length))
    {
        gpuScan.InclusiveScan(dInputs.Ptr, dOutputs.Ptr, numItems);
        var actual = dOutputs.Gather();
        Assert.AreEqual(actual, inputs.ScanInclusive(0, (a, b) => a + b).ToArray());
    }
}
```
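For reference, the semantics of an inclusive scan are easy to state in a few lines of Python (this sequential version is just for illustration and is independent of Alea GPU):

```python
from itertools import accumulate

def inclusive_scan(values, op):
    # output[i] combines values[0..i] with the binary operator op
    return list(accumulate(values, op))

print(inclusive_scan([3, 1, 7, 0, 4], lambda a, b: a + b))  # [3, 4, 11, 11, 15]
print(inclusive_scan([3, 1, 7, 0, 4], max))                 # [3, 3, 7, 7, 7]
```

The GPU version produces the same result, but computes it in parallel across thousands of threads rather than as a single left-to-right pass.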

The generic scan primitive `Primitives.DeviceScanModule` expects the binary operator to be used in the scan process as a delegate `Func<T, T, T>`:

```csharp
public static void DeviceGenericScanInclusive()
{
    const int numItems = 1000000;
    var rng = new Random(42);
    var inputs = (Enumerable.Repeat(rng, numItems).Select(gen)).ToArray();
    Func<int, int, int> scanOp = Math.Max;
    using (var gpuScanModule = new Primitives.DeviceScanModule<int>(GPUModuleTarget.DefaultWorker, scanOp))
    using (var gpuScan = gpuScanModule.Create(numItems))
    using (var dInputs = gpuScanModule.GPUWorker.Malloc(inputs))
    using (var dOutputs = gpuScanModule.GPUWorker.Malloc<int>(inputs.Length))
    {
        gpuScan.InclusiveScan(dInputs.Ptr, dOutputs.Ptr, numItems);
        var actual = dOutputs.Gather();
        Assert.AreEqual(actual, inputs.ScanInclusive(zero, scanOp).ToArray());
    }
}
```

Following the design of CUB, Alea Unbound has warp, block and device-wide primitives. The warp- and block-wide primitives can be used in kernels as convenient plugin components to write new algorithms. Alea Unbound algorithms deliver the same performance as the CUB CUDA C/C++ counterparts. They are all implemented in F# using warp shuffle or shared memory with union storage for optimal shared memory use.

Besides the primitive algorithms, Alea Unbound also provides fast implementations of matrix multiplication, matrix transpose, random number generators, statistical functions and some linear system solvers.

Alea GPU provides first class tools for coding, debugging and profiling which are fully integrated into Visual Studio. GPU kernels developed with Alea GPU can be debugged on Windows with the NVIDIA Nsight Visual Studio Debugger.

To support debugging and profiling, Alea GPU has three compilation levels: `Optimized`, `Profiling` and `Diagnostic`:

| Level | Description | Profiling | Debugging |
|---|---|---|---|
| Optimized | No source line information nor variable metadata | No | No |
| Profiling | Source line information but no variable metadata | Yes | No |
| Diagnostic | Source line information and variable metadata | Yes | Yes |

The Nsight Visual Studio debugger allows breakpoints to be set directly in Alea GPU source code, even in F# code quotations. The full range of standard debugging features is available, such as memory inspection, local variable values and memory checks, as Figure 3 shows. The debugging functionality is based upon LLVM debug meta information generated by Alea GPU.

The compilation level `Profiling` also supports source code correlation, as Figure 4 shows.

The Alea GPU tutorial has detailed explanations about how to debug and profile GPU-accelerated .NET applications.

Developing GPU algorithms is often an iterative process. Usually many variations have to be explored to fine-tune the algorithm. GPU scripting and rapid prototyping greatly improve productivity and encourage the developer to thoroughly investigate the efficiency of the code.

Alea GPU is the only solution that can deliver GPU scripting and a REPL in Visual Studio for interactive prototyping of GPU code. F# code can be directly sent to the F# interactive console for execution. The JIT compilation mode of Alea GPU allows for the execution of F# GPU code on the fly in the F# interactive console, as Figure 5 shows.

GPU code can also be embedded in F# scripts, which are then executed with `fsi.exe` as ordinary scripts on the console.

The example above illustrates the most basic usage of Alea GPU. It uses plain functions or methods for GPU code and a separate function for memory management and kernel execution and is therefore very well-suited for simple applications. For more complex problems Alea GPU offers two alternatives: Class instances and Workflows.

Class instances use a class derived from `GPUModule` or `ILGPUModule` to manage all GPU resources. CUDA compile-time arguments can be supplied to the constructor. This allows for the creation of advanced kernels using generics, such as the following example generic map kernel.

```csharp
internal class TransformModule<T> : ILGPUModule
{
    private readonly Func<T, T> op;

    public TransformModule(GPUModuleTarget target, Func<T, T> opFunc)
        : base(target)
    {
        op = opFunc;
    }

    [Kernel]
    public void Kernel(int n, deviceptr<T> x, deviceptr<T> y)
    {
        var start = blockIdx.x * blockDim.x + threadIdx.x;
        var stride = gridDim.x * blockDim.x;
        for (var i = start; i < n; i += stride)
            y[i] = op(x[i]);
    }

    ...
}
```

Workflows specify all GPU resources and kernels in composable `cuda {...}` workflow blocks. This exposes the full expressive power of Alea GPU and is very well suited for scripting. This feature is only available in F#.

```fsharp
let template (transform:Expr<int -> int -> int>) = cuda {
    let! kernel =
        <@ fun (z:deviceptr<int>) (x:deviceptr<int>) (y:deviceptr<int>) (n:int) ->
            let start = blockIdx.x * blockDim.x + threadIdx.x
            let stride = gridDim.x * blockDim.x
            let mutable i = start
            while i < n do
                z.[i] <- (%transform) x.[i] y.[i]
                i <- i + stride @>
        |> Compiler.DefineKernel

    return Entry(fun program ->
        let worker = program.Worker
        let kernel = program.Apply kernel
        let lp = LaunchParam(16, 256)
        let run (x:int[]) (y:int[]) =
            let n = x.Length
            use x = worker.Malloc(x)
            use y = worker.Malloc(y)
            use z = worker.Malloc(n)
            kernel.Launch lp z.Ptr x.Ptr y.Ptr n
            z.Gather()
        run) }
```

You can find more details about the programming approaches that Alea GPU supports in the Alea GPU tutorial.

Alea GPU is a complete compiler built on top of the popular LLVM compiler infrastructure and the NVIDIA CUDA compiler SDK. Code compiled with Alea GPU delivers the same performance as equivalent CUDA C/C++ or CUDA Fortran. Alea GPU takes advantage of the code optimization passes in LLVM as well as the GPU-specific optimizations provided in the NVIDIA NVVM compiler back-end.

Alea GPU contains individual packages which can be conveniently installed through NuGet. The deployment package Alea.CUDA.Fody installs the necessary compilers together with the ahead-of-time compilation tool.

To install Alea GPU, run the following command in the Package Manager Console:

PM> Install-Package Alea.CUDA.Fody

Alea GPU requires a CUDA-capable GPU with compute capability 2.0 (Fermi architecture) or higher and an installed CUDA driver version 6.5 or higher.

Finally, you need an Alea GPU license. The community edition is free and supports consumer GPUs of the GeForce product line. Register on the QuantAlea web page, select `Client Login` and sign up to retrieve a free community edition license. For applications which require enterprise hardware or mobile GPUs, QuantAlea provides commercial licenses.

We are interested in hearing all of your feedback and suggestions for Alea GPU. Write to us at info@quantalea.com or @QuantAlea on Twitter.

> Went from training 700 img/s in MNIST to 1500 img/s (using CUDA) to 4000 img/s (using cuDNN) that is just freaking amazing! @GPUComputing
>
> — Leon Palafox (@leonpalafox) March 27, 2015

I stumbled upon the above tweet by Leon Palafox, a Postdoctoral Fellow at the University of Arizona Lunar and Planetary Laboratory, and reached out to him to discuss his success with GPUs and share it with other developers interested in using deep learning for image processing.

We are working on developing a tool that can automatically identify various geological processes on the surface of Mars. Examples of geological processes include impact cratering and volcanic activity; however, these processes can generate landforms that look very similar, even though they form via vastly different mechanisms. For example, small impact craters and volcanic craters can be easily confused because they can both exhibit a prominent rim surrounding a central topographic depression.

Of particular interest to our research group is the automated mapping of volcanic rootless cones as Figure 2 shows. These landforms are generated by explosive interactions between lava and ground ice, and therefore mapping the global distribution of rootless cones on Mars would contribute to a better understanding of the distribution of near-surface water on the planet. However, to do this we must first develop algorithms that can correctly distinguish between landforms of similar appearance. This is a difficult task for planetary geologists, but we are already having great success by applying state-of-the-art artificial neural networks to data acquired by the High Resolution Imaging Science Experiment (HiRISE) camera, which is onboard the Mars Reconnaissance Orbiter (MRO) satellite.

The project is in the development phase; we expect to have it completed in one or two years depending on the number of features that we wish to train for. Previously we spent much of our time processing the images with the CNN, but now, thanks to the NVIDIA cuDNN library, we have substantially reduced that time.

As of now, we are focusing on the identification of volcanic rootless cones and impact craters, but plan to extend our search to include other landforms like sand dunes, recurring slope lineae (thought to be formed by seasonal seeps of surface water), and cloud formations. Of particular interest are dynamic phenomena because once we have developed a robust identification algorithm we can apply it to time series satellite observations to investigate how the Martian environment changes through time. Mars provides an ideal place to develop and test such approaches, but our ultimate aim will be to apply similar techniques to study the Earth.

We used a MATLAB approach to access the CUDA library, since at present most of our implementation is in MATLAB. We use the MatConvNet framework, which—like Theano and Caffe—provides a great set of tools to build and deploy your own Convolutional Neural Network (CNN) architectures. It also provides great CUDA interfaces to the cuDNN library.

We are still fine tuning some of the libraries, but in essence we use a CNN very similar to LeNet, albeit modified to work in this particular regime. We are also running five CNNs in parallel, each of which is using different pixel sizes to search for differently scaled features in the image.

We have five machines and each has two NVIDIA Quadro K5000s.

Figures 1 and 3 show the output in a region in Elysium Planitia, where we use the CNN to map the location of volcanic rootless cones. This process, if done at a larger scale, is an incredible tool for understanding the geologic history of Mars. The output has been processed by five CNNs looking for features at different scales, and finally pooled to generate a contour map.

So far we have trained on 800 examples of Martian landforms selected from full-resolution HiRISE images. Each HiRISE image typically has a resolution of 0.25 m/pixel and covers a swath that is 6-km-wide, resulting in file sizes up to 3.4 GB. Individual HiRISE images are spectacular, but the most impressive data products are digital terrain models generated using stereo-photogrammetry (Figure 3). This data enables us to visualize the surface of Mars in three dimensions and generate simulated illumination images that can be used to expand our natural training sets.

Additionally, we are using the trained Convolutional Neural Networks (CNNs) to examine thousands of HiRISE images within Elysium Planitia. This is a volcanic region that includes some of the youngest lava flows on Mars, covering a total of millions of square kilometers (Figure 1). Performing a manual search for every rootless cone in this region would be prohibitively time-consuming. Fortunately, automated approaches will enable us to map these landforms over a vast area and use the results of this systematic survey to infer the regional distribution of former ground-ice deposits in Elysium Planitia for the first time.

This project has many challenges, from the algorithm implementation to the analysis of the results. I think the biggest challenge is having readily available databases to use and train over for the different features on the surface of Mars.

While there are some databases, not all of them are very consistent, unlike in the computer vision community, which has the MNIST and CIFAR standards.

This is both good and bad. It is good in the sense that it allows you to tackle a real-world problem with state-of-the-art tools, but since databases are not consistent, there is a lot of skepticism in the community about whether the approach will work with all the features on the surface. However, in planetary science, the situation is different because the data collected from instruments, like HiRISE, is made freely available to the public in a standardized form through the NASA Planetary Data System (PDS).

I’ve used CUDA, but not intensively before now because most of my previous research has focused on processing other kinds of data than images. I had a couple of classes and projects where I used CUDA before, but this has been the first time that it really became critical to optimize the efficiency of the image analyses approach to investigate a problem on such a large scale.

In this short time, having access to a powerful GPU greatly reduces the amount of time that I need to process each of the images that I need to analyze. These images are from the HiRISE database, which consists of 35,000 grayscale and color images with a total database size over 25 TB, and I need to apply different types of CNNs to classify them correctly and to choose the most suitable architecture for our particular problem.

Without using GPUs, it would take days to finish processing a single image, while recent results have shown that an hour or two is enough to process a single image using a GPU.

Having access to different research groups and topics has allowed me to use the experience I have gained from some Machine Learning topics in one area and apply it in a different way to a different area. For example, my work in Bayesian networks applied to gene networks has a huge applicability in my previous work at UCLA on EEG data from Brain–Machine Interfaces.

And surprisingly enough, my work on time series analysis can also find a use in planetary sciences by using things like Hidden Markov Models to model various terrain profiles.

Machine learning has gone from a niche research-oriented area in the 1990s to a boom in the industry in the past decade. Part of it has been the explosion of readily available data that we have due to the information revolution of the last several years.

Many companies are investing a large amount of resources in their own data science divisions, like Facebook, which created its own machine learning laboratory a year ago. This has gotten people more excited about using machine learning tools, since it increases their value in the job market.

This is not without its downside, since many people will apply Machine Learning software without knowing the nuts and bolts of the process, which could result in disappointment for companies in the future when they realize their classifiers and tools are not tuned to their particular datasets. I’ve seen my fair share of implementations where there was no preanalysis of the data whatsoever, and which just used the same tuning parameters as the textbook example.

I think that in the next few years more sophisticated Machine Learning tools will become available, and most of the work will be oriented toward large datasets. More effort has to be put into training people how the algorithms work rather than just using them.

I think a more pervasive use of unmanned aerial vehicles (UAVs) would be amazing. Having them readily available running algorithms like CNNs to do feature recognition would allow us to have real-time information about natural disasters, riots and local events.

I can imagine how having a UAV in a large setting like the Coachella Valley Music and Arts Festival or a football stadium would allow the organizers to better control the flow of people in real time to prevent accidents.

Having a UAV using a CNN to track wildfires would allow us to have information on how they spread, and in some cases, how to stop or prevent them.

While the privacy implications are still a concern, I think there is much to gain from this technology, and mounting NVIDIA cards in them would be even better since we could do real-time image processing without the need to transmit video over Wi-Fi.

*Do you have a success story like Leon’s that involves GPUs? If so, comment below and tell us about your research – we love reading and sharing our developer’s stories!*

To learn more about deep learning on GPUs, visit the NVIDIA Deep Learning developer portal. Check out the related posts below, especially the intro to cuDNN: Accelerate Machine Learning with the cuDNN Deep Neural Network Library. Be sure to check out the cuDNN Webinar Recording: GPU-Accelerated Deep Learning with cuDNN. If you are interested in embedded applications of deep learning, check out the post Embedded Machine Learning with the cuDNN Deep Neural Network Library and Jetson TK1.

Often when profiling GPU-accelerated applications that run on clusters, one needs to visualize MPI (Message Passing Interface) calls on the GPU timeline in the profiler. While tools like Vampir and Tau will allow programmers to see a big picture view of how a parallel application performs, sometimes all you need is a look at how MPI is affecting GPU performance on a single node using a simple tool like the NVIDIA Visual Profiler. With the help of the NVIDIA Tools Extensions (NVTX) and the MPI standard itself, this is pretty easy to do.

The NVTX API lets you embed information within a GPU profile, such as marking events or annotating ranges in the timeline with details about application behavior during that time. Jiri Kraus wrote past posts about generating custom application timelines with NVTX, and about using it to label individual MPI ranks in MPI profiles. In this post I’ll show you how to use an NVTX range to annotate the time spent in MPI calls. To do this, we’ll use the MPI profiling interface (PMPI), which is a standard part of MPI. PMPI allows tools to intercept calls to the MPI library to perform actions before or after the MPI call is executed. This means that we can insert NVTX calls into our MPI library calls to mark MPI calls on the GPU timeline.
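Conceptually, a PMPI wrapper is just call interception: run some tool code, forward to the real routine, then run some more tool code on the way out. The following Python sketch mimics that flow in a language-neutral way; the `nvtx_push`/`nvtx_pop` helpers and the `ranges` list are made-up stand-ins for the real NVTX API, used here only to illustrate the interception pattern:

```python
ranges = []  # stand-in for the profiler timeline

def nvtx_push(name):   # hypothetical stand-in for nvtxRangePushEx
    ranges.append(("push", name))

def nvtx_pop():        # hypothetical stand-in for nvtxRangePop
    ranges.append(("pop",))

def wrap_with_range(fn):
    """Intercept a call the way a PMPI wrapper does:
    open a range, forward to the real routine, close the range."""
    def wrapper(*args, **kwargs):
        nvtx_push(fn.__name__)
        try:
            return fn(*args, **kwargs)   # the {{callfn}} step, in wrap.py terms
        finally:
            nvtx_pop()
    return wrapper

def MPI_Allreduce(values):  # placeholder for the real library routine
    return sum(values)

# The tool swaps in its wrapper; callers are none the wiser.
MPI_Allreduce = wrap_with_range(MPI_Allreduce)
print(MPI_Allreduce([1, 2, 3]))  # 6, with a push/pop recorded around the call
```

PMPI achieves the same effect at link time: the tool defines `MPI_Send` and friends, and forwards to the real implementation via the `PMPI_`-prefixed entry points.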

Wrapping every MPI routine in this way is a bit tedious, but fortunately there’s a tool to automate the process. We’ll use the `wrap.py` script found at https://github.com/scalability-llnl/wrap to generate the PMPI wrappers for a number of commonly used MPI routines. The input file for this script is the following (also available as a github gist):

```c
#include <nvToolsExt.h>
#include <nvToolsExtCudaRt.h>
#include <pthread.h>

// Setup event category name
{{fn name MPI_Init}}
  nvtxNameCategoryA(999, "MPI");
  {{callfn}}
  int rank;
  PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
  char name[256];
  sprintf(name, "MPI Rank %d", rank);
  nvtxNameOsThread(pthread_self(), name);
  nvtxNameCudaDeviceA(rank, name);
{{endfn}}

// Wrap select MPI functions with NVTX ranges
{{fn name MPI_Send MPI_Recv MPI_Allreduce MPI_Reduce MPI_Wait MPI_Waitany
MPI_Waitall MPI_Waitsome MPI_Gather MPI_Gatherv MPI_Scatter MPI_Scatterv
MPI_Allgather MPI_Allgatherv MPI_Alltoall MPI_Alltoallv MPI_Alltoallw
MPI_Bcast MPI_Sendrecv MPI_Barrier MPI_Start MPI_Test MPI_Send_init
MPI_Recv_init}}
  nvtxEventAttributes_t eventAttrib = {0};
  eventAttrib.version = NVTX_VERSION;
  eventAttrib.size = NVTX_EVENT_ATTRIB_STRUCT_SIZE;
  eventAttrib.messageType = NVTX_MESSAGE_TYPE_ASCII;
  eventAttrib.message.ascii = "{{name}}";
  eventAttrib.category = 999;
  nvtxRangePushEx(&eventAttrib);
  {{callfn}}
  nvtxRangePop();
{{endfn}}
```

So what’s happening in this file? First, it includes the NVTX header file, and then loops over a series of common MPI functions, inserting the beginning of an NVTX range (`nvtxRangePushEx`) on entry and ending the range (`nvtxRangePop`) as we leave the MPI routine. For convenience, I’ve named the range after the MPI routine being called. All I need to do now is call `wrap.py` to generate a C file with my PMPI wrappers, which I’ll then build with my MPI C compiler.

```
$ python wrap/wrap.py -g -o nvtx_pmpi.c nvtx.w
$ mpicc -c nvtx_pmpi.c
```

Now I just need to rerun my code with these wrappers. To do this I’ll relink my application with the object file I just built and the NVTX library (libnvToolsExt). As an example, I’ll use the simple Jacobi Iteration used in the GTC session Multi GPU Programming with MPI, which you can find on Github. Once I’ve built both the application and the wrappers generated above, I run the executable as follows.

```
$ mpicc -fast -ta=tesla -Minfo=all $HOME/nvtx_pmpi.o laplace2d.c -L$CUDA_HOME/lib64 -lnvToolsExt -o laplace2d
$ MV2_USE_CUDA=1 mpirun -np 2 nvprof -o laplace2d.%q{MV2_COMM_WORLD_RANK}.nvvp ./laplace2d
```

*One word of caution: the linking order does matter when using tools such as PMPI, so if you run your code and are not seeing the expected results, the object file containing the wrappers may not appear early enough in the build command.*

In the above commands I’m rebuilding my code with the necessary bits. I’m also setting `MV2_USE_CUDA` at runtime to enable CUDA-awareness in my MVAPICH library. Additionally, I’m informing nvprof to generate a timeline file per MPI process by passing the `MV2_COMM_WORLD_RANK` environment variable to nvprof, which is defined to equal the MPI rank of each process. Figure 1 is the result of importing one of the resulting nvprof output files into Visual Profiler and then zooming in to an area of interest.

Looking in the “Markers and Ranges” row of the GPU timeline for MPI Rank 0, we see three green boxes denoting two calls to MPI_Sendrecv and one to MPI_Allreduce. Furthermore, we can see that the MPI library is using a device-to-device memcpy operation to communicate between two GPUs on the same node. As you can see, the NVIDIA Visual Profiler, combined with PMPI and NVTX can give you interesting insights into how the MPI calls in your application interact with the GPU.

[Note: *Lung Sheng Chien from NVIDIA also contributed to this post.*]

A key bottleneck for most science and engineering simulations is the solution of sparse linear systems of equations, which can account for up to 95% of total simulation time. There are two types of solvers for these systems: iterative and direct solvers. Iterative solvers are favored for the largest systems these days (see my earlier posts about AmgX), while direct solvers are useful for smaller systems because of their accuracy and robustness.

CUDA 7 expands the capabilities of GPU-accelerated numerical computing with cuSOLVER, a powerful new suite of direct linear system solvers. These solvers provide highly accurate and robust solutions for smaller systems, and cuSOLVER offers a way of combining many small systems into a ‘batch’ and solving all of them in parallel, which is critical for the most complex simulations today. Combustion models, bio-chemical models and advanced high-order finite-element models all benefit directly from this new capability. Computer vision and object detection applications need to solve many least-squares problems, so they will also benefit from cuSOLVER.

Direct solvers rely on algebraic factorization of a matrix, which breaks a hard-to-solve matrix into two or more easy-to-solve factors, and a solver routine which uses the factors and a right hand side vector and solves them one at a time to give a highly accurate solution. Figure 1 shows an example of factorization of a dense matrix. A solver for this factorization would first solve the transpose of L part, then apply the inverse of the D (diagonal) part in parallel, then solve again with L to arrive at the final answer. The benefit of direct solvers is that (unlike iterative solvers), they always find a solution (when the factors exist; more on this later) and once a factorization is found, solutions for many right-hand sides can be performed using the factors at a much lower cost per solution. Also, for small systems, direct solvers are typically faster than iterative methods because they only pass over the matrix once.
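To make the solve phase concrete, here is a toy dense sketch (mine, not cuSOLVER’s API) of using precomputed L and D factors to solve L·D·Lᵀ·x = b: a forward substitution with L, an embarrassingly parallel diagonal scaling by 1/D, and a back substitution with Lᵀ:

```python
def solve_ldlt(L, D, b):
    """Solve (L D L^T) x = b given unit lower-triangular L (list of rows)
    and the diagonal D (list): forward solve, diagonal scale, back solve."""
    n = len(b)
    # Forward substitution: L y = b
    y = [0.0] * n
    for i in range(n):
        y[i] = b[i] - sum(L[i][j] * y[j] for j in range(i))
    # Diagonal scaling: z = D^{-1} y (each entry independent, hence parallel)
    z = [y[i] / D[i] for i in range(n)]
    # Back substitution: L^T x = z
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = z[i] - sum(L[j][i] * x[j] for j in range(i + 1, n))
    return x

# 2x2 example: with L = [[1,0],[2,1]] and D = [4, 9],
# A = L D L^T = [[4, 8], [8, 25]]; choosing x = [1, 1] gives b = [12, 33].
print(solve_ldlt([[1.0, 0.0], [2.0, 1.0]], [4.0, 9.0], [12.0, 33.0]))  # [1.0, 1.0]
```

Once the factors are computed, each additional right-hand side costs only these three cheap triangular/diagonal passes, which is why direct solvers amortize so well over many solves.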

In this post I give an overview of cuSOLVER followed by an example of using batch QR factorization for solving many sparse systems in parallel. In a followup post I will cover other aspects of cuSOLVER, including dense system solvers and the cuSOLVER refactorization API.

The cuSOLVER library provides factorizations and solver routines for dense and sparse matrix formats, as well as a special re-factorization capability optimized for solving many sparse systems with the same, known, sparsity pattern and fill-in, but changing coefficients. A goal for cuSOLVER is to provide some of the key features of LAPACK on the GPU, as users commonly request LAPACK capabilities in CUDA libraries. cuSOLVER has three major components: cuSolverDN, cuSolverSP and cuSolverRF, for Dense, Sparse and Refactorization, respectively.

Let’s start with cuSolverDN, the dense factorization library. These routines are the most like LAPACK; in fact, cuSOLVER implements the LAPACK API with only minor changes. cuSOLVER includes Cholesky factorization (`potrf`), LU factorization (`getrf`), QR factorization (`geqrf`) and Bunch-Kaufman factorization (`sytrf`), as well as GPU-accelerated triangular solves (`getrs`, `potrs`). For solving systems with QR factorization, cuSOLVER provides `ormqr` to compute the orthogonal columns of Q given A and R, and `getrs` to solve R. I’ll go into detail on these in a followup post.

cuSolverSP provides sparse factorization and solve routines based on QR factorization. QR can be used for solving linear systems and least-squares problems. QR factorization is very robust, and unlike LU factorization, it doesn’t rely on pivoting.

The cuSolverRF library can quickly update an existing LU factorization as the coefficients of the matrix change. This has application in chemical kinetics, combustion modeling and non-linear finite element methods. I’ll cover this more in a followup post.

For this post, I’ll take a deep look into sparse direct factorization, using the QR and batched QR features of cuSolverSP.

Solving a sparse linear system is not just an arithmetic problem. To achieve the best performance on a given architecture, a sparse direct solver needs to consider multiple factors, and take steps to minimize the difficulties each presents.

- **Minimize fill-in during factorization.** To take advantage of the sparse nature of the original matrix A, we have to avoid ‘fill-in’ of the matrix with new non-zeroes during the factorization. The more fill-in, the lower the performance, since we are creating more work with every new non-zero added. Furthermore, the fill-in can be so severe that a relatively small problem can consume all device memory, halting the factorization before we can get an answer. Reordering the matrix to reduce this fill-in can dramatically improve performance and increase the size of systems that we can solve using fixed resources.
- **Discover and exploit parallelism.** The symbolic analysis stage of a direct solver can build an ‘elimination tree’ of the matrix which shows independent paths of parallelism. The nodes of the tree with the same depth or ‘level’ can be computed simultaneously. Reordering the matrix may change the elimination tree, which changes the level of parallelism. Hence parallelism depends on the matrix structure once a reordering is given, and if a matrix cannot be reordered to extract parallelism, then maximum performance cannot be attained on modern many-core processors.
- **Memory efficiency.** Memory access patterns during numerical factorization are irregular, resulting in incoherent memory accesses which cannot achieve the potentially high bandwidth of a GPU. This is a key performance limiter in direct solvers.
- **Index computation.** Typically, indirect addresses are used to compute matrix factorizations, and addition operations are needed to compute the addresses of individual entries in the matrix storage format (dense or sparse). The data needed to compute these indices typically comes from global memory, which increases access latency.

Graph theory plays an important role in calculations to reorder matrices in direct solvers. We would like to minimize fill-in and predict the final sparsity pattern of the numerical factorization. Both of these goals can be addressed using a ‘symbolic’ factorization—an analysis stage before the actual arithmetic is performed—plus a graph heuristic re-ordering of the columns of the matrix to minimize fill-in. The graph of a symmetric matrix like the matrix *A* in Figure 2 is *undirected*, meaning that for every connection of row *i* to column *j* there is a *symmetric* connection from row *j* to column *i*.

This means that every connection is a two-way street, and as shown in Figure 3, we don’t add arrows to connections in the graph, lines are enough. We can also store only half of the matrix, as the *sparsity diagram* in Figure 2 shows. This has many advantages, not least of which is we generally have only half as much work to do to factor a symmetric matrix because the factors are symmetric. You may have noticed that the factorization in Figure 1 is symmetric.

For non-symmetric matrices like the one in Figure 4, the graph is a *directed* graph, meaning each connection is a one-way street. For a directed graph we have to draw arrows for each connection to indicate which way it flows. This is the general case, and as you can see by the sparsity diagram on the left, we have to store potentially more information, and can’t take advantage of entries being repeated symmetrically. In this case we have to do the full amount of work, no short cuts.

Graph algorithms like Depth First Search (DFS) can be directly applied to the graph incarnation of the matrix. Finding a good fill-reducing ordering for a given matrix involves several calls to DFS. Two common algorithms in this class are Reverse Cuthill-McKee (RCM) for symmetric systems and Approximate Minimum Degree (AMD) for non-symmetric systems. An upcoming update to cuSOLVER will provide these ordering routines. By reducing the fill-in, we arrive at the same solution faster and using much less memory.
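As a rough illustration of what such an ordering does, here is a minimal pure-Python sketch of reverse Cuthill-McKee (a simplified toy, not cuSOLVER’s implementation; production versions add pseudo-peripheral start-node heuristics). It runs a BFS from a minimum-degree node, visiting neighbors in order of increasing degree, then reverses the visit order:

```python
from collections import deque

def reverse_cuthill_mckee(adj):
    """Toy RCM: BFS from a minimum-degree node, neighbors visited in
    increasing-degree order, then the ordering is reversed."""
    n = len(adj)
    visited = [False] * n
    order = []
    for start in sorted(range(n), key=lambda v: len(adj[v])):
        if visited[start]:
            continue
        visited[start] = True
        queue = deque([start])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in sorted(adj[v], key=lambda u: len(adj[u])):
                if not visited[w]:
                    visited[w] = True
                    queue.append(w)
    return order[::-1]  # the "reverse" in RCM

# A path graph 0-1-2-3 given with scrambled labels: edges (2,0), (0,3), (3,1).
adj = {0: [2, 3], 1: [3], 2: [0], 3: [0, 1]}
print(reverse_cuthill_mckee(adj))  # [2, 0, 3, 1]: consecutive along the path, bandwidth 1
```

Relabeling the rows and columns in this order clusters the non-zeroes near the diagonal, which is exactly what reduces fill-in during a subsequent factorization.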

You might remember my earlier comment about “when the factors exist”. To see what can go wrong, let’s make an analogy to high-school algebra. We want to factor an equation x² + bx + c = 0, so that we get two parts (x − r₁)(x − r₂) = 0. In this scalar world, a real solution exists as long as b² − 4c ≥ 0, and we know the formula: r = (−b ± √(b² − 4c)) / 2.

In linear algebra, we have the same possibility, the matrix could just be ‘unsolvable’, meaning it doesn’t contain the information we need to find *x* uniquely. We would see this during factorization as a breakdown of the algorithm where it would need to divide by zero, or by a number so numerically close to zero that the computer treats it as zero. To avoid this we can try several approaches, the most direct of which is called ‘pivoting’, or re-ordering the columns of the matrix so a non-zero value is used in the division instead of the zero value. We still get the same answer in the end, but we have to do some book-keeping to remember what columns have been swapped.

In certain circumstances, performance is limited by pivoting. For example, the pivoting strategy in LU factorization requires searching each row that needs to pivot for a ‘valid’ column. The row searches are sequentially dependent: each row search must wait for all previous rows to be finished (columns to be swapped), which limits parallelization. We wish to avoid the need for pivoting, or reduce it as much as possible by reordering the matrix to move large-magnitude elements to the diagonal. This is known as ‘static pivoting’, and it can greatly increase parallelism in LU factorization. We can’t prove that this always works, but a combination of static pivoting and adding a small value on the diagonal when a numerically small pivot is encountered seems to be as robust as pivoting in tests. The price we pay to avoid work during the factorization is that we must do extra work in the solve phase to be sure the solution is accurate enough.
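The breakdown and the row-swap fix are easy to see on a small dense example. This pure-Python sketch of LU with partial pivoting (a toy illustration, unrelated to cuSOLVER’s internals) factors a matrix whose leading entry is zero, which would break the factorization without pivoting:

```python
def lu_partial_pivot(A):
    """In-place Doolittle LU with partial pivoting; returns (A, perm)
    where A holds L below the diagonal and U on/above it."""
    n = len(A)
    perm = list(range(n))
    for k in range(n):
        # Pick the largest-magnitude entry in column k as the pivot.
        p = max(range(k, n), key=lambda i: abs(A[i][k]))
        if A[p][k] == 0.0:
            raise ZeroDivisionError("matrix is singular")
        A[k], A[p] = A[p], A[k]          # the book-keeping: swap rows
        perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            A[i][k] /= A[k][k]
            for j in range(k + 1, n):
                A[i][j] -= A[i][k] * A[k][j]
    return A, perm

# A[0][0] == 0, so unpivoted LU would divide by zero on the first step;
# with pivoting, rows 0 and 1 are swapped and the factorization succeeds.
LU, perm = lu_partial_pivot([[0.0, 1.0], [2.0, 4.0]])
print(perm)  # [1, 0]
print(LU)    # [[2.0, 4.0], [0.0, 1.0]]
```

The sequential dependence the text describes is visible in the loop: each column’s pivot search must complete before the next column can proceed, which is what static pivoting tries to avoid.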

Once a matrix with a certain reordering is given, we can improve memory efficiency by clever factorization algorithms. Traditional “supernodal” or “multifrontal” algorithms can gather sparse data to form blocks and apply efficient BLAS2/BLAS3 operations. This improves memory efficiency on BLAS operations but does not improve the memory efficiency for gathering. If a supernode or front is big enough, the time spent in computation is much bigger than the time required for data transfer, and by overlapping the communication with computation the supernodal or multifrontal algorithm is a win.

In the case that the matrix does not have big supernodes or fronts, we need another way to improve performance.

If only one linear solve is needed, for a single right hand side or a single matrix, there is not much we can do. But if we have a large enough set of linear systems to solve, the situation is different.

Let’s consider a set of linear systems A_j x_j = b_j for j = 1, 2, …, N, where each nonsingular matrix A_j has the same sparsity pattern, and x_j and b_j are dense vectors. Such a set of systems is a good fit for a ‘batched’ linear solver. The assumption that each A_j has the same sparsity pattern implies the following advantages.

- Only one symbolic factorization stage is needed to predict the sparsity structure of the numerical factors independent of the numerical values in each matrix.
- Index computation overhead goes down because all batched operations share the same index to query the data.
- Each linear system can be solved independently, so parallelism comes from both the sparsity pattern of the matrix and trivially parallel batched operations.
- Memory efficiency can be improved by reorganizing the data layout in memory.

cuSOLVER provides batch QR routines to solve sets of sparse linear systems. cuSOLVER’s QR factorization is a simple ‘left-looking’ algorithm, not a supernodal or multifrontal method. However, with high memory efficiency and extra parallelism from batch operations, batch QR can reach peak bandwidth if the batch size N is large enough. In our experiments, we will find batch sizes where batch QR delivers decent performance.

Let’s walk through a use case for batch QR.

**Step 1:** prepare the matrices, right-hand-side vectors and solution vectors. cuSolverSp requires an ‘aggregation’ layout which concatenates the data one after another. For example, x = [x_1, x_2, …, x_N] is a 1D array of size m*N, and each x_j occupies a contiguous block in the memory buffer x. The matrices are stored as a 1D array of size nnz*N; that is, each A_j contributes a value array of size nnz in CSR format, where nnz is the number of non-zero elements in A_j.

**Step 2:** create an opaque `info` structure using `cusolverSpCreateCsrqrInfo()`.

**Step 3:** perform analysis using `cusolverSpXcsrqrAnalysisBatched()`, which extracts parallelism and predicts the sparsity of the to-be-created Q and R factors. After analysis, the amount of fill-in is known and we are able to estimate the maximum buffer size needed.

**Step 4:** find the size of the buffer needed by calling `cusolverSpDcsrqrBufferInfoBatched()`. There are two different buffers: one is an internal data buffer and the other is working space needed for the factorization. The internal buffer is created and kept in the opaque `info` structure. Its size depends on the sparsity of the matrix and the chosen batch size. You should verify that the available device memory is sufficient for the internal buffer; if it is not, you may need to reduce the batch size until the internal buffer fits in the available device memory. The working space, on the other hand, is volatile, so you can reuse it for other computations.

**Step 5:** allocate the working space explicitly using `cudaMalloc()`. The internal buffer is allocated automatically.

**Step 6:** solve the set of linear systems by calling `cusolverSpDcsrqrsvBatched()`, which solves A_j x_j = b_j for all systems in the batch in parallel: it internally performs the QR factorizations, solves all the systems, and returns the set of solutions.
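As a CPU-side analogue of the workflow above, the following NumPy sketch mimics the aggregation layout from Step 1 and the independent per-system solves from Step 6 (the sizes and the sparsity pattern are made up, and `np.linalg.solve` stands in for the batched QR):

```python
import numpy as np

# Batched layout: N small m-by-m systems A_j x_j = b_j sharing one sparsity
# pattern. The CSR index arrays are stored once; only the numerical values
# differ per system and are concatenated one after another.
m, N = 3, 4
row_ptr = np.array([0, 2, 4, 5])       # shared CSR structure (a single copy)
col_ind = np.array([0, 1, 1, 2, 2])
nnz = col_ind.size

rng = np.random.default_rng(0)
vals = rng.uniform(1.0, 2.0, size=nnz * N)  # values of A_j at vals[j*nnz:(j+1)*nnz]
b = rng.uniform(size=m * N)                 # b_j occupies b[j*m:(j+1)*m]
x = np.empty(m * N)                         # x_j will occupy x[j*m:(j+1)*m]

def csr_to_dense(v):
    """Scatter one system's value array into a dense matrix (illustration only)."""
    A = np.zeros((m, m))
    for i in range(m):
        for p in range(row_ptr[i], row_ptr[i + 1]):
            A[i, col_ind[p]] = v[p]
    return A

# Every system is independent, which is exactly the parallelism a batched
# solver exploits; here we simply solve them one by one on the CPU.
for j in range(N):
    A_j = csr_to_dense(vals[j * nnz:(j + 1) * nnz])
    x[j * m:(j + 1) * m] = np.linalg.solve(A_j, b[j * m:(j + 1) * m])

for j in range(N):
    A_j = csr_to_dense(vals[j * nnz:(j + 1) * nnz])
    assert np.allclose(A_j @ x[j * m:(j + 1) * m], b[j * m:(j + 1) * m])
```

Because `row_ptr` and `col_ind` are shared, index computation and symbolic analysis happen once for the whole batch, exactly as the advantages listed above describe.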

We tested the performance of batched QR factorization on 25 matrices from the Florida Sparse Matrix Collection (http://www.cise.ufl.edu/research/sparse/matrices/ ). The matrices cover a range of sizes and sparsity patterns: some small matrices that can fit into L2 cache; some matrices with large zero fill-in; some matrices with very little parallelism.

Figure 5 compares the performance of a single QR factorization with the performance of batch QR with a batch size of 32. The speedup is the ratio between 32×Time(QR) and Time(batch QR). The x-axis is ordered roughly by density of fill-in. We see good speedups, up to 24x, when the density of fill-in is small. As the density of fill-in goes up, the performance of batch QR drops because of increased pressure on memory I/O, which reduces the ratio of computation to communication.

Figure 2 shows the performance of batch QR for three different batch sizes: 32, 64 and 128. It is clear that batch QR performs excellently when the fill-in is small. In this regime, the runtime is nearly the same regardless of batch size until we reach the peak bandwidth possible on the device; this is a consequence of “the bigger the batch, the more parallelism available”.

Direct solvers are a powerful new feature in CUDA 7.0. Modern GPU devices have huge bandwidth and floating-point throughput, but sparse direct solvers are a challenge for them because of insufficient parallelism, random access patterns and fill-in of the sparsity pattern. We demonstrated that the batch QR factorization approach naturally increases parallelism and improves memory access efficiency, especially when fill-in is not severe. Batch QR can also be used to solve eigenvalue problems, and to solve many least-squares problems in parallel. Please tell us about your uses for cuSOLVER in the comments!

To get started with cuSOLVER, first download and install the CUDA Toolkit version 7.0. There are more detailed batch QR examples available in the online CUDA documentation. You should also check out the session “Jacobi-Davidson Eigensolver in cuSOLVER Library” from GTC 2015. Try cuSOLVER today!

As you are probably aware, CUDA 7 was officially released during the 2015 GPU Technology Conference. For this Spotlight I took a few minutes to pick the brain of an early adopter of CUDA 7 to see how his work benefits from the new C++11 support.

I interviewed Yu-Hang Tang, a Ph.D. candidate in the Division of Applied Mathematics at Brown University in Providence, Rhode Island.

At this moment we are finalizing a particle-based simulator for the *in silico* investigation of microfluidic devices used in cancer diagnostics. The code enables us to predict the behavior of cancer cells as well as blood cells in various microfluidic channels. It could significantly speed up the process of microfluidic device design, which is usually time-consuming due to the large amount of trial-and-error experimentation.

We will release the work by end of April and I will be happy to talk about more details by that time.

I started GPU programming with OpenCL on GeForce GTX 460 GPUs in 2010, and in 2012 I shifted entirely to CUDA C++.

Right now, I use mostly Kepler GPUs with high double-precision floating-point performance. I have been focused on accelerating particle-based simulations including All-Atom Molecular Dynamics (AAMD), Dissipative Particle Dynamics (DPD) and Smoothed Particle Hydrodynamics (SPH).

In fact, I have developed an entire GPU package, **USER-MESO**, for the LAMMPS (Large-scale Atomic/Molecular Massively Parallel Simulator) particle simulator for DPD and SPH simulations. The package achieves a 20x to 30x speedup on a single K20 GPU over 16 AMD CPU cores on a Cray XK7 compute node.

Our **USER-MESO** package allows us to simulate DPD systems containing several million particles for millions of time steps on a daily basis during our study of the self-assembly behavior of amphiphilic polymers. The multi-compartment, multi-walled vesicle shown in Figure 1, which you can think of as a miniature cell, is only observable at a spatial-temporal scale tens of times larger, and tens of times longer, than that covered by typical contemporary DPD simulations.

I use CUDA C++ and I design customized math routines for DPD simulations using hand-coded PTX assembly. The template metaprogramming feature of CUDA C++ also turns out to be handy for writing concise and efficient codes.

Streams, zero-copy memory, texture objects, PTX (parallel thread execution) assembly, warp-level vote/shuffle, template programming.

C++11. The benefit is two-fold:

- I can now directly compile device code mixed with C++11 host code. Our host-side C++11 code is a generic library that we use to concurrently couple the GPU-based particle solver with CPU-based continuum solvers. In earlier CUDA releases nvcc would fail if the host code contained C++11 syntax, and simply passing `-Xcompiler -std=c++11` to the host compiler could not solve it. As a result I had to compile the host code separately and then link it with the device object files. With CUDA 7 I can simply put everything in one file and compile it in a single shot.
- It allows me to partially specialize device functions, pass lambdas, set template default arguments, and so on. This greatly improves my coding productivity.

Figures 2 and 3 are from a system of vesicles spontaneously assembled from amphiphilic polymers in aqueous solution. The result, together with the **USER-MESO** code and algorithm, is published in: Tang, Yu-Hang, and George Em Karniadakis. “Accelerating dissipative particle dynamics simulations on GPUs: Algorithms, Numerics and Applications.”

In **USER-MESO** I invented a warp-synchronous neighbor-list construction algorithm that allows the neighbor list for a particle to be built deterministically in parallel by all the threads within a warp without using any atomic operations. This actually makes neighbor searching much faster than evaluating pairwise forces, whereas traditionally the search takes longer. For details and visualization of the algorithm you can check out my slides from the 2014 GPU Technology Conference.

And yes, our **USER-MESO** code is open source; you can find it on my wiki page.

At the system level I look forward to GPUs that are more tightly coupled to other system parts like CPUs, RAM, interconnects and drives. In terms of the CUDA architecture I think configurable warp size and a faster and bigger non-coherent cache would benefit applications in both my area of interest and many other algorithms.

- The Power of C++11 in CUDA 7
- CUDA 7 Release Candidate Feature Overview: C++11, New Libraries, and More
- C++11 in CUDA: Variadic Templates

Download CUDA 7 today! If you have tried it, please comment below and let us know your thoughts.

Image recognition and GPUs go hand-in-hand, particularly when using deep neural networks (DNNs). The strength of GPU-based DNNs for image recognition has been unequivocally demonstrated by their success over the past few years in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), and DNNs have recently achieved classification accuracy on par with trained humans, as Figure 1 shows. The new Low-Power Image Recognition Challenge (LPIRC) highlights the importance of image recognition on mobile and embedded devices.

DNNs with convolutional layers are biologically inspired artificial neural networks. These networks may have five or more layers with many neurons in each layer. Links similar to synapses connect the layers, forwarding information to the next layer. The training process adjusts the weights on the links, improving the network’s ability to classify the information presented to it. The more data used to train a DNN, the better its classification performance. This big-data requirement has resulted in heavy GPU use, because GPUs are designed for high throughput on highly parallel computations like those used in deep learning.
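To make the “adjust weights on the links” idea concrete, here is a deliberately tiny NumPy sketch (toy data and a single sigmoid unit; nothing here is from any DNN framework) of the training loop that deep learning systems run at vastly larger scale:

```python
import numpy as np

# A miniature "network": one layer of weights trained by gradient descent.
# Each update nudges the weights so the outputs better match the labels --
# the same adjust-weights-from-data loop that DNN training performs at scale.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))                  # toy inputs
y = (X[:, 0] + X[:, 1] > 0).astype(float)      # toy labels

w = np.zeros(2)
b = 0.0
for _ in range(500):                           # training loop
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))     # forward pass (sigmoid "neuron")
    grad_w = X.T @ (p - y) / len(y)            # backward pass: loss gradient
    grad_b = np.mean(p - y)
    w -= 0.5 * grad_w                          # adjust the "synapse" weights
    b -= 0.5 * grad_b

acc = np.mean((p > 0.5) == y)
assert acc > 0.9   # more data and more training generally improve this
```

A real convolutional network repeats this forward/backward pattern across many layers and millions of weights, which is why the highly parallel arithmetic of GPUs pays off so dramatically.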

ImageNet is a great resource for imagery, hosting a large database of images organized according to a hierarchy of descriptive nouns. Each year, ImageNet hosts the ILSVRC, for which entrants develop algorithms for accurately recognizing objects in the images. ImageNet provides a large image set of over 1.2 million images from 1000 different object categories for training recognition algorithms. Academic as well as industrial participants have performed strongly, with competitors from Google, Stanford University, University of California, Berkeley, and Adobe (among many others) in recent years.

To motivate improved image recognition on low-power devices, Yung-Hsiang Lu, Associate Professor of Electrical and Computer Engineering at Purdue University, and Alex Berg, Assistant Professor of Computer Science at UNC Chapel Hill, are organizing the Low-Power Image Recognition Challenge (LPIRC), a competition focused on identifying the best technology in both image recognition and energy conservation. Registration for the LPIRC is now open.

Achieving high performance while maintaining low power can be challenging, as these two parameters often increase together. Last year NVIDIA released the Jetson TK1 Development Kit, a low-power GPU-accelerated computing platform that is well-suited for image processing and computer vision applications. Jetson TK1’s low power requirements and image processing capabilities will make it a popular platform for LPIRC competitors.

Exemplary tasks for the LPIRC include locating and classifying objects in imagery, including identifying multiple objects. Many ILSVRC teams in the annual ImageNet competition use CNNs, and they are likely to be one of the machine learning architectures deployed for the LPIRC. Visit the LPIRC website for information on important dates and competition tasks.

Jetson TK1 will be a great asset for teams in this competition, with peak power demands of under 12.5 Watts. Jetson TK1 supports CUDA, cuDNN, OpenCV and popular deep learning frameworks like Caffe and Torch. Jetson is essentially a mini-supercomputer that can be used directly with a monitor, keyboard, and mouse or over an ssh connection, and it comes pre-installed with Ubuntu Linux, so getting started is easy.

As a sponsor of the LPIRC, NVIDIA is offering a free Jetson TK1 developer kit to participating teams, and each winner will receive an NVIDIA GPU. If your team would like to use a Jetson TK1 for the LPIRC, fill out this application form. NVIDIA will review proposals and provide TK1 DevKits to selected applicants.

It is easy to port trained networks to the Jetson TK1 and perform classification. Figure 1 shows my setup. For extra disk space, I added an external hard drive connected over the SATA interface. I use a 4-port USB 3.0 hub to connect a mouse, a keyboard, and more hard drives.

I built Caffe on my Jetson TK1 so I can use it for classification. Caffe’s web demo example is a simple and easy way to test it out. The web demo runs a web server on the Jetson TK1, and requires the BVLC reference CaffeNet auxiliary data, which includes a pretrained network. Once built and running, access the demo from any device on your local network using the Jetson TK1’s IP address on port 5000 (e.g. `192.168.1.5:5000`). You can run the web demo with GPU acceleration using the `-g` flag:

python examples/web_demo/app.py -g

I used the demo to classify five images with and without GPU acceleration to estimate potential classification speedup with Tegra K1’s GPU vs. its CPU, as shown in the following table.

| Image Classified | GPU (ms) | w/o GPU (ms) | speed-up (x) |
|---|---|---|---|
| Cross-eyed-cat | 1316 | 5553 | 4.2 |
| Ship at sea | 1010 | 5571 | 5.5 |
| Barn | 1351 | 5791 | 4.3 |
| Puppy | 1568 | 5677 | 3.6 |
| Pirate ship | 1385 | 5651 | 4.1 |
| Average | 1326 | 5649 | 4.3 |
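The bottom row of the table follows directly from the five measurements:

```python
# Reproducing the table's bottom line: mean times and overall speedup.
gpu = [1316, 1010, 1351, 1568, 1385]
cpu = [5553, 5571, 5791, 5677, 5651]

avg_gpu = sum(gpu) / len(gpu)
avg_cpu = sum(cpu) / len(cpu)
print(round(avg_gpu), round(avg_cpu), round(avg_cpu / avg_gpu, 1))
# 1326 5649 4.3
```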

Registration for the LPIRC is now open. The competition will be held on June 7, 2015 at the Design Automation Conference, so get your team together now, and fill out the application for a free NVIDIA Jetson TK1 development board for your team.

Six scientific computing teams from around the world spent an intense week late last year porting their applications to GPUs using OpenACC directives. The Oak Ridge Leadership Computing Facility (OLCF) hosted its first ever OpenACC Hackathon in Knoxville, Tennessee. Paired with two GPU mentors, each team of scientific developers set forth on the journey to accelerate their code with GPUs.

Dr. Misun Min, a computational scientist at Argonne National Laboratory, led the NekCEM Team and she shared the results of accelerating NekCEM with OpenACC and NVIDIA GPUDirect™ communication.

I have only six months of experience now, and at the time of the Hackathon I didn’t really have any. The other members included Matthew Otten from Cornell University (six months of GPU computing experience), Jing Gong from KTH in Sweden (two years of OpenACC experience), and Azamat Mametjanov from Argonne. The team also had useful discussions with Nek5000 developer Paul Fischer at UIUC.

Two mentors from Cray Inc.: Aaron Vose and John Levesque. Aaron and John provided strong technical support to boost the performance of a GPU-enabled NekCEM version.

NekCEM (Nekton for Computational ElectroMagnetics) is an open-source code designed for predictive modeling of electromagnetic systems, such as linear accelerators, semiconductors, plasmonic devices, and quantum systems described by the Maxwell, Helmholtz, drift-diffusion, and Schrödinger or density matrix equations. The code is based on high-order discretizations of the underlying partial differential equations using spectral element (SE) and spectral-element discontinuous Galerkin (SEDG) schemes that have been shown to require order-of-magnitude fewer grid points than do conventional low-order schemes for the same accuracy. NekCEM uses globally unstructured meshes comprising body-fitted curvilinear hexahedral elements, which allow the discrete operators to be expressed as matrix-matrix products applied to arrays of the tensor product basis of Lagrange interpolation polynomials on the Gauss-Lobatto-Legendre quadrature points. The tight coupling of the degrees of freedom within elements leads to efficient data reuse while requiring boundary-minimal (unit-depth-stencil) data communication to effect flux exchanges between neighboring elements.

The team had two goals: (1) to develop a high-performance GPU-based operational variant of NekCEM that supports the full functionality of the existing CPU-only code in Fortran/C and (2) to perform analysis to find performance bottlenecks and infer potential scalability for GPU-based architectures of the future.

OpenACC was chosen as a strategy for porting NekCEM to multiple GPUs because of the relative ease of the pragma-based programming model. During the Hackathon, significant efforts included the development of an OpenACC implementation of the local gradient (see Listing 1) and spectral element curl operator (see Listing 2) for solving the Maxwell equations and a tuned GPUDirect gather-scatter kernel for nearest-neighbor flux exchanges (see Listing 3).

    !$ACC DATA PRESENT(u1r,u1s,u1t,u2r,u2s,u2t,u3r,u3s,u3t)
    !$ACC& PRESENT(u1,u2,u3,d,dtrans)
          p1=dclock()-p0
          ptime=ptime+p1
    !$ACC PARALLEL LOOP COLLAPSE(4) GANG WORKER VECTOR
    !$ACC& private(tmpr1,tmpr2,tmpr3,tmps1,tmps2,tmps3,
    !$ACC& tmpt1,tmpt2,tmpt3)
    !dir$ NOBLOCKING
          do e = 1,nelt
            do k = 1,nz1
              do j = 1,ny1
                do i = 1,nx1
                  tmpr1 = 0.0
                  tmpr2 = 0.0
                  tmpr3 = 0.0
                  tmps1 = 0.0
                  tmps2 = 0.0
                  tmps3 = 0.0
                  tmpt1 = 0.0
                  tmpt2 = 0.0
                  tmpt3 = 0.0
    !$ACC LOOP SEQ
                  do l=1,nx1
                    tmpr1=tmpr1+d(i,l)*u1(l,j,k,e)
                    tmpr2=tmpr2+d(i,l)*u2(l,j,k,e)
                    tmpr3=tmpr3+d(i,l)*u3(l,j,k,e)
                    tmps1=tmps1+d(j,l)*u1(i,l,k,e)
                    tmps2=tmps2+d(j,l)*u2(i,l,k,e)
                    tmps3=tmps3+d(j,l)*u3(i,l,k,e)
                    tmpt1=tmpt1+d(k,l)*u1(i,j,l,e)
                    tmpt2=tmpt2+d(k,l)*u2(i,j,l,e)
                    tmpt3=tmpt3+d(k,l)*u3(i,j,l,e)
                  enddo
                  u1r(i,j,k,e) = tmpr1
                  u2r(i,j,k,e) = tmpr2
                  u3r(i,j,k,e) = tmpr3
                  u1s(i,j,k,e) = tmps1
                  u2s(i,j,k,e) = tmps2
                  u3s(i,j,k,e) = tmps3
                  u1t(i,j,k,e) = tmpt1
                  u2t(i,j,k,e) = tmpt2
                  u3t(i,j,k,e) = tmpt3
                enddo
              enddo
            enddo
          enddo
    !$ACC END PARALLEL LOOP
    !$ACC END DATA
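The kernel in Listing 1 is, at heart, a batched tensor contraction: each local derivative is a small matrix product with the 1D differentiation matrix *d*. A NumPy sketch with hypothetical sizes (`nx` and `nelt` are made up) shows the same computation for one field:

```python
import numpy as np

# Listing 1 computes, for every element e, derivatives along the three local
# directions as contractions with the 1D differentiation matrix d:
#   u1r(i,j,k,e) = sum_l d(i,l) * u1(l,j,k,e)   (and similarly for s and t).
nx, nelt = 4, 7
rng = np.random.default_rng(1)
d = rng.normal(size=(nx, nx))
u1 = rng.normal(size=(nx, nx, nx, nelt))

u1r = np.einsum('il,ljke->ijke', d, u1)   # derivative in the r direction
u1s = np.einsum('jl,ilke->ijke', d, u1)   # derivative in the s direction
u1t = np.einsum('kl,ijle->ijke', d, u1)   # derivative in the t direction

# cross-check one entry against the explicit loop from the listing
i, j, k, e = 1, 2, 3, 4
assert np.isclose(u1r[i, j, k, e],
                  sum(d[i, l] * u1[l, j, k, e] for l in range(nx)))
```

This dense, regular structure is exactly why the tight coupling of degrees of freedom within elements maps so well onto GPU threads.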

    !$ACC DATA PRESENT(u1r,u1s,u1t,u2r,u2s,u2t,u3r,u3s,u3t,w1,w2,w3)
    !$ACC& PRESENT(w3mn,rxmn,sxmn,txmn,rymn,symn,tymn,rzmn,szmn,tzmn)
          p1=dclock()-p0
          ptime=ptime+p1
    !$ACC PARALLEL LOOP COLLAPSE(4) GANG WORKER VECTOR
          do e = 1,nelt
            do k = 1,nz1
              do j = 1,ny1
                do i = 1,nx1
                  w1(i,j,k,e)= (u3r(i,j,k,e)*rymn(i,j,k,e)
         $                   + u3s(i,j,k,e)*symn(i,j,k,e)
         $                   + u3t(i,j,k,e)*tymn(i,j,k,e)
         $                   - u2r(i,j,k,e)*rzmn(i,j,k,e)
         $                   - u2s(i,j,k,e)*szmn(i,j,k,e)
         $                   - u2t(i,j,k,e)*tzmn(i,j,k,e))*w3mn(i,j,k)
                  w2(i,j,k,e)= (u1r(i,j,k,e)*rzmn(i,j,k,e)
         $                   + u1s(i,j,k,e)*szmn(i,j,k,e)
         $                   + u1t(i,j,k,e)*tzmn(i,j,k,e)
         $                   - u3r(i,j,k,e)*rxmn(i,j,k,e)
         $                   - u3s(i,j,k,e)*sxmn(i,j,k,e)
         $                   - u3t(i,j,k,e)*txmn(i,j,k,e))*w3mn(i,j,k)
                  w3(i,j,k,e)= (u2r(i,j,k,e)*rxmn(i,j,k,e)
         $                   + u2s(i,j,k,e)*sxmn(i,j,k,e)
         $                   + u2t(i,j,k,e)*txmn(i,j,k,e)
         $                   - u1r(i,j,k,e)*rymn(i,j,k,e)
         $                   - u1s(i,j,k,e)*symn(i,j,k,e)
         $                   - u1t(i,j,k,e)*tymn(i,j,k,e))*w3mn(i,j,k)
                enddo
              enddo
            enddo
          enddo
    !$ACC END PARALLEL LOOP
    !$ACC END DATA

    //* (1) local gather *//
    for(k=0;k<…
    #pragma acc parallel loop gang vector present(u[0:uds],map[0:m_size],mapf[0:m_nt*2]) private(i,j,t) async(k+1)
    for(i=0;i<…

We examined scaling limits for both GPU and CPU runs as a function of problem size (measured in the number of grid points, *n*). The columns and rows of dots in Figures 1 and 2 correspond to strong and weak scaling, respectively, demonstrating that the lower bound for effective GPU-based solution of Maxwell’s equations with the SEDG formulation is approximately *n* = 10^{5}. We also explored multi-GPU scalability limits on up to 16,384 GPUs on Titan, a Cray XK7 and the nation’s most powerful supercomputer for open science.

The effort resulted in a twofold speedup over a highly tuned CPU-only version of the code on the same number of nodes (262,144 MPI ranks) for problem sizes of up to 6.9 billion grid points (see Figure 3). Run-time power consumption data for isolated two-cabinet jobs (192 nodes) demonstrated that the GPU runs required only 39% of the energy needed for comparable CPU runs (see Figure 4).

Want to learn more? Check out these related sessions from GTC 2015.

- Misun and a few of the other OpenACC Hackathon teams shared their experience – the recording is now live.
- Aaron Vose, benchmark and application analyst at Cray Inc. and the NekCEM mentor presented a talk about the “lessons learned” from porting the computational physics applications to the Titan supercomputer with hybrid OpenACC and OpenMP.
- Featured Panel: GPU Computing with OpenACC and OpenMP discusses the current state of GPU programming using compiler directives.

OLCF has partnered with NCSA and CSCS to host three GPU Hackathons this year. You do not need prior GPU experience to participate; visit the OLCF website for more information. It’s a priceless opportunity to work with OpenACC experts and get your code running on GPUs.

The cuDNN library team is excited to announce the second version of cuDNN, NVIDIA’s library of GPU-accelerated primitives for deep neural networks (DNNs). We are proud that the cuDNN library has seen broad adoption by the deep learning research community and is now integrated into major deep learning toolkits such as CAFFE, Theano and Torch. While cuDNN was conceived with developers of deep learning toolkits and systems in mind, this release is all about features and performance for the deep learning practitioner. Before we get into those details though, let’s provide some context.

Data science and machine learning have been growing rapidly in importance in recent years, along with the volume of “big data”. Machine learning provides techniques for developing systems that can automatically recognize, categorize, locate or filter the torrent of big data that flows endlessly into corporate servers (and our email inboxes). Deep neural networks (DNNs) have become an especially successful and popular technique, because DNNs are relatively straightforward to implement *and* scale well: the more data you throw at them, the better they perform. Most importantly, DNNs are now established as the most accurate technique across a range of problems, including image classification, object detection, and text and speech recognition. In fact, research teams from Microsoft, Google and Baidu have recently shown DNNs that perform better on an image recognition task than a trained human observer!

Deep learning and machine learning have been popular topics on Parallel Forall recently, so here are some pointers to excellent recent posts for more information. The original cuDNN announcement post provides an introduction to machine learning, deep learning and cuDNN. There are excellent posts on using cuDNN with Caffe for computer vision, with Torch for natural language understanding, on how Baidu uses cuDNN for speech recognition, and on embedded deep learning on Jetson TK1. There is also a recent post about BIDMach, an accelerated framework for machine learning techniques that are *not* neural network-based (SVMs, K-means, linear regression and so on).

The primary goal of cuDNN v2 is to improve performance and provide the fastest possible routines for training (and deploying) deep neural networks for practitioners. This release significantly improves the performance of many routines, especially convolutions. In Figure 1, you can see that cuDNN v2 is nearly 20 times faster than a modern CPU at training large deep neural networks! Figure 1 compares speedup (relative to Caffe running on a 16-core Intel Haswell CPU) on three well-known neural network architectures: Alexnet, Caffenet and GoogLeNet. The grey bar shows the speedup of the native (legacy) Caffe GPU implementation, and the green bar shows the speedup obtained with cuDNN v2. Note that the speedup obtained with cuDNN v2 is now 80% higher than with the legacy Caffe GPU implementation.

cuDNN v2 now allows precise control over the balance between performance and memory footprint. Specifically, cuDNN allows an application to explicitly select one of four algorithms for forward convolution, or to specify a strategy by which the library should automatically select the best algorithm. Available strategies include “prefer fastest” and “use no additional working space”. The four forward convolution algorithms are `IMPLICIT_GEMM`, `IMPLICIT_PRECOMP_GEMM`, `GEMM` and `DIRECT`.

`IMPLICIT_GEMM` is the algorithm used in cuDNN v1. It is an in-place computation, and the only algorithm that supports all input sizes and configurations while using no extra working space. If your goal is to fit the largest possible neural network model into the memory of your GPU, this is the recommended option.

The `IMPLICIT_PRECOMP_GEMM` algorithm is a modification of the `IMPLICIT_GEMM` approach, which uses a small amount of working space (see the Release Notes for details on how much) to achieve significantly higher performance than the original `IMPLICIT_GEMM` for many use cases.

The `GEMM` algorithm is an “`im2col`” approach, which explicitly expands the input data in memory and then uses a pure matrix multiplication. This algorithm requires *significant* working space, but in some cases it is the fastest approach. If you tell cuDNN to “prefer fastest”, it will sometimes choose this approach. You can use `SPECIFY_WORKSPACE_LIMIT` instead of `PREFER_FASTEST` to ensure that the algorithm cuDNN chooses will not require more than a given amount of working space.
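For readers unfamiliar with the `im2col` trick, here is a minimal NumPy sketch (not cuDNN code; the array sizes are hypothetical) showing how expanding input patches into a matrix turns convolution into a single matrix multiplication, at the cost of significant working space:

```python
import numpy as np

def im2col(x, kh, kw):
    """Expand every valid kh-by-kw patch of a 2D input into a column of a
    matrix, trading extra memory (the working space) for one big GEMM."""
    H, W = x.shape
    oh, ow = H - kh + 1, W - kw + 1
    cols = np.empty((kh * kw, oh * ow))
    for i in range(oh):
        for j in range(ow):
            cols[:, i * ow + j] = x[i:i + kh, j:j + kw].ravel()
    return cols

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 5))   # toy input "image"
k = rng.normal(size=(3, 3))   # toy filter

# convolution (cross-correlation, as in most DNN libraries) as a matrix multiply
y = (k.ravel() @ im2col(x, 3, 3)).reshape(3, 3)

# cross-check against the direct sliding-window computation
direct = np.array([[np.sum(x[i:i + 3, j:j + 3] * k) for j in range(3)]
                   for i in range(3)])
assert np.allclose(y, direct)
```

The `cols` matrix here is the “expanded input data in memory”: it duplicates overlapping patches, which is why the `GEMM` algorithm needs so much working space yet can be the fastest option.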

The `DIRECT` option is currently not implemented, so it is really just a placeholder. In a future version of cuDNN this will specify the usage of a direct convolution implementation. We will have guidelines on how this approach compares to the others when it is made available.

Besides performance, there are other new features and capabilities in cuDNN v2 aimed at helping deep learning practitioners get the most out of their systems as easily as possible.

The cuDNN interface has been generalized to support data sets with other than two spatial dimensions (for example, 1D and 3D data). In fact, cuDNN now allows arbitrary *N*-dimensional tensors. This is a forward-looking change; most routines remain limited to two spatial dimensions. As a beta feature in this release, there is now support for 3D datasets (see the Release Notes for details). The cuDNN team is looking for community feedback on the importance of higher dimensional support.

Other new features include OS X support, zero-padding of borders in pooling routines (similar to what was already provided for convolutions), parameter scaling and improved support for arbitrary strides. A number of issues identified in cuDNN v1 have been resolved. cuDNN v2 will support the forthcoming Tegra X1 processor via PTX JIT compilation as well. Please see the cuDNN Release Notes for full details on all of these important developments!

Several of the improvements described above required changes to the cuDNN API. Therefore, cuDNN v2 is not a drop-in version upgrade. Applications previously using cuDNN v1 are likely to need minor changes for API compatibility with cuDNN v2. Note that the `Im2Col` function is exposed as a public function in cuDNN v2, but it is intended for internal use only, and it will likely be removed from the public API in the next version.

cuDNN is still less than one year old. We expect cuDNN to mature rapidly, making API changes rare in the future. The cuDNN library team genuinely appreciates all feedback from the deep learning community, and carefully considers any API change.

cuDNN is free for anyone to use for any purpose: academic, research or commercial. Just sign up for a registered CUDA developer account. Once your account is activated, log in and you will see a link to the cuDNN download page. You will likely want to start by reading the included User Guide. Get started with cuDNN today!
