# Accelerate .NET Applications with Alea GPU

Today software companies use frameworks such as .NET to target multiple platforms from desktops to mobile phones with a single code base to reduce costs by leveraging existing libraries and to cope with changing trends. While developers can easily write scalable parallel code for multi-core CPUs on .NET with libraries such as the task parallel library, they face a bigger challenge using GPUs to tackle compute intensive tasks. To accelerate .NET applications with GPUs, developers must write functions in CUDA C/C++ and write or generate code to interoperate between .NET and CUDA C/C++.

Alea GPU closes this gap by bringing GPU computing directly into the .NET ecosystem. With Alea GPU you can write GPU functions in any .NET language you like, compile with your standard .NET build tool and accelerate it with a GPU. Alea GPU offers a full implementation of all CUDA features, and code compiled with Alea GPU performs as well as equivalent CUDA C/C++ code.

## CUDA on .NET with Alea GPU

Alea GPU is a professional CUDA development stack for .NET and Mono built directly on top of the NVIDIA compiler toolchain. Alea GPU offers the following benefits:

• Easy to use
• Cross-platform
• Support for many existing GPU algorithms and libraries
• Debugging and profiling functionality
• JIT compilation and a compiler API for GPU scripting
• Future-oriented technology based on LLVM
• No compromise on performance

You can easily install Alea GPU as a Nuget package, as Figure 1 shows.

## Ease of Use

Alea GPU is easy to use for all kinds of parallel problems. Developers can write GPU code in any .NET language and use the full set of CUDA device functions provided by NVIDIA LibDevice, as well as CUDA device parallel intrinsic functions, such as thread synchrhonization, warp vote functions, warp shuffle functions, and atomic functions. Let’s consider a simple example which applies the same calculation to many data values. `SquareKernel` is a GPU kernel written in C# that accesses memory on the GPU.

```static void SquareKernel(deviceptr outputs,
deviceptr inputs, int n)
{
var start = blockIdx.x * blockDim.x + threadIdx.x;
var stride = gridDim.x * blockDim.x;
for (var i = start; i < n; i += stride)
{
outputs[i] = inputs[i] * inputs[i];
}
}```

# Deep Learning for Image Understanding in Planetary Science

I stumbled upon the above tweet by Leon Palafox, a Postdoctoral Fellow at the The University of Arizona Lunar and Planetary Laboratory, and reached out to him to discuss his success with GPUs and share it with other developers interested in using deep learning for image processing.

We are working on developing a tool that can automatically identify various geological processes on the surface of Mars. Examples of geological processes include impact cratering and volcanic activity; however, these processes can generate landforms that look very similar, even though they form via vastly different mechanisms. For example, small impact craters and volcanic craters can be easily confused because they can both exhibit a prominent rim surrounding a central topographic depression.

Of particular interest to our research group is the automated mapping of volcanic rootless cones as Figure 2 shows. These landforms are generated by explosive interactions between lava and ground ice, and therefore mapping the global distribution of rootless cones on Mars would contribute to a better understanding of the distribution of near-surface water on the planet. However, to do this we must first develop algorithms that can correctly distinguish between landforms of similar appearance. This is a difficult task for planetary geologists, but we are already having great success by applying state-of-the-art artificial neural networks to data acquired by the High Resolution Imaging Science Experiment (HiRISE) camera, which is onboard the Mars Reconnaissance Orbiter (MRO) satellite.

# Parallel Direct Solvers with cuSOLVER: Batched QR

[Note: Lung Sheng Chien from NVIDIA also contributed to this post.]

A key bottleneck for most science and engineering simulations is the solution of sparse linear systems of equations, which can account for up to 95% of total simulation time. There are two types of solvers for these systems: iterative and direct solvers.  Iterative solvers are favored for the largest systems these days (see my earlier posts about AmgX), while direct solvers are useful for smaller systems because of their accuracy and robustness.

CUDA 7 expands the capabilities of GPU-accelerated numerical computing with cuSOLVER, a powerful new suite of direct linear system solvers.   These solvers provide highly accurate and robust solutions for smaller systems, and cuSOLVER offers a way of combining many small systems into a ‘batch’ and solving all of them in parallel, which is critical for the most complex simulations today.   Combustion models, bio-chemical models and advanced high-order finite-element models all benefit directly from this new capability.  Computer vision and object detection applications need to solve many least-squares problems, so they will also benefit from cuSOLVER.

Direct solvers rely on algebraic factorization of a matrix, which breaks a hard-to-solve matrix into two or more easy-to-solve factors, and a solver routine which uses the factors and a right hand side vector and solves them one at a time to give a highly accurate solution. Figure 1 shows an example of $LDL^T$ factorization of a dense matrix.   A solver for this factorization would first solve the transpose of L part, then apply the inverse of the D (diagonal) part in parallel, then solve again with L to arrive at the final answer. The benefit of direct solvers is that (unlike iterative solvers), they always find a solution (when the factors exist; more on this later) and once a factorization is found, solutions for many right-hand sides can be performed using the factors at a much lower cost per solution. Also, for small systems, direct solvers are typically faster than iterative methods because they only pass over the matrix once.

In this post I give an overview of cuSOLVER followed by an example of using batch QR factorization for solving many sparse systems in parallel. In a followup post I will cover other aspects of cuSOLVER, including dense system solvers and the cuSOLVER refactorization API.

# cuDNN v2: Higher Performance for Deep Learning on GPUs

The cuDNN library team is excited to announce the second version of cuDNN, NVIDIA’s library of GPU-accelerated primitives for deep neural networks (DNNs). We are proud that the cuDNN library has seen broad adoption by the deep learning research community and is now integrated into major deep learning toolkits such as CAFFE, Theano and Torch. While cuDNN was conceived with developers of deep learning toolkits and systems in mind, this release is all about features and performance for the deep learning practitioner. Before we get into those details though, let’s provide some context.

## Deep Learning for Big Data

Data science and machine learning have been growing rapidly in importance in recent years, along with the volume of “big data”. Machine learning provides techniques for developing systems that can automatically recognize, categorize, locate or filter the torrent of big data that flows endlessly into corporate servers (and our email inboxes). Deep neural networks (DNNs) have become an especially successful and popular technique, because DNNs are relatively straightforward to implement and scale well—the more data you throw at them the better they perform. Most importantly, DNNs are now established as the most accurate technique across a range of problems, including image classification, object detection, and text and speech recognition. In fact, research teams from Microsoft, Google and Baidu have recently shown DNNs that perform better on an image recognition task than a trained human observer!

Deep learning and machine learning have been popular topics on Parallel Forall recently, so here are some pointers to excellent recent posts for more information. The original cuDNN announcement post provides an introduction to machine learning, deep learning and cuDNN. There are excellent posts on using cuDNN with Caffe for computer vision, with Torch for natural language understanding, on how Baidu uses cuDNN for speech recognition, and on embedded deep learning on Jetson TK1. There is also a recent post about BIDMach, an accelerated framework for machine learning techniques that are not neural network-based (SVMs, K-means, linear regression and so on). Continue reading

# C++11 in CUDA: Variadic Templates

CUDA 7 adds C++11 feature support to nvcc, the CUDA C++ compiler. This means that you can use C++11 features not only in your host code compiled with `nvcc`, but also in device code. In my post “The Power of C++11 in CUDA 7” I covered some of the major new features of C++11, such as lambda functions, range-based for loops, and automatic type deduction (`auto`). In this post, I’ll cover variadic templates.

There are times when you need to write functions that take a variable number of arguments: variadic functions. To do this in a typesafe manner for polymorphic functions, you really need to take a variable number of types in a template. Before C++11, the only way to write variadic functions was with the ellipsis (`...`) syntax and the `va_*` facilities. These facilities did not enable type safety and can be difficult to use.

As an example, let’s say we want to abstract the launching of GPU kernels. In my case, I want to provide simpler launch semantics in the Hemi library. There are many cases where you don’t care to specify the number and size of thread blocks—you just want to run a kernel with “enough” threads to fully utilize the GPU, or to cover your data size. In that case we can let the library decide how to launch the kernel, simplifying our code. But to launch arbitrary kernels, we have to support arbitrary type signatures. Well, we can do that like this:

```template <typename... Arguments>
void cudaLaunch(const ExecutionPolicy &p,
void(*f)(Arguments...),
Arguments... args);
```

Here, `Arguments...` is a “type template parameter pack”. We can use it to refer to the type signature of our kernel function pointer `f`, and to the arguments of `cudaLaunch`. To do the same thing before C++11 (and CUDA 7) required providing multiple implementations of `cudaLaunch`, one for each number of arguments we wanted to support. That meant you had to limit the maximum number of arguments allowed, as well as the amount of code you had to maintain. In my experience this was prone to bugs. Here’s the implementation of `cudaLaunch`. Continue reading

# The Power of C++11 in CUDA 7

Today I’m excited to announce the official release of CUDA 7, the latest release of the popular CUDA Toolkit. Download the CUDA Toolkit version 7 now from CUDA Zone!

CUDA 7 has a huge number of improvements and new features, including C++11 support, the new cuSOLVER library, and support for Runtime Compilation. In a previous post I told you about the features of CUDA 7, so I won’t repeat myself here. Instead, I wanted to take a deeper look at C++11 support in device code.

CUDA 7 adds C++11 feature support to nvcc, the CUDA C++ compiler. This means that you can use C++11 features not only in your host code compiled with `nvcc`, but also in device code. New C++ language features include `auto`, lambda functions, variadic templates, `static_assert`, rvalue references, range-based for loops, and more. To enable C++11 support, pass the flag `--std=c++11` to `nvcc` (this option is not required for Microsoft Visual Studio).

In my earlier CUDA 7 feature overview post, I presented a small example to show some C++11 features. Let’s dive into a somewhat expanded example to show the power of C++11 for CUDA programmers. This example will proceed top-down, covering a couple of layers of abstraction that allow us to write concise, reusable C++ code for the GPU, all enabled by C++11. The complete example is available on Github.

Let’s say we have a very specific (albeit contrived) goal: count the number of characters from a certain set within a text. (In parallel, of course!) Here’s a simple CUDA C++11 kernel that abstracts the mechanics of this a bit.

```__global__
void xyzw_frequency(int *count, char *text, int n)
{
const char letters[] { 'x','y','z','w' };

count_if(count, text, n, [&](char c) {
for (const auto x : letters)
if (c == x) return true;
return false;
});
}
```

# 12 GTC 2015 Sessions Not to Miss

With one week to go until we all descend on GTC 2015, I’ve scoured through the list of Accelerated Computing sessions and put together 12 diverse “not to miss” talks you should add to your planner. This year, the conference is highlighting the revolution in Deep Learning that will affect every aspect of computing. GTC 2015 includes over 40 session categories, including deep learning and machine learning, scientific visualization, cloud computing, and HPC.

This is the place where scientists, programmers, researchers, and a
myriad of creative professionals convene to tap into the power of a GPU
for more than gaming. –Forbes

## Tuesday, March 17

### An Introduction to CUDA Programming (S5661)

1:00-2:20pm

This is the introductory tutorial intended for those new to CUDA and you will leave with the essential knowledge to start programming in CUDA – no experience is needed! For those that have prior CUDA experience, this is a great session to brush up on key concepts required for subsequent tutorials on CUDA optimization. The other tutorials in this session are: An Introduction to the GPU Memory ModelAsynchronous Operations and Dynamic Parallelism in CUDA and Essential CUDA Optimization Techniques.

### SMTool: A GPU based Satellite Image Analysis Tool (S5201)

2:00-2:25pm

Dilip Patlolla, R&D Engineer in the Geographic Information Science and Technology (GIST) Group at the Oak Ridge National Laboratory, will demonstrate their advanced satellite image analytic tool referred as SMTool built on the CUDA platform to process city-scale sub-meter resolution satellite imagery to detect and discriminate man-made structures. Continue reading

# Learn GPU Computing with Hands-On Labs at GTC 2015

Every year NVIDIA’s GPU Technology Conference (GTC) gets bigger and better. One of the aims of GTC is to give developers, scientists, and practitioners opportunities to learn with hands-on labs how to use accelerated computing in their work. This year we are nearly doubling the amount of hands-on training provided from last year, with almost 2,400 lab hours available to GTC attendees!

We have two types of training this year at GTC: instructor-led labs and self-paced labs. And to help you keep up with one of the hottest trends in computing, this year we’re featuring a Deep Learning training track. Keep reading for details. If you haven’t registered for GTC yet this year, keep reading for a discount code.

## Deep Learning Track

There is an explosion of Deep Learning topics at GTC, and it’s not limited to the keynotes, talks and tutorial sessions. We’ll feature at least six hands-on labs related to accelerating facets of Deep Learning on GPUs. From an introduction to Deep Learning on GPUs to cutting-edge techniques and tools, there will be something for everyone. Be sure to get to these labs early to get yourself a seat! Here are a few of the labs available in this track:

• Introduction to Machine Learning with GPUs: Handwritten digit classification (S5674)
• DIY Deep Learning for Vision with Caffe (S5647)
• Applied Deep Learning for Vision, Natural Language and Audio with Torch7 (S5574)
• Deep Learning with the Theano Python Library (S5732)
• Deep Belief Networks Using ArrayFire (S5722)
• Accelerate a Machine Learning C++ example with Thrust (S5822)

## Instructor-led Labs

Just like GTC last year, there will be twenty hands-on instructor-led labs. These are 80-minute labs led by an expert on the topic. Continue reading

# Accelerating Bioinformatics with NVBIO

NVBIO is an open-source C++ template library of high performance parallel algorithms and containers designed by NVIDIA to accelerate sequence analysis and bioinformatics applications. NVBIO has a threefold focus:

1. Performance, providing a suite of state-of-the-art parallel algorithms that offer a significant leap in performance;
2. Reusability, providing a suite of highly expressive and flexible template algorithms that can be easily configured and adjusted to the many different usage scenarios typical in bioinformatics;
3. Portability, providing a completely cross-platform suite of tools, that can be easily switched from running on NVIDIA GPUs to multi-core CPUs by changing a single template parameter.

## Exponential Parallelism

We built NVBIO because we believe only the exponentially increasing parallelism of many-core GPU architectures can provide the immense computational capability required by the exponentially increasing sequencing throughput.

There is a common misconception that GPUs only excel at highly regular, floating point intensive applications, but today’s GPUs are fully programmable parallel processors, offering superior memory bandwidth and latency hiding characteristics, and R&D efforts at NVIDIA and elsewhere have proved that they can be a perfect match even for branchy, integer-heavy bioinformatics applications. The caveat is that legacy applications need to be rethought for fine-grained parallelism.

Many CPU algorithms are designed to run on few cores and scale to a tiny number of threads. When the number of threads is measured in the thousands, rather than dozens—a fact that all applications inevitably must consider—applications must tackle fundamental problems related to load balancing, synchronization, and execution and memory divergence.

NVBIO does just that, providing both low-level primitives that can be used from either CPU/host or GPU/device threads, as well as novel, highly parallel high-level primitives designed to scale from the ground up. Continue reading

# GPU Pro Tip: Fast Dynamic Indexing of Private Arrays in CUDA

Sometimes you need to use small per-thread arrays in your GPU kernels. The performance of accessing elements in these arrays can vary depending on a number of factors. In this post I’ll cover several common scenarios ranging from fast static indexing to more complex and challenging use cases.

## Static indexing

Before discussing dynamic indexing let’s briefly look at static indexing. For small arrays where all indices are known constants at compile time, as in the following sample code, the compiler places all accessed elements of the array into registers.

```__global__ void kernel1(float * buf)
{
float a[2];
...
float sum = a[0] + a[1];
...
}```

This way array elements are accessed in the fastest way possible: math instructions use the data directly without loads and stores.

A slightly more complex (and probably more useful) case is an unrolled loop over the indices of the array. In the following code the compiler is also capable of assigning the accessed array elements to registers.

```__global__ void kernel2(float * buf)
{
float a[5];
...
float sum = 0.0f;
#pragma unroll
for(int i = 0; i < 5; ++i)
sum += a[i];
...
}```

Here we tell the compiler to unroll the loop with the directive `#pragma unroll`, effectively replacing the loop with all the iterations listed explicitly, as in the following snippet.

```sum += a[0];
sum += a[1];
sum += a[2];
sum += a[3];
sum += a[4];```

All the indices are now constants, so the compiler puts the whole array into registers. Continue reading