# Accelerate .NET Applications with Alea GPU

Today software companies use frameworks such as .NET to target multiple platforms from desktops to mobile phones with a single code base to reduce costs by leveraging existing libraries and to cope with changing trends. While developers can easily write scalable parallel code for multi-core CPUs on .NET with libraries such as the task parallel library, they face a bigger challenge using GPUs to tackle compute intensive tasks. To accelerate .NET applications with GPUs, developers must write functions in CUDA C/C++ and write or generate code to interoperate between .NET and CUDA C/C++.

Alea GPU closes this gap by bringing GPU computing directly into the .NET ecosystem. With Alea GPU you can write GPU functions in any .NET language you like, compile with your standard .NET build tool and accelerate it with a GPU. Alea GPU offers a full implementation of all CUDA features, and code compiled with Alea GPU performs as well as equivalent CUDA C/C++ code.

## CUDA on .NET with Alea GPU

Alea GPU is a professional CUDA development stack for .NET and Mono built directly on top of the NVIDIA compiler toolchain. Alea GPU offers the following benefits:

• Easy to use
• Cross-platform
• Support for many existing GPU algorithms and libraries
• Debugging and profiling functionality
• JIT compilation and a compiler API for GPU scripting
• Future-oriented technology based on LLVM
• No compromise on performance

You can easily install Alea GPU as a Nuget package, as Figure 1 shows.

## Ease of Use

Alea GPU is easy to use for all kinds of parallel problems. Developers can write GPU code in any .NET language and use the full set of CUDA device functions provided by NVIDIA LibDevice, as well as CUDA device parallel intrinsic functions, such as thread synchrhonization, warp vote functions, warp shuffle functions, and atomic functions. Let’s consider a simple example which applies the same calculation to many data values. `SquareKernel` is a GPU kernel written in C# that accesses memory on the GPU.

```static void SquareKernel(deviceptr outputs,
deviceptr inputs, int n)
{
var start = blockIdx.x * blockDim.x + threadIdx.x;
var stride = gridDim.x * blockDim.x;
for (var i = start; i < n; i += stride)
{
outputs[i] = inputs[i] * inputs[i];
}
}```

# Deep Learning for Image Understanding in Planetary Science

I stumbled upon the above tweet by Leon Palafox, a Postdoctoral Fellow at the The University of Arizona Lunar and Planetary Laboratory, and reached out to him to discuss his success with GPUs and share it with other developers interested in using deep learning for image processing.

We are working on developing a tool that can automatically identify various geological processes on the surface of Mars. Examples of geological processes include impact cratering and volcanic activity; however, these processes can generate landforms that look very similar, even though they form via vastly different mechanisms. For example, small impact craters and volcanic craters can be easily confused because they can both exhibit a prominent rim surrounding a central topographic depression.

Of particular interest to our research group is the automated mapping of volcanic rootless cones as Figure 2 shows. These landforms are generated by explosive interactions between lava and ground ice, and therefore mapping the global distribution of rootless cones on Mars would contribute to a better understanding of the distribution of near-surface water on the planet. However, to do this we must first develop algorithms that can correctly distinguish between landforms of similar appearance. This is a difficult task for planetary geologists, but we are already having great success by applying state-of-the-art artificial neural networks to data acquired by the High Resolution Imaging Science Experiment (HiRISE) camera, which is onboard the Mars Reconnaissance Orbiter (MRO) satellite.

# GPU Pro Tip: Track MPI Calls In The NVIDIA Visual Profiler

Often when profiling GPU-accelerated applications that run on clusters, one needs to visualize MPI (Message Passing Interface) calls on the GPU timeline in the profiler. While tools like Vampir and Tau will allow programmers to see a big picture view of how a parallel application performs, sometimes all you need is a look at how MPI is affecting GPU performance on a single node using a simple tool like the NVIDIA Visual Profiler. With the help of the NVIDIA Tools Extensions (NVTX) and the MPI standard itself, this is pretty easy to do.

The NVTX API lets you embed information within a GPU profile, such as marking events or annotating ranges in the timeline with details about application behavior during that time. Jiri Kraus wrote past posts about generating custom application timelines with NVTX, and about using it to label individual MPI ranks in MPI profiles. In this post I’ll show you how to use an NVTX range to annotate the time spent in MPI calls. To do this, we’ll use the MPI profiling interface (PMPI), which is a standard part of MPI. PMPI allows tools to intercept calls to the MPI library to perform actions before or after the MPI call is executed. This means that we can insert NVTX calls into our MPI library calls to mark MPI calls on the GPU timeline.

Wrapping every MPI routine in this way is a bit tedious, but fortunately there’s a tool to automate the process. We’ll use the `wrap.py` script found at https://github.com/scalability-llnl/wrap to generate the PMPI wrappers for a number of commonly used MPI routines. The input file for this script is the following (also available as a github gist):

```#include <pthread.h>
#include <nvToolsExt.h>
#include <nvToolsExtCudaRt.h>
// Setup event category name
{{fn name MPI_Init}}
nvtxNameCategoryA(999, "MPI");
{{callfn}}
int rank;
PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
char name[256];
sprintf( name, "MPI Rank %d", rank );

{{endfn}}
// Wrap select MPI functions with NVTX ranges
{{fn name MPI_Send MPI_Recv MPI_Allreduce MPI_Reduce MPI_Wait MPI_Waitany
MPI_Waitall MPI_Waitsome MPI_Gather MPI_Gatherv MPI_Scatter MPI_Scatterv
MPI_Allgather MPI_Allgatherv MPI_Alltoall MPI_Alltoallv MPI_Alltoallw MPI_Bcast
MPI_Sendrecv MPI_Barrier MPI_Start MPI_Test MPI_Send_init MPI_Recv_init }}
nvtxEventAttributes_t eventAttrib = {0};
eventAttrib.version = NVTX_VERSION;
eventAttrib.size = NVTX_EVENT_ATTRIB_STRUCT_SIZE;
eventAttrib.messageType = NVTX_MESSAGE_TYPE_ASCII;
eventAttrib.message.ascii  = "{{name}}";
eventAttrib.category = 999;

nvtxRangePushEx(&eventAttrib);
{{callfn}}
nvtxRangePop();
{{endfn}}
```

So what’s happening in this file? First, it includes the NVTX header file, and then loops over a series of common MPI functions and inserts the beginning of an NVTX range (`nvtxRangePushEx`) and then ends the range as we leave the MPI routine (`nvtxRangePop`). For convenience, I’ve named the range after the MPI routine being called. All I need to do now is call `wrap.py` to generate a C file with my PMPI wrappers, which I’ll then build with my MPI C compiler.

```\$ python wrap/wrap.py -g -o nvtx_pmpi.c nvtx.w
\$ mpicc -c nvtx_pmpi.c
```

Now I just need to rerun my code with these wrappers. To do this I’ll relink my application with the object file I just built and the NVTX library (libnvToolsExt). As an example, I’ll use the simple Jacobi Iteration used in the GTC session Multi GPU Programming with MPI, which you can find on Github. Once I’ve built both the application and the wrappers generated above, I run the executable as follows.

```\$ mpicc -fast -ta=tesla -Minfo=all \$HOME/nvtx_pmpi.o laplace2d.c -L\$CUDA_HOME/lib64 -lnvToolsExt -o laplace2d
\$ MV2_USE_CUDA=1 mpirun -np 2 nvprof -o laplace2d.%q{MV2_COMM_WORLD_RANK}.nvvp ./laplace2d
```

One word of caution: the linking order does matter when using tools such as PMPI, so if you run your code and are not seeing the expected results, the object file containing the wrappers may not appear early enough in the build command.

In the above commands I’m rebuilding my code with the necessary bits. I’m also setting MV2_USE_CUDA at runtime to enable cuda-awareness in my MVAPICH library. Additionally I’m informing nvprof to generate a timeline file per-MPI process by passing the MV2_COMM_WORLD_RANK environment variable to nvprof, which is defined to equal the MPI rank of each process. Figure 1 is the result of importing one of these resulting nvprof output files into Visual Profiler and then zooming in to an area of interest.

Looking in the “Markers and Ranges” row of the GPU timeline for MPI Rank 0, we see three green boxes denoting two calls to MPI_Sendrecv and one to MPI_Allreduce. Furthermore, we can see that the MPI library is using a device-to-device memcpy operation to communicate between two GPUs on the same node. As you can see, the NVIDIA Visual Profiler, combined with PMPI and NVTX can give you interesting insights into how the MPI calls in your application interact with the GPU.

# Parallel Direct Solvers with cuSOLVER: Batched QR

[Note: Lung Sheng Chien from NVIDIA also contributed to this post.]

A key bottleneck for most science and engineering simulations is the solution of sparse linear systems of equations, which can account for up to 95% of total simulation time. There are two types of solvers for these systems: iterative and direct solvers.  Iterative solvers are favored for the largest systems these days (see my earlier posts about AmgX), while direct solvers are useful for smaller systems because of their accuracy and robustness.

CUDA 7 expands the capabilities of GPU-accelerated numerical computing with cuSOLVER, a powerful new suite of direct linear system solvers.   These solvers provide highly accurate and robust solutions for smaller systems, and cuSOLVER offers a way of combining many small systems into a ‘batch’ and solving all of them in parallel, which is critical for the most complex simulations today.   Combustion models, bio-chemical models and advanced high-order finite-element models all benefit directly from this new capability.  Computer vision and object detection applications need to solve many least-squares problems, so they will also benefit from cuSOLVER.

Direct solvers rely on algebraic factorization of a matrix, which breaks a hard-to-solve matrix into two or more easy-to-solve factors, and a solver routine which uses the factors and a right hand side vector and solves them one at a time to give a highly accurate solution. Figure 1 shows an example of $LDL^T$ factorization of a dense matrix.   A solver for this factorization would first solve the transpose of L part, then apply the inverse of the D (diagonal) part in parallel, then solve again with L to arrive at the final answer. The benefit of direct solvers is that (unlike iterative solvers), they always find a solution (when the factors exist; more on this later) and once a factorization is found, solutions for many right-hand sides can be performed using the factors at a much lower cost per solution. Also, for small systems, direct solvers are typically faster than iterative methods because they only pass over the matrix once.

In this post I give an overview of cuSOLVER followed by an example of using batch QR factorization for solving many sparse systems in parallel. In a followup post I will cover other aspects of cuSOLVER, including dense system solvers and the cuSOLVER refactorization API.

# Accelerating Dissipative Particle Dynamics Simulation on Tesla GPUs

As you are probably aware, CUDA 7 was officially released during the 2015 GPU Technology Conference. For this Spotlight I took a few minutes to pick the brain of an early adopter of CUDA 7 to see how his work benefits from the new C++11 support.

I interviewed Yu-Hang Tang, a Ph.D. candidate in the Division of Applied Mathematics at Brown University in Providence, Rhode Island.

### What breakthrough project is currently taking up all of your brain’s time?

At this moment we are finalizing a particle-based simulator for the in silico investigation of microfluidic devices used in cancer diagnostic. The code enables us to predict the behavior of cancer cells as well as blood cells in various microfluidic channels. It could significantly speed up the process of microfluidic device design, which is usually time-consuming due to the large amount of trial-and-error experiments.

We will release the work by end of April and I will be happy to talk about more details by that time.

I started programming on the GeForce GTX 460 GPUs using OpenCL since 2010, and in 2012 I shifted entirely to CUDA C++.

Right now, I use mostly Kepler GPUs with high double-precision floating-point performance. I have been focused on accelerating particle-based simulations including All-Atom Molecular Dynamics (AAMD), Dissipative Particle Dynamics (DPD) and Smoothed Particle Hydrodynamics (SPH).

In fact, I have developed an entire GPU package (our USERMESO package), for the LAMMPS (Large-scale Atomic/Molecular Massively Parallel Simulator) particle simulator for DPD and SPH simulations. The package achieves 20x to 30x speed up on a single K20 GPU over 16 AMD CPU cores on a Cray XK7 compute node.

### How has GPU computing impacted your research?

Our USERMESO package allows us to simulate DPD systems containing several millions of particles for millions of time steps on a daily basis during the study of the self-assembly behavior of amphiphilic polymers. The multi-compartment multi-walled vesicle, or simply think of it as a miniature cell, as Figure 1 shows, is only observable at a spatial-temporal scale that is tens of times larger, and tens of times longer than that covered by typical contemporary DPD simulations. With the USERMESO code we can perform such simulations daily with just 16 GPUs!

# Get Ready for the Low-Power Image Recognition Challenge with Jetson TK1

Image recognition and GPUs go hand-in-hand, particularly when using deep neural networks (DNNs). The strength of GPU-based DNNs for image recognition has been unequivocally demonstrated by their success over the past few years in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), and DNNs have recently achieved classification accuracy on par with trained humans, as Figure 1 shows. The new Low-Power Image Recognition Challenge (LPIRC) highlights the importance of image recognition on mobile and embedded devices.

DNNs with convolutional layers are a biologically inspired artificial neural network. These networks may have five or more layers with many neurons in each layer. Links similar to synapses connect the layers, forwarding information to the next layer. The training process adjusts weights on the links, improving the network’s ability to classify the information presented to it. The more data used to train a DNN, the better its classification performance. This big data requirement has resulted in heavy GPU use, because GPUs are designed for high throughput on highly parallel computations like those used in deep learning.

ImageNet is a great resource for imagery, hosting a large database of images organized according to a hierarchy of descriptive nouns. Each year, ImageNet hosts the ILSVRC, for which entrants develop algorithms for accurately recognizing objects in the images. ImageNet provides a large image set of over 1.2 million images from 1000 different object categories for training recognition algorithms. Academic as well as industrial participants have performed strongly, with competitors from Google, Stanford University, University of California, Berkeley, and Adobe (among many others) in recent years.

## A Low-Power Challenge

To motivate improved image recognition on low-power devices, Yung-Hsiang Lu, Associate Professor of Electrical and Computer Engineering at Purdue University, and Alex Berg, Assistant Professor of Computer Science at UNC Chapel Hill, are organizing the Low-Power Image Recognition Challenge (LPIRC), a competition focused on identifying the best technology in both image recognition and energy conservation. Registration for the LPIRC is now open.

Achieving high performance while maintaining low power can be challenging, as these two parameters often increase together. Last year NVIDIA released the Jetson TK1 Development Kit, a low-power GPU-accelerated computing platform that is well-suited for image processing and computer vision applications. Jetson TK1’s low power requirements and image processing capabilities will make it a popular platform for LPIRC competitors. Continue reading

# Porting Scientific Applications to GPUs at the OLCF OpenACC Hackathon

Six scientific computing teams from around the world spent an intense week late last year porting their applications to GPUs using OpenACC directives. The Oak Ridge Leadership Computing Facility (OLCF) hosted its first ever OpenACC Hackathon in Knoxville, Tennessee. Paired with two GPU mentors, each team of scientific developers set forth on the journey to accelerate their code with GPUs.

Dr. Misun Min, a computational scientist at Argonne National Laboratory, led the NekCEM Team and she shared the results of accelerating NekCEM with OpenACC and NVIDIA GPUDirect™ communication.

### Who were the NekCEM hackathon team members, and how much GPU computing experience did your team have?

I have only six months experience; but at the time of the Hackathon, I didn’t really have any. The other members included Matthew Otten from Cornell University (six months GPU computing experience), Jing Gong from KTH in Sweden (two years of OpenACC experience), and Azamat Mametjanov from Argonne. The team also had close interactions with Nek5000 developer Paul Fischer at UIUC for useful discussions.

Two mentors from Cray Inc.: Aaron Vose and John Levesque. Aaron and John provided strong technical support to boost the performance of a GPU-enabled NekCEM version.

### First, what is NekCEM?

NekCEM (Nekton for Computational ElectroMagnetics) is an open-source code designed for predictive modeling of electromagnetic systems, such as linear accelerators, semiconductors, plasmonic devices, and quantum systems described by the Maxwell, Helmholtz, drift-diffusion, and Schrödinger or density matrix equations. The code is based on high-order discretizations of the underlying partial differential equations using spectral element (SE) and spectral-element discontinuous Galerkin (SEDG) schemes that have been shown to require order-of-magnitude fewer grid points than do conventional low-order schemes for the same accuracy. NekCEM uses globally unstructured meshes comprising body-fitted curvilinear hexahedral elements, which allow the discrete operators to be expressed as matrix-matrix products applied to arrays of the tensor product basis of Lagrange interpolation polynomials on the Gauss-Lobatto-Legendre quadrature points. The tight coupling of the degrees of freedom within elements leads to efficient data reuse while requiring boundary-minimal (unit-depth-stencil) data communication to effect flux exchanges between neighboring elements.

### What were your team’s goals going into the OpenACC Hackathon?

The team had two goals: (1) to develop a high-performance GPU-based operational variant of NekCEM that supports the full functionality of the existing CPU-only code in Fortran/C and (2) to perform analysis to find performance bottlenecks and infer potential scalability for GPU-based architectures of the future. Continue reading

# cuDNN v2: Higher Performance for Deep Learning on GPUs

The cuDNN library team is excited to announce the second version of cuDNN, NVIDIA’s library of GPU-accelerated primitives for deep neural networks (DNNs). We are proud that the cuDNN library has seen broad adoption by the deep learning research community and is now integrated into major deep learning toolkits such as CAFFE, Theano and Torch. While cuDNN was conceived with developers of deep learning toolkits and systems in mind, this release is all about features and performance for the deep learning practitioner. Before we get into those details though, let’s provide some context.

## Deep Learning for Big Data

Data science and machine learning have been growing rapidly in importance in recent years, along with the volume of “big data”. Machine learning provides techniques for developing systems that can automatically recognize, categorize, locate or filter the torrent of big data that flows endlessly into corporate servers (and our email inboxes). Deep neural networks (DNNs) have become an especially successful and popular technique, because DNNs are relatively straightforward to implement and scale well—the more data you throw at them the better they perform. Most importantly, DNNs are now established as the most accurate technique across a range of problems, including image classification, object detection, and text and speech recognition. In fact, research teams from Microsoft, Google and Baidu have recently shown DNNs that perform better on an image recognition task than a trained human observer!

Deep learning and machine learning have been popular topics on Parallel Forall recently, so here are some pointers to excellent recent posts for more information. The original cuDNN announcement post provides an introduction to machine learning, deep learning and cuDNN. There are excellent posts on using cuDNN with Caffe for computer vision, with Torch for natural language understanding, on how Baidu uses cuDNN for speech recognition, and on embedded deep learning on Jetson TK1. There is also a recent post about BIDMach, an accelerated framework for machine learning techniques that are not neural network-based (SVMs, K-means, linear regression and so on). Continue reading

# C++11 in CUDA: Variadic Templates

CUDA 7 adds C++11 feature support to nvcc, the CUDA C++ compiler. This means that you can use C++11 features not only in your host code compiled with `nvcc`, but also in device code. In my post “The Power of C++11 in CUDA 7” I covered some of the major new features of C++11, such as lambda functions, range-based for loops, and automatic type deduction (`auto`). In this post, I’ll cover variadic templates.

There are times when you need to write functions that take a variable number of arguments: variadic functions. To do this in a typesafe manner for polymorphic functions, you really need to take a variable number of types in a template. Before C++11, the only way to write variadic functions was with the ellipsis (`...`) syntax and the `va_*` facilities. These facilities did not enable type safety and can be difficult to use.

As an example, let’s say we want to abstract the launching of GPU kernels. In my case, I want to provide simpler launch semantics in the Hemi library. There are many cases where you don’t care to specify the number and size of thread blocks—you just want to run a kernel with “enough” threads to fully utilize the GPU, or to cover your data size. In that case we can let the library decide how to launch the kernel, simplifying our code. But to launch arbitrary kernels, we have to support arbitrary type signatures. Well, we can do that like this:

```template <typename... Arguments>
void cudaLaunch(const ExecutionPolicy &p,
void(*f)(Arguments...),
Arguments... args);
```

Here, `Arguments...` is a “type template parameter pack”. We can use it to refer to the type signature of our kernel function pointer `f`, and to the arguments of `cudaLaunch`. To do the same thing before C++11 (and CUDA 7) required providing multiple implementations of `cudaLaunch`, one for each number of arguments we wanted to support. That meant you had to limit the maximum number of arguments allowed, as well as the amount of code you had to maintain. In my experience this was prone to bugs. Here’s the implementation of `cudaLaunch`. Continue reading

# The Power of C++11 in CUDA 7

Today I’m excited to announce the official release of CUDA 7, the latest release of the popular CUDA Toolkit. Download the CUDA Toolkit version 7 now from CUDA Zone!

CUDA 7 has a huge number of improvements and new features, including C++11 support, the new cuSOLVER library, and support for Runtime Compilation. In a previous post I told you about the features of CUDA 7, so I won’t repeat myself here. Instead, I wanted to take a deeper look at C++11 support in device code.

CUDA 7 adds C++11 feature support to nvcc, the CUDA C++ compiler. This means that you can use C++11 features not only in your host code compiled with `nvcc`, but also in device code. New C++ language features include `auto`, lambda functions, variadic templates, `static_assert`, rvalue references, range-based for loops, and more. To enable C++11 support, pass the flag `--std=c++11` to `nvcc` (this option is not required for Microsoft Visual Studio).

In my earlier CUDA 7 feature overview post, I presented a small example to show some C++11 features. Let’s dive into a somewhat expanded example to show the power of C++11 for CUDA programmers. This example will proceed top-down, covering a couple of layers of abstraction that allow us to write concise, reusable C++ code for the GPU, all enabled by C++11. The complete example is available on Github.

Let’s say we have a very specific (albeit contrived) goal: count the number of characters from a certain set within a text. (In parallel, of course!) Here’s a simple CUDA C++11 kernel that abstracts the mechanics of this a bit.

```__global__
void xyzw_frequency(int *count, char *text, int n)
{
const char letters[] { 'x','y','z','w' };

count_if(count, text, n, [&](char c) {
for (const auto x : letters)
if (c == x) return true;
return false;
});
}
```