# CUDA Pro Tip: Optimize for Pointer Aliasing

Often cited as the main reason that naïve C/C++ code cannot match FORTRAN performance, pointer aliasing is an important topic to understand when considering optimizations for your C/C++ code. In this tip I will describe what pointer aliasing is and a simple way to alter your code so that it does not harm your application performance.

### What is pointer aliasing?

Two pointers alias if the memory to which they point overlaps. When a compiler can’t determine whether pointers alias, it has to assume that they do. The following simple function shows why this is potentially harmful to performance:

void example1(float *a, float *b, float *c, int i) {
a[i] = a[i] + c[i];
b[i] = b[i] + c[i];
}

At first glance it might seem that this function needs to perform three load operations from memory: one for a[i], one for b[i] and one for c[i]. This is incorrect because it assumes that c[i] can be reused once it is loaded. Consider the case where a and c point to the same address. In this case the first line modifies the value c[i] when writing to a[i]. Therefore the compiler must generate code to reload c[i] on the second line, in case it has been modified.

Because the compiler must conservatively assume the pointers alias, it will compile the above code inefficiently, even if the programmer knows that the pointers never alias.

### What can I do about aliasing?

Fortunately almost all C/C++ compilers offer a way for the programmer to give the compiler information about pointer aliasing. Continue reading

# Accelerate R Applications with CUDA

R is a free software environment for statistical computing and graphics that provides a programming language and built-in libraries of mathematics operations for statistics, data analysis, machine learning and much more. Many domain experts and researchers use the R platform and contribute R software, resulting in a large ecosystem of free software packages available through CRAN (the Comprehensive R Archive Network).

However, R, like many other high-level languages, is not performance competitive out of the box with lower-level languages like C++, especially for highly data- and computation-intensive applications. R programs tend to process large amounts of data, and often have significant independent data and task parallelism. Therefore, R applications stand to benefit from GPU acceleration. This way, R users can benefit from R’s high-level, user-friendly interface while achieving high performance.

In this article, I will introduce the computation model of R with GPU acceleration, focusing on three topics:

• accelerating R computations using CUDA libraries;
• calling your own parallel algorithms written in CUDA C/C++ or CUDA Fortran from R; and
• profiling GPU-accelerated R applications using the CUDA Profiler.

## The GPU-Accelerated R Software Stack

Figure 1 shows that there are two ways to apply the computational power of GPUs in R:

1. use R GPU packages from CRAN; or
2. access the GPU through CUDA libraries and/or CUDA-accelerated programming languages, including C, C++ and Fortran.
Figure 1: The R + GPU software stack.

The first approach is to use existing GPU-accelerated R packages listed under High-Performance and Parallel Computing with R on the CRAN site. Examples include gputools and cudaBayesreg. These packages are very easy to install and use. On the other hand, the number of GPU packages is currently limited, quality varies, and only a few domains are covered. This will improve with time.

The second approach is to use the GPU through CUDA directly. Continue reading

# Calling CUDA-accelerated Libraries from MATLAB: A Computer Vision Example

In an earlier post we showed how MATLAB® can support CUDA kernel prototyping and development by providing an environment for quick evaluation and visualization using the CUDAKernel object. In this post I will show you how to integrate an existing library of both host and device code implemented in C++ or another CUDA-accelerated language using MEX. With MEX you can extend and customize MATLAB, or use MATLAB as a test environment for your production code.

The MATLAB MEX compiler allows you to expose your libraries to the MATLAB environment as functions. You write your entry point in C, C++ or Fortran as a modified main() function which MATLAB invokes. MEX provides a framework for compiling this code, as well as an API for interacting with MATLAB and MATLAB data in your source code.

MATLAB’s Parallel Computing Toolbox™ provides constructs for compiling CUDA C and C++ with nvcc, and new APIs for accessing and using the gpuArray datatype which represents data stored on the GPU as a numeric array in the MATLAB workspace.

## Feature Detection Example

I am going to use a feature detection example from MATLAB’s documentation for Computer Vision System Toolbox™. This uses tracked features to remove camera shake from an in-car road video. You will need MATLAB®, Parallel Computing Toolbox™, Image Processing Toolbox™ and Computer Vision System Toolbox™ to run the code. You can request a trial of these products at www.mathworks.com/trial. This example also depends on the OpenCV Computer Vision library, compiled with CUDA support.

Features are an essential prerequisite for many Computer Vision tasks; in this case, for instance, they might also be used to determine the motion of the car or to track other cars on the road.

To set up the example environment, I am using the following MATLAB code to load the video and display the first two frames superimposed (Figure 1). Continue reading

# CUDA Spotlight: Women and CUDA

Today we are pleased to launch the new Women and CUDA website.

We received a wide variety of entries from around the world, representing professors, students, researchers, scientists and domain experts. We had recognized several participants earlier as CUDA Spotlights, including Valerie Halyo of Princeton and Monica Syal of Advanced Rotorcraft Technology.

CUDA Fellow Lorena Barba comments: “This is a good way to remind people that women write code, participate in open-source projects, and invent things. It’s important to make the technology world more attractive to female students.” Dr. Barba is an associate professor of Engineering and Applied Science at George Washington University.

Congrats to the women on our inaugural Women and CUDA list:

Sonia Lopez Alarcon, Rochester Institute of Technology | Heterogeneous Computing
Rommie Amaro, University of California, San Diego | Biological Systems
Michela Becchi, University of Missouri | GPU Virtualization Continue reading

# Accelerating Graph Betweenness Centrality with CUDA

Graph analysis is a fundamental tool for domains as diverse as social networks, computational biology, and machine learning. Real-world applications of graph algorithms involve tremendously large networks that cannot be inspected manually. Betweenness Centrality (BC) is a popular analytic that determines vertex influence in a graph. It has many practical use cases, including finding the best locations for stores within cities, power grid contingency analysis, and community detection. Unfortunately, the fastest known algorithm for computing betweenness centrality has $O(mn)$ time complexity for graphs with $n$ vertices and $m$ edges, making the analysis of large networks challenging.

This post describes how we used CUDA and NVIDIA GPUs to accelerate the BC computation, and how choosing efficient parallelization strategies results in an average speedup of 2.7x, and more than 10x speedup for road networks and meshes versus a naïve edge-parallel strategy.

Betweenness Centrality determines the importance of vertices in a network by measuring the ratio of shortest paths passing through a particular vertex to the total number of shortest paths between all pairs of vertices. Intuitively, this ratio determines how well a vertex connects pairs of vertices in the network. Formally, the Betweenness Centrality of a vertex $v$ is defined as:

$BC(v) = \sum_{s \neq t \neq v} \frac{\sigma_{st}(v)}{\sigma_{st}}$

where $\sigma_{st}$ is the number of shortest paths between vertices $s$ and $t$ and $\sigma_{st}(v)$ is the number of those shortest paths that pass through $v$. Consider Figure 1 above. Vertex 4 is the only vertex that lies on paths from its left (vertices 5 through 9) to its right (vertices 1 through 3). Hence vertex 4 lies on all the shortest paths between these pairs of vertices and has a high BC score. In contrast, vertex 9 does not belong on a path between any pair of the remaining vertices and thus it has a BC score of 0. Continue reading

# CUDA Pro Tip: Occupancy API Simplifies Launch Configuration

CUDA programmers often need to decide on a block size to use for a kernel launch. For key kernels, its important to understand the constraints of the kernel and the GPU it is running on to choose a block size that will result in good performance. One common heuristic used to choose a good block size is to aim for high occupancy, which is the ratio of the number of active warps per multiprocessor to the maximum number of warps that can be active on the multiprocessor at once. Higher occupancy does not always mean higher performance, but it is a useful metric for gauging the latency hiding ability of a kernel.

Before CUDA 6.5, calculating occupancy was tricky. It required implementing a complex computation that took account of the present GPU and its capabilities (including register file and shared memory size), and the properties of the kernel (shared memory usage, registers per thread, threads per block). Implementating the occupancy calculation is difficult, so very few programmers take this approach, instead using the occupancy calculator spreadsheet included with the CUDA Toolkit to find good block sizes for each supported GPU architecture.

CUDA 6.5 includes several new runtime functions to aid in occupancy calculations and launch configuration. The core occupancy calculator API, cudaOccupancyMaxActiveBlocksPerMultiprocessor produces an occupancy prediction based on the block size and shared memory usage of a kernel. This function reports occupancy in terms of the number of concurrent thread blocks per multiprocessor. Note that this value can be converted to other metrics. Multiplying by the number of warps per block yields the number of concurrent warps per multiprocessor; further dividing concurrent warps by max warps per multiprocessor gives the occupancy as a percentage.

# Low-Power Sensing and Autonomy With NVIDIA Jetson TK1

NVIDIA’s Tegra K1 (TK1) is the first ARM system-on-chip (SoC) with integrated CUDA.  With 192 Kepler GPU cores and four ARM Cortex-A15 cores delivering a total of 327 GFLOPS of compute performance, TK1 has the capacity to process lots of data with CUDA while typically drawing less than 6W of power (including the SoC and DRAM).  This brings game-changing performance to low-SWaP (Size, Weight and Power) and small form factor (SFF) applications in the sub-10W domain, all the while supporting a developer-friendly Ubuntu Linux software environment delivering an experience more like that of a desktop rather than an embedded SoC.

Tegra K1 is plug-and-play and can stream high-bandwidth peripherals, sensors, and network interfaces via built-in USB 3.0 and PCIe gen2 x4/x1 ports.  TK1 is geared for sensor processing and offers additional hardware-accelerated functionality asynchronous to CUDA, like H.264 encoding and decoding engines and dual MIPI CSI-2 camera interfaces and image service processors (ISP).  There are many exciting embedded applications for TK1 which leverage its natural ability as a media processor and low-power platform for quickly integrating devices and sensors.

As GPU acceleration is particularly well-suited for data-parallel tasks like imaging, signal processing, autonomy and machine learning, Tegra K1 extends these capabilities into the sub-10W domain.  Code portability is now maintained from NVIDIA’s high-end Tesla HPC accelerators and the GeForce and Quadro discrete GPUs, all the way down through the low-power TK1.   A full build of the CUDA 6 toolkit is available for TK1, including samples, math libraries such as cuFFT, cuBLAS, and NPP, and NVIDIA’s NVCC compiler.  Developers can compile CUDA code natively on TK1 or cross-compile from a Linux development machine.  Availability of the CUDA libraries and development tools ensures seamless and effortless scalability between deploying CUDA applications on discrete GPUs and on Tegra.  There’s also OpenCV4Tegra available as well as NVIDIA’s VisionWorks toolkit.  Additionally the Ubuntu 14.04 repository is rich in pre-built packages for the ARM architecture, minimizing time spent tracking down and building dependencies.  In many instances applications can be simply recompiled for ARM with little modification, as long as source is available and doesn’t explicitly call out x86-specific instructions like SSE, AVX, or x86-ASM. NEON is ARM’s version of SIMD extensions for Cortex-A series CPUs.

# CUDA Pro Tip: Profiling MPI Applications

When I profile MPI+CUDA applications, sometimes performance issues only occur for certain MPI ranks. To fix these, it’s necessary to identify the MPI rank where the performance issue occurs. Before CUDA 6.5 it was hard to do this because the CUDA profiler only shows the PID of the processes and leaves the developer to figure out the mapping from PIDs to MPI ranks. Although the mapping can be done manually, for example for OpenMPI via the command-line option --display-map, it’s tedious and error prone. A solution which solves this for the command-line output of nvprof is described here http://www.parallel-computing.pro/index.php/9-cuda/5-sorting-cuda-profiler-output-of-the-mpi-cuda-program . In this post I will describe how the new output file naming of nvprof to be introduced with CUDA 6.5 can be used to conveniently analyze the performance of a MPI+CUDA application with nvprof and the NVIDIA Visual Profiler (nvvp).

## Profiling MPI applications with nvprof and nvvp

### Collecting data with nvprof

nvprof supports dumping the profile to a file which can be later imported into nvvp. To generate a profile for a MPI+CUDA application I simply start nvprof with the MPI launcher and up to CUDA 6 I used the string “%p” in the output file name. nvprof automatically replaces that string with the PID and generates a separate file for each MPI rank. With CUDA 6.5, the string “%q{ENV}” can be used to name the output file of nvprof. This allows us to include the MPI rank in the output file name by utilizing environment variables automatically set by the MPI launcher (mpirun or mpiexec). E.g. for OpenMPI OMPI_COMM_WORLD_RANK is set to the MPI rank for each launched process.

\$ mpirun -np 2 nvprof -o simpleMPI.%q{OMPI_COMM_WORLD_RANK}.nvprof ./simpleMPI
Running on 2 nodes
==18811== NVPROF is profiling process 18811, command: ./simpleMPI
==18813== NVPROF is profiling process 18813, command: ./simpleMPI
Average of square roots is: 0.667279
PASSED
==18813== Generated result file: simpleMPI.1.nvprof
==18811== Generated result file: simpleMPI.0.nvprof

# CUDA Spotlight: GPU-Accelerated Nanotechnology

Our Spotlight is on Dr. Mark Bathe, Associate Professor of Biological Engineering at the Massachusetts Institute of Technology.

Mark’s lab focuses on in silico design and programming of synthetic nucleic acid scaffolds for engineering light-harvesting antennas, multi-enzyme cascades, cellular delivery vehicles, and fluorescent biomolecular probes, which he assays using innovative quantitative imaging techniques.

The following is an excerpt from our interview (read the complete Spotlight here).

_____________________________________

NVIDIA: Mark, tell us about your work with structural nucleic acids and DNA nanotechnology.
Mark: DNA is best known to us as the molecule of life: It stores our genetic information and transmits that information from generation to generation.

A lesser known, powerful alternative use for DNA is that of a programmable structural element for engineering molecular scaffolds of precise shape and size at the nanometer-scale.

This molecular engineering paradigm dates back to early work by Nadrian Seeman in the 1980s, when he demonstrated theoretically that DNA could be programmed to form large-scale synthetic assemblies due to its unique and highly specific basepairing properties.

Since that landmark work, the field of molecular engineering using nucleic acids has witnessed explosive growth. Unlike proteins, DNA is highly programmable structurally because it can be designed to robustly self-assemble into large-scale molecular architectures of precise nanometer-scale structural features, dimensions, and mechanical properties.

These assemblies can subsequently be functionalized chemically using lipids, dyes, and proteins for diverse applications in biomolecular science and technology.

The rapidly decreasing cost of synthetic DNA, together with rational computational design rules, now enable a plethora of structured nanoscale materials to be designed, with the ultimate aim of replicating the function of biological protein assemblies that have evolved over billions of years.