# CUDA Spotlight: Women and CUDA

Today we are pleased to launch the new Women and CUDA website.

We received a wide variety of entries from around the world, representing professors, students, researchers, scientists and domain experts. We had recognized several participants earlier as CUDA Spotlights, including Valerie Halyo of Princeton and Monica Syal of Advanced Rotorcraft Technology.

CUDA Fellow Lorena Barba comments: “This is a good way to remind people that women write code, participate in open-source projects, and invent things. It’s important to make the technology world more attractive to female students.” Dr. Barba is an associate professor of Engineering and Applied Science at George Washington University.

Congrats to the women on our inaugural Women and CUDA list:

Sonia Lopez Alarcon, Rochester Institute of Technology | Heterogeneous Computing
Rommie Amaro, University of California, San Diego | Biological Systems
Michela Becchi, University of Missouri | GPU Virtualization

# Accelerating Graph Betweenness Centrality with CUDA

Graph analysis is a fundamental tool for domains as diverse as social networks, computational biology, and machine learning. Real-world applications of graph algorithms involve tremendously large networks that cannot be inspected manually. Betweenness Centrality (BC) is a popular analytic that determines vertex influence in a graph. It has many practical use cases, including finding the best locations for stores within cities, power grid contingency analysis, and community detection. Unfortunately, the fastest known algorithm for computing betweenness centrality has $O(mn)$ time complexity for graphs with $n$ vertices and $m$ edges, making the analysis of large networks challenging.

This post describes how we used CUDA and NVIDIA GPUs to accelerate the BC computation, and how choosing efficient parallelization strategies results in an average speedup of 2.7x, and more than 10x speedup for road networks and meshes versus a naïve edge-parallel strategy.

Betweenness Centrality determines the importance of vertices in a network by measuring the ratio of shortest paths passing through a particular vertex to the total number of shortest paths between all pairs of vertices. Intuitively, this ratio determines how well a vertex connects pairs of vertices in the network. Formally, the Betweenness Centrality of a vertex $v$ is defined as:

$BC(v) = \sum_{s \neq t \neq v} \frac{\sigma_{st}(v)}{\sigma_{st}}$

where $\sigma_{st}$ is the number of shortest paths between vertices $s$ and $t$ and $\sigma_{st}(v)$ is the number of those shortest paths that pass through $v$. Consider Figure 1 above. Vertex 4 is the only vertex that lies on paths from its left (vertices 5 through 9) to its right (vertices 1 through 3). Hence vertex 4 lies on all the shortest paths between these pairs of vertices and has a high BC score. In contrast, vertex 9 does not belong on a path between any pair of the remaining vertices and thus it has a BC score of 0.
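
To make the definition concrete, here is a minimal sequential sketch in Python of a Brandes-style computation: one breadth-first search per source vertex, followed by a backward accumulation of path dependencies. It is not the GPU implementation from the post. The function name `betweenness_centrality` and the five-vertex example graph are assumptions made for illustration (the graph is not the one in Figure 1). The sum runs over ordered pairs $(s, t)$, so on an undirected graph each unordered pair is counted twice.

```python
from collections import deque

def betweenness_centrality(adj):
    """Unnormalized BC for an unweighted graph given as adjacency lists.

    Runs one BFS per source vertex, so the total cost is O(mn).
    """
    n = len(adj)
    bc = [0.0] * n
    for s in range(n):
        sigma = [0] * n              # sigma[v]: number of shortest s-v paths
        dist = [-1] * n              # BFS distance from s (-1 = unvisited)
        preds = [[] for _ in range(n)]
        sigma[s], dist[s] = 1, 0
        order = []                   # vertices in non-decreasing distance
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:      # first time w is reached
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:   # v precedes w on a shortest path
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest vertices, accumulating dependencies.
        delta = [0.0] * n
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc

# Hypothetical undirected graph with edges 0-1, 1-2, 2-3, 2-4.
adj = [[1], [0, 2], [1, 3, 4], [2], [2]]
scores = betweenness_centrality(adj)
```

In this example vertex 2 lies on the only path between five unordered vertex pairs and vertex 1 on three, so they receive scores of 10 and 6, while the three leaf vertices score 0.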

# CUDA Pro Tip: Occupancy API Simplifies Launch Configuration

CUDA programmers often need to decide on a block size to use for a kernel launch. For key kernels, it's important to understand the constraints of the kernel and the GPU it is running on to choose a block size that will result in good performance. One common heuristic used to choose a good block size is to aim for high occupancy, which is the ratio of the number of active warps per multiprocessor to the maximum number of warps that can be active on the multiprocessor at once. Higher occupancy does not always mean higher performance, but it is a useful metric for gauging the latency hiding ability of a kernel.

Before CUDA 6.5, calculating occupancy was tricky. It required implementing a complex computation that took account of the present GPU and its capabilities (including register file and shared memory size), and the properties of the kernel (shared memory usage, registers per thread, threads per block). Implementing the occupancy calculation is difficult, so very few programmers take this approach, instead using the occupancy calculator spreadsheet included with the CUDA Toolkit to find good block sizes for each supported GPU architecture.

CUDA 6.5 includes several new runtime functions to aid in occupancy calculations and launch configuration. The core occupancy calculator API, cudaOccupancyMaxActiveBlocksPerMultiprocessor, produces an occupancy prediction based on the block size and shared memory usage of a kernel. This function reports occupancy in terms of the number of concurrent thread blocks per multiprocessor. Note that this value can be converted to other metrics. Multiplying by the number of warps per block yields the number of concurrent warps per multiprocessor; further dividing concurrent warps by max warps per multiprocessor gives the occupancy as a percentage.
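
The conversion from concurrent blocks to occupancy is plain arithmetic, sketched below in Python. The helper name `occupancy` and the sample inputs (4 active blocks of 256 threads, 64 maximum warps per multiprocessor) are assumptions chosen for illustration. In a real program the first value would come from cudaOccupancyMaxActiveBlocksPerMultiprocessor and the last from the device properties.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs

def occupancy(active_blocks_per_sm, block_size, max_warps_per_sm):
    """Convert concurrent blocks per multiprocessor into an occupancy ratio."""
    # A partially filled warp still occupies a full warp slot, so round up.
    warps_per_block = (block_size + WARP_SIZE - 1) // WARP_SIZE
    active_warps = active_blocks_per_sm * warps_per_block
    return active_warps / max_warps_per_sm

# Hypothetical inputs: 4 blocks of 256 threads resident per multiprocessor,
# on a device that supports at most 64 resident warps per multiprocessor.
ratio = occupancy(active_blocks_per_sm=4, block_size=256, max_warps_per_sm=64)
```

With these inputs, 4 blocks of 8 warps give 32 concurrent warps out of 64, an occupancy of 0.5 (50%).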

# Low-Power Sensing and Autonomy With NVIDIA Jetson TK1

NVIDIA’s Tegra K1 (TK1) is the first ARM system-on-chip (SoC) with integrated CUDA. With 192 Kepler GPU cores and four ARM Cortex-A15 cores delivering a total of 327 GFLOPS of compute performance, TK1 can process large volumes of data with CUDA while typically drawing less than 6W of power (including the SoC and DRAM). This brings game-changing performance to low-SWaP (size, weight, and power) and small form factor (SFF) applications in the sub-10W domain, while supporting a developer-friendly Ubuntu Linux software environment that feels more like a desktop than an embedded SoC. Tegra K1 is plug-and-play and can stream high-bandwidth peripherals, sensors, and network interfaces via built-in USB 3.0 and PCIe gen2 x4/x1 ports. TK1 is geared for sensor processing and offers additional hardware-accelerated functionality that runs asynchronously to CUDA, such as H.264 encoding and decoding engines, dual MIPI CSI-2 camera interfaces, and image signal processors (ISPs). Many exciting embedded applications for TK1 leverage its natural ability as a media processor and low-power platform for quickly integrating devices and sensors.

GPU acceleration is particularly well-suited for data-parallel tasks like imaging, signal processing, autonomy and machine learning, and Tegra K1 extends these capabilities into the sub-10W domain. Code portability is now maintained from NVIDIA’s high-end Tesla HPC accelerators and the GeForce and Quadro discrete GPUs, all the way down through the low-power TK1. A full build of the CUDA 6 toolkit is available for TK1, including samples, math libraries such as cuFFT, cuBLAS, and NPP, and NVIDIA’s NVCC compiler. Developers can compile CUDA code natively on TK1 or cross-compile from a Linux development machine. Because the same CUDA libraries and development tools are available on both platforms, CUDA applications can move between discrete GPUs and Tegra with little effort. OpenCV4Tegra and NVIDIA’s VisionWorks toolkit are also available. Additionally, the Ubuntu 14.04 repository is rich in pre-built packages for the ARM architecture, minimizing time spent tracking down and building dependencies. In many instances applications can simply be recompiled for ARM with little modification, as long as the source is available and doesn’t explicitly call out x86-specific instructions like SSE, AVX, or x86-ASM. NEON is ARM’s version of SIMD extensions for Cortex-A series CPUs.

# CUDA Pro Tip: Profiling MPI Applications

When I profile MPI+CUDA applications, sometimes performance issues only occur for certain MPI ranks. To fix these, it’s necessary to identify the MPI rank where the performance issue occurs. Before CUDA 6.5 it was hard to do this because the CUDA profiler only shows the PID of each process and leaves the developer to figure out the mapping from PIDs to MPI ranks. Although the mapping can be done manually, for example for OpenMPI via the command-line option --display-map, it’s tedious and error prone. A solution for the command-line output of nvprof is described at http://www.parallel-computing.pro/index.php/9-cuda/5-sorting-cuda-profiler-output-of-the-mpi-cuda-program . In this post I will describe how the new output file naming of nvprof to be introduced with CUDA 6.5 can be used to conveniently analyze the performance of an MPI+CUDA application with nvprof and the NVIDIA Visual Profiler (nvvp).

## Profiling MPI applications with nvprof and nvvp

### Collecting data with nvprof

nvprof supports dumping the profile to a file which can later be imported into nvvp. To generate a profile for an MPI+CUDA application I simply start nvprof with the MPI launcher. Up to CUDA 6, I used the string “%p” in the output file name. nvprof automatically replaces that string with the PID and generates a separate file for each MPI rank. With CUDA 6.5, the string “%q{ENV}” can be used to name the output file of nvprof. This allows us to include the MPI rank in the output file name by utilizing environment variables automatically set by the MPI launcher (mpirun or mpiexec). For example, OpenMPI sets OMPI_COMM_WORLD_RANK to the MPI rank of each launched process.

\$ mpirun -np 2 nvprof -o simpleMPI.%q{OMPI_COMM_WORLD_RANK}.nvprof ./simpleMPI
Running on 2 nodes
==18811== NVPROF is profiling process 18811, command: ./simpleMPI
==18813== NVPROF is profiling process 18813, command: ./simpleMPI
Average of square roots is: 0.667279
PASSED
==18813== Generated result file: simpleMPI.1.nvprof
==18811== Generated result file: simpleMPI.0.nvprof
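
The following Python sketch mimics how the “%p” and “%q{ENV}” placeholders turn into the file names shown in the output above. It only illustrates the naming scheme and is not nvprof's implementation. The helper name `expand_output_name` is hypothetical, and substituting an empty string for an unset variable is an assumption rather than documented nvprof behavior.

```python
import os
import re

def expand_output_name(template, pid, env=None):
    """Expand %p (process ID) and %q{VAR} (environment variable) placeholders."""
    env = os.environ if env is None else env
    name = template.replace("%p", str(pid))
    # Assumption: an unset variable expands to the empty string.
    return re.sub(r"%q\{(\w+)\}", lambda m: env.get(m.group(1), ""), name)

# Values taken from the run above: the process with PID 18813 is MPI rank 1.
by_rank = expand_output_name(
    "simpleMPI.%q{OMPI_COMM_WORLD_RANK}.nvprof",
    pid=18813,
    env={"OMPI_COMM_WORLD_RANK": "1"},
)
by_pid = expand_output_name("simpleMPI.%p.nvprof", pid=18813, env={})
```

Here `by_rank` is `simpleMPI.1.nvprof`, which identifies the rank directly, while `by_pid` is `simpleMPI.18813.nvprof`, which still has to be mapped back to a rank by hand.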

# CUDA Spotlight: GPU-Accelerated Nanotechnology

Our Spotlight is on Dr. Mark Bathe, Associate Professor of Biological Engineering at the Massachusetts Institute of Technology.

Mark’s lab focuses on in silico design and programming of synthetic nucleic acid scaffolds for engineering light-harvesting antennas, multi-enzyme cascades, cellular delivery vehicles, and fluorescent biomolecular probes, which he assays using innovative quantitative imaging techniques.

The following is an excerpt from our interview (read the complete Spotlight here).

_____________________________________

NVIDIA: Mark, tell us about your work with structural nucleic acids and DNA nanotechnology.
Mark: DNA is best known to us as the molecule of life: It stores our genetic information and transmits that information from generation to generation.

A lesser known, powerful alternative use for DNA is that of a programmable structural element for engineering molecular scaffolds of precise shape and size at the nanometer-scale.

This molecular engineering paradigm dates back to early work by Nadrian Seeman in the 1980s, when he demonstrated theoretically that DNA could be programmed to form large-scale synthetic assemblies due to its unique and highly specific basepairing properties.

Since that landmark work, the field of molecular engineering using nucleic acids has witnessed explosive growth. Unlike proteins, DNA is highly programmable structurally because it can be designed to robustly self-assemble into large-scale molecular architectures of precise nanometer-scale structural features, dimensions, and mechanical properties.

These assemblies can subsequently be functionalized chemically using lipids, dyes, and proteins for diverse applications in biomolecular science and technology.

The rapidly decreasing cost of synthetic DNA, together with rational computational design rules, now enable a plethora of structured nanoscale materials to be designed, with the ultimate aim of replicating the function of biological protein assemblies that have evolved over billions of years.

# A CUDA Dynamic Parallelism Case Study: PANDA

This post concludes an introductory series on CUDA Dynamic Parallelism. In my first post, I introduced Dynamic Parallelism by using it to compute images of the Mandelbrot set using recursive subdivision, resulting in large increases in performance and efficiency. The second post is an in-depth tutorial on the ins and outs of programming with Dynamic Parallelism, including synchronization, streams, memory consistency, and limits. In this post, I finish the series with a case study on an online track reconstruction algorithm for the high-energy physics PANDA experiment, part of the Facility for Antiproton and Ion Research in Europe (FAIR). The PANDA work was carried out in the scope of the NVIDIA Application Lab at Jülich.

## The PANDA Experiment

PANDA (= anti-Proton ANnihilation at DArmstadt) is a state-of-the-art hadron particle physics experiment currently under construction at FAIR (Facility for Antiproton and Ion Research) at Darmstadt. It is scheduled to start operation in 2019.

Inside the PANDA experiment, accelerated antiprotons will collide with protons, forming intermediate and unstable particles (mesons, baryons etc.), which will decay in cascades into stable particles, like electrons and photons. The unstable particles are of particular interest for PANDA, as they give insight into the processes governing this physics regime (QCD). Reconstructing all involved constituent particles of an event lets the physicists form a picture of the process, eventually confirming established physics theories, probing new ones and potentially finding exciting and unexpected results.

# Drop-in Acceleration of GNU Octave

cuBLAS is an implementation of the BLAS library that leverages the teraflops of performance provided by NVIDIA GPUs. However, cuBLAS cannot be used as a direct BLAS replacement for applications originally intended to run on the CPU. In order to use the cuBLAS API:

• a CUDA context first needs to be created
• a cuBLAS handle needs to be initialized
• all relevant data needs to be copied to preallocated GPU memory, followed by deallocation after the computation

Such an API permits the fine tuning required to minimize redundant data copies to and from the GPU in arbitrarily complicated scenarios such that maximum performance is achieved.  But it is less convenient when just a few BLAS routines need to be accelerated (simple data copy) or when vast amounts of code need to be modified (large programmer effort).  In these cases it would be useful to have an API which managed the data transfer to and from the GPU automatically and could be used as a direct replacement for CPU BLAS libraries.

Additionally, there is the common case where the input matrices to the BLAS operations are too large to fit on the GPU.  While using the cuBLAS API to write a tiled BLAS implementation (which achieves even higher performance) is straightforward, a GPU BLAS library which implemented and managed such tiling in a near optimal way would certainly facilitate access to the computing power of the GPU.
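
As a rough picture of what tiling means here, the pure-Python sketch below multiplies two matrices one tile at a time, so that only three small blocks (one each of A, B, and C) are in use at any moment. It is an illustration of the idea only and says nothing about how a GPU BLAS library schedules or transfers tiles. The function name `matmul_tiled` and the tile size are assumptions.

```python
def matmul_tiled(a, b, tile=2):
    """Compute C = A * B by iterating over tile-sized blocks of the operands."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):            # block row of C
        for j0 in range(0, m, tile):        # block column of C
            for k0 in range(0, k, tile):    # matching blocks of A and B
                # On a GPU, these blocks would be the data copied to the device.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = c[i][j]
                        for kk in range(k0, min(k0 + tile, k)):
                            acc += a[i][kk] * b[kk][j]
                        c[i][j] = acc
    return c

a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
product = matmul_tiled(a, b, tile=2)
```

Because each tile of C accumulates partial products from successive tiles of A and B, the result matches an untiled multiplication regardless of the tile size.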

To address these issues, CUDA 6 adds new Multi-GPU extensions, implemented for the most compute intensive BLAS Level 3 routines. They are called cuBLAS-XT and can work directly with host data, removing the need to manually allocate and copy data to the GPU’s memory. NVBLAS is a dynamic library built on top of these extensions which offers a transparent BLAS Level 3 acceleration with zero coding effort.  That is, CPU BLAS libraries can be directly replaced with NVBLAS.  As such, NVBLAS can be used to easily accelerate any application which uses level-3 BLAS routines.

# CUDA Pro Tip: Minimize the Tail Effect

When I work on the optimization of CUDA kernels, I sometimes see a discrepancy between Achieved and Theoretical Occupancies. The Theoretical Occupancy is the ratio between the number of threads which may run on each multiprocessor (SM) and the maximum number of executable threads per SM (2048 on the Kepler architecture). This value is estimated from the size of the blocks and the amount of resources (registers and shared memory) used by those blocks for a particular GPU and is computed without running the kernel on the GPU. The Achieved Occupancy, on the other hand, is measured from the execution of the kernel (as the number of active warps divided by the number of active cycles compared to the maximum number of executable warps).

Recently, while working on a kernel for a finance benchmark, I saw an Achieved Occupancy of 41.52% whereas the Theoretical Occupancy was 50%. In NVIDIA Nsight Visual Studio Edition, the Instructions per Clock (IPC) graph showed a lot of load imbalance between the different SMs with respect to the number of instructions executed by the kernel (see the left graph in the figure below).

# Accelerating a C++ CFD code with OpenACC

Computational Fluid Dynamics (CFD) is a valuable tool to study the behavior of fluids. Today, many areas of engineering use CFD. For example, the automotive industry uses CFD to study airflow around cars, and to optimize the car body shapes to reduce drag and improve fuel efficiency. To get accurate results in fluid simulation it is necessary to capture complex phenomena such as turbulence, which requires very accurate models. These complex models result in very long computing times. In this post I describe how I used OpenACC to accelerate the ZFS C++ CFD solver with NVIDIA Tesla GPUs.

## The ZFS flow solver

The C++ flow solver ZFS (Zonal Flow Solver) is developed at the Institute of Aerodynamics at RWTH Aachen, Germany. ZFS solves the unsteady Navier-Stokes equations for compressible flows on automatically generated hierarchical Cartesian grids with a fully-conservative second-order-accurate finite-volume method [1, 2, 3]. To integrate the flow equations in time ZFS uses a 5-step Runge-Kutta method with dual time stepping [2]. It imposes boundary conditions using a ghost-cell method [4] that can handle multiple ghost cells [5, 6]. ZFS supports complex moving boundaries which are sharply discretized using a cut-cell type immersed-boundary method [1, 2, 7].

Among other topics, scientists have used ZFS to study the flow within an internal combustion engine with moving pistons and valves, as Figure 1 shows. Figure 2 shows how the Lattice-Boltzmann solver in ZFS was used to better understand airflow within the human nasal cavity.