# Calling CUDA-accelerated Libraries from MATLAB: A Computer Vision Example

In an earlier post we showed how MATLAB® can support CUDA kernel prototyping and development by providing an environment for quick evaluation and visualization using the CUDAKernel object. In this post I will show you how to integrate an existing library of both host and device code implemented in C++ or another CUDA-accelerated language using MEX. With MEX you can extend and customize MATLAB, or use MATLAB as a test environment for your production code.

The MATLAB MEX compiler allows you to expose your libraries to the MATLAB environment as functions. You write your entry point in C, C++ or Fortran as a modified main() function which MATLAB invokes. MEX provides a framework for compiling this code, as well as an API for interacting with MATLAB and MATLAB data in your source code.

MATLAB’s Parallel Computing Toolbox™ provides constructs for compiling CUDA C and C++ with nvcc, and new APIs for accessing and using the gpuArray datatype which represents data stored on the GPU as a numeric array in the MATLAB workspace.

## Feature Detection Example

I am going to use a feature detection example from MATLAB’s documentation for Computer Vision System Toolbox™. This uses tracked features to remove camera shake from an in-car road video. You will need MATLAB®, Parallel Computing Toolbox™, Image Processing Toolbox™ and Computer Vision System Toolbox™ to run the code. You can request a trial of these products at www.mathworks.com/trial. This example also depends on the OpenCV Computer Vision library, compiled with CUDA support.

Features are an essential prerequisite for many Computer Vision tasks; in this case, for instance, they might also be used to determine the motion of the car or to track other cars on the road.

To set up the example environment, I am using the following MATLAB code to load the video and display the first two frames superimposed (Figure 1). Continue reading

# CUDA Spotlight: Women and CUDA

Today we are pleased to launch the new Women and CUDA website.

We received a wide variety of entries from around the world, representing professors, students, researchers, scientists and domain experts. We had recognized several participants earlier as CUDA Spotlights, including Valerie Halyo of Princeton and Monica Syal of Advanced Rotorcraft Technology.

CUDA Fellow Lorena Barba comments: “This is a good way to remind people that women write code, participate in open-source projects, and invent things. It’s important to make the technology world more attractive to female students.” Dr. Barba is an associate professor of Engineering and Applied Science at George Washington University.

Congrats to the women on our inaugural Women and CUDA list:

Sonia Lopez Alarcon, Rochester Institute of Technology | Heterogeneous Computing
Rommie Amaro, University of California, San Diego | Biological Systems
Michela Becchi, University of Missouri | GPU Virtualization Continue reading

# Accelerating Graph Betweenness Centrality with CUDA

Graph analysis is a fundamental tool for domains as diverse as social networks, computational biology, and machine learning. Real-world applications of graph algorithms involve tremendously large networks that cannot be inspected manually. Betweenness Centrality (BC) is a popular analytic that determines vertex influence in a graph. It has many practical use cases, including finding the best locations for stores within cities, power grid contingency analysis, and community detection. Unfortunately, the fastest known algorithm for computing betweenness centrality has $O(mn)$ time complexity for graphs with $n$ vertices and $m$ edges, making the analysis of large networks challenging.

This post describes how we used CUDA and NVIDIA GPUs to accelerate the BC computation, and how choosing efficient parallelization strategies results in an average speedup of 2.7x, and more than 10x speedup for road networks and meshes versus a naïve edge-parallel strategy.

Betweenness Centrality determines the importance of vertices in a network by measuring the ratio of shortest paths passing through a particular vertex to the total number of shortest paths between all pairs of vertices. Intuitively, this ratio determines how well a vertex connects pairs of vertices in the network. Formally, the Betweenness Centrality of a vertex $v$ is defined as:

$BC(v) = \sum_{s \neq t \neq v} \frac{\sigma_{st}(v)}{\sigma_{st}}$

where $\sigma_{st}$ is the number of shortest paths between vertices $s$ and $t$ and $\sigma_{st}(v)$ is the number of those shortest paths that pass through $v$. Consider Figure 1 above. Vertex 4 is the only vertex that lies on paths from its left (vertices 5 through 9) to its right (vertices 1 through 3). Hence vertex 4 lies on all the shortest paths between these pairs of vertices and has a high BC score. In contrast, vertex 9 does not belong on a path between any pair of the remaining vertices and thus it has a BC score of 0. Continue reading

# Low-Power Sensing and Autonomy With NVIDIA Jetson TK1

NVIDIA’s Tegra K1 (TK1) is the first ARM system-on-chip (SoC) with integrated CUDA.  With 192 Kepler GPU cores and four ARM Cortex-A15 cores delivering a total of 327 GFLOPS of compute performance, TK1 has the capacity to process lots of data with CUDA while typically drawing less than 6W of power (including the SoC and DRAM).  This brings game-changing performance to low-SWaP and small form factor (SFF) applications in the sub-10W domain, all the while supporting a developer-friendly Ubuntu Linux software environment delivering an experience more like that of a desktop rather than an embedded SoC.  Tegra K1 is plug-and-play and can stream high-bandwidth peripherals, sensors, and network interfaces via built-in USB 3.0 and PCIe gen2 x4/x1 ports.  TK1 is geared for sensor processing and offers additional hardware-accelerated functionality asynchronous to CUDA, like H.264 encoding and decoding engines and dual MIPI CSI-2 camera interfaces and image service processors (ISP).  There are many exciting embedded applications for TK1 which leverage its natural ability as a media processor and low-power platform for quickly integrating devices and sensors.

As GPU acceleration is particularly well-suited for data-parallel tasks like imaging, signal processing, autonomy and machine learning, Tegra K1 extends these capabilities into the sub-10W domain.  Code portability is now maintained from NVIDIA’s high-end Tesla HPC accelerators and the GeForce and Quadro discrete GPUs, all the way down through the low-power TK1.   A full build of the CUDA 6 toolkit is available for TK1, including samples, math libraries such as cuFFT, cuBLAS, and NPP, and NVIDIA’s NVCC compiler.  Developers can compile CUDA code natively on TK1 or cross-compile from a Linux development machine.  Availability of the CUDA libraries and development tools ensures seamless and effortless scalability between deploying CUDA applications on discrete GPUs and on Tegra.  There’s also OpenCV4Tegra available as well as NVIDIA’s VisionWorks toolkit.  Additionally the Ubuntu 14.04 repository is rich in pre-built packages for the ARM architecture, minimizing time spent tracking down and building dependencies.  In many instances applications can be simply recompiled for ARM with little modification, as long as source is available and doesn’t explicitly call out x86-specific instructions like SSE, AVX, or x86-ASM. NEON is ARM’s version of SIMD extensions for Cortex-A series CPUs.

# CUDA Pro Tip: Profiling MPI Applications

When I profile MPI+CUDA applications, sometimes performance issues only occur for certain MPI ranks. To fix these, it’s necessary to identify the MPI rank where the performance issue occurs. Before CUDA 6.5 it was hard to do this because the CUDA profiler only shows the PID of the processes and leaves the developer to figure out the mapping from PIDs to MPI ranks. Although the mapping can be done manually, for example for OpenMPI via the command-line option --display-map, it’s tedious and error prone. A solution which solves this for the command-line output of nvprof is described here http://www.parallel-computing.pro/index.php/9-cuda/5-sorting-cuda-profiler-output-of-the-mpi-cuda-program . In this post I will describe how the new output file naming of nvprof to be introduced with CUDA 6.5 can be used to conveniently analyze the performance of a MPI+CUDA application with nvprof and the NVIDIA Visual Profiler (nvvp).

## Profiling MPI applications with nvprof and nvvp

### Collecting data with nvprof

nvprof supports dumping the profile to a file which can be later imported into nvvp. To generate a profile for a MPI+CUDA application I simply start nvprof with the MPI launcher and up to CUDA 6 I used the string “%p” in the output file name. nvprof automatically replaces that string with the PID and generates a separate file for each MPI rank. With CUDA 6.5, the string “%q{ENV}” can be used to name the output file of nvprof. This allows us to include the MPI rank in the output file name by utilizing environment variables automatically set by the MPI launcher (mpirun or mpiexec). E.g. for OpenMPI OMPI_COMM_WORLD_RANK is set to the MPI rank for each launched process.

\$ mpirun -np 2 nvprof -o simpleMPI.%q{OMPI_COMM_WORLD_RANK}.nvprof ./simpleMPI
Running on 2 nodes
==18811== NVPROF is profiling process 18811, command: ./simpleMPI
==18813== NVPROF is profiling process 18813, command: ./simpleMPI
Average of square roots is: 0.667279
PASSED
==18813== Generated result file: simpleMPI.1.nvprof
==18811== Generated result file: simpleMPI.0.nvprof

# CUDA Spotlight: GPU-Accelerated Nanotechnology

Our Spotlight is on Dr. Mark Bathe, Associate Professor of Biological Engineering at the Massachusetts Institute of Technology.

Mark’s lab focuses on in silico design and programming of synthetic nucleic acid scaffolds for engineering light-harvesting antennas, multi-enzyme cascades, cellular delivery vehicles, and fluorescent biomolecular probes, which he assays using innovative quantitative imaging techniques.

The following is an excerpt from our interview (read the complete Spotlight here).

_____________________________________

NVIDIA: Mark, tell us about your work with structural nucleic acids and DNA nanotechnology.
Mark: DNA is best known to us as the molecule of life: It stores our genetic information and transmits that information from generation to generation.

A lesser known, powerful alternative use for DNA is that of a programmable structural element for engineering molecular scaffolds of precise shape and size at the nanometer-scale.

This molecular engineering paradigm dates back to early work by Nadrian Seeman in the 1980s, when he demonstrated theoretically that DNA could be programmed to form large-scale synthetic assemblies due to its unique and highly specific basepairing properties.

Since that landmark work, the field of molecular engineering using nucleic acids has witnessed explosive growth. Unlike proteins, DNA is highly programmable structurally because it can be designed to robustly self-assemble into large-scale molecular architectures of precise nanometer-scale structural features, dimensions, and mechanical properties.

These assemblies can subsequently be functionalized chemically using lipids, dyes, and proteins for diverse applications in biomolecular science and technology.

The rapidly decreasing cost of synthetic DNA, together with rational computational design rules, now enable a plethora of structured nanoscale materials to be designed, with the ultimate aim of replicating the function of biological protein assemblies that have evolved over billions of years.

# Drop-in Acceleration of GNU Octave

cuBLAS is an implementation of the BLAS library that leverages the teraflops of performance provided by NVIDIA GPUs.  However, cuBLAS can not be used as a direct BLAS replacement for applications originally intended to run on the CPU. In order to use the cuBLAS API:

• a CUDA context first needs to be created
• a cuBLAS handle needs to be initialized
• all relevant data needs to be copied to preallocated GPU memory, followed by deallocation after the computation

Such an API permits the fine tuning required to minimize redundant data copies to and from the GPU in arbitrarily complicated scenarios such that maximum performance is achieved.  But it is less convenient when just a few BLAS routines need to be accelerated (simple data copy) or when vast amounts of code need to be modified (large programmer effort).  In these cases it would be useful to have an API which managed the data transfer to and from the GPU automatically and could be used as a direct replacement for CPU BLAS libraries.

Additionally, there is the common case where the input matrices to the BLAS operations are too large to fit on the GPU.  While using the cuBLAS API to write a tiled BLAS implementation (which achieves even higher performance) is straightforward, a GPU BLAS library which implemented and managed such tiling in a near optimal way would certainly facilitate access to the computing power of the GPU.

To address these issues, CUDA 6 adds new Multi-GPU extensions, implemented for the most compute intensive BLAS Level 3 routines. They are called cuBLAS-XT and can work directly with host data, removing the need to manually allocate and copy data to the GPU’s memory. NVBLAS is a dynamic library built on top of these extensions which offers a transparent BLAS Level 3 acceleration with zero coding effort.  That is, CPU BLAS libraries can be directly replaced with NVBLAS.  As such, NVBLAS can be used to easily accelerate any application which uses level-3 BLAS routines.

# CUDA Pro Tip: Minimize the Tail Effect

When I work on the optimization of CUDA kernels, I sometimes see a discrepancy between Achieved and Theoretical Occupancies. The Theoretical Occupancy is the ratio between the number of threads which may run on each multiprocessor (SM) and the maximum number of executable threads per SM (2048 on the Kepler architecture). This value is estimated from the size of the blocks and the amount of resources (registers and shared memory) used by those blocks for a particular GPU and is computed without running the kernel on the GPU. The Achieved Occupancy, on the other hand, is measured from the execution of the kernel (as the number of active warps divided by the number of active cycles compared to the maximum number of executable warps).

Recently, while working on a kernel for a finance benchmark, I could see an Achieved Occupancy of 41.52% whereas the Theoretical Occupancy was 50%. In NVIDIA Nsight Visual Studio Edition, the Instruction per Clock (IPC) showed a lot of load imbalance between the different SMs with respect to the number of executed instructions by the kernel (see the left graph in the figure below).

# CUDA Spotlight: GPU-Accelerated Deep Learning

Our Spotlight is on Dr. Ren Wu, a distinguished scientist at Baidu’s Institute of Deep Learning (IDL).

He is known for his pioneering research in using GPUs to accelerate big data analytics and his contribution to large-scale clustering algorithms via the GPU. Ren was a speaker at GTC14 and was originally featured as a CUDA Spotlight in 2011 when he worked at HP Labs.

[Editor's note: On May 16, Baidu announced the hiring of Dr. Andrew Ng to lead Baidu's Silicon Valley Research Lab.]

The following is an excerpt from our interview (read the complete Spotlight here).
______________________________________________________________________________

NVIDIA: Ren, why is GPU computing important to your work?
Ren: A key factor in the progress we are making with deep learning is that we now have much greater computing resources in our hands.

Today one or two workstations with a few GPUs has the same computing power as the fastest supercomputer in the world 15 years ago, thanks to GPU computing and NVIDIA’s vision.