Figure 4: MMTI and trainable HoG pedestrian/vehicle detectors extract dynamic obstacles from HD video at runtime

Low-Power Sensing and Autonomy With NVIDIA Jetson TK1

Figure 1: simple TK1 block diagram
Figure 1: simple TK1 block diagram

NVIDIA’s Tegra K1 (TK1) is the first ARM system-on-chip (SoC) with integrated CUDA.  With 192 Kepler GPU cores and four ARM Cortex-A15 cores delivering a total of 327 GFLOPS of compute performance, TK1 has the capacity to process lots of data with CUDA while typically drawing less than 6W of power (including the SoC and DRAM).  This brings game-changing performance to low-SWaP (Size, Weight and Power) and small form factor (SFF) applications in the sub-10W domain, all the while supporting a developer-friendly Ubuntu Linux software environment delivering an experience more like that of a desktop rather than an embedded SoC.

Tegra K1 is plug-and-play and can stream high-bandwidth peripherals, sensors, and network interfaces via built-in USB 3.0 and PCIe gen2 x4/x1 ports.  TK1 is geared for sensor processing and offers additional hardware-accelerated functionality asynchronous to CUDA, like H.264 encoding and decoding engines and dual MIPI CSI-2 camera interfaces and image service processors (ISP).  There are many exciting embedded applications for TK1 which leverage its natural ability as a media processor and low-power platform for quickly integrating devices and sensors.

As GPU acceleration is particularly well-suited for data-parallel tasks like imaging, signal processing, autonomy and machine learning, Tegra K1 extends these capabilities into the sub-10W domain.  Code portability is now maintained from NVIDIA’s high-end Tesla HPC accelerators and the GeForce and Quadro discrete GPUs, all the way down through the low-power TK1.   A full build of the CUDA 6 toolkit is available for TK1, including samples, math libraries such as cuFFT, cuBLAS, and NPP, and NVIDIA’s NVCC compiler.  Developers can compile CUDA code natively on TK1 or cross-compile from a Linux development machine.  Availability of the CUDA libraries and development tools ensures seamless and effortless scalability between deploying CUDA applications on discrete GPUs and on Tegra.  There’s also OpenCV4Tegra available as well as NVIDIA’s VisionWorks toolkit.  Additionally the Ubuntu 14.04 repository is rich in pre-built packages for the ARM architecture, minimizing time spent tracking down and building dependencies.  In many instances applications can be simply recompiled for ARM with little modification, as long as source is available and doesn’t explicitly call out x86-specific instructions like SSE, AVX, or x86-ASM. NEON is ARM’s version of SIMD extensions for Cortex-A series CPUs.
Continue reading

cuda_pro_tip

CUDA Pro Tip: Profiling MPI Applications

When I profile MPI+CUDA applications, sometimes performance issues only occur for certain MPI ranks. To fix these, it’s necessary to identify the MPI rank where the performance issue occurs. Before CUDA 6.5 it was hard to do this because the CUDA profiler only shows the PID of the processes and leaves the developer to figure out the mapping from PIDs to MPI ranks. Although the mapping can be done manually, for example for OpenMPI via the command-line option --display-map, it’s tedious and error prone. A solution which solves this for the command-line output of nvprof is described here http://www.parallel-computing.pro/index.php/9-cuda/5-sorting-cuda-profiler-output-of-the-mpi-cuda-program . In this post I will describe how the new output file naming of nvprof to be introduced with CUDA 6.5 can be used to conveniently analyze the performance of a MPI+CUDA application with nvprof and the NVIDIA Visual Profiler (nvvp).

Profiling MPI applications with nvprof and nvvp

Collecting data with nvprof

nvprof supports dumping the profile to a file which can be later imported into nvvp. To generate a profile for a MPI+CUDA application I simply start nvprof with the MPI launcher and up to CUDA 6 I used the string “%p” in the output file name. nvprof automatically replaces that string with the PID and generates a separate file for each MPI rank. With CUDA 6.5, the string “%q{ENV}” can be used to name the output file of nvprof. This allows us to include the MPI rank in the output file name by utilizing environment variables automatically set by the MPI launcher (mpirun or mpiexec). E.g. for OpenMPI OMPI_COMM_WORLD_RANK is set to the MPI rank for each launched process.

$ mpirun -np 2 nvprof -o simpleMPI.%q{OMPI_COMM_WORLD_RANK}.nvprof ./simpleMPI
Running on 2 nodes
==18811== NVPROF is profiling process 18811, command: ./simpleMPI
==18813== NVPROF is profiling process 18813, command: ./simpleMPI
Average of square roots is: 0.667279
PASSED
==18813== Generated result file: simpleMPI.1.nvprof
==18811== Generated result file: simpleMPI.0.nvprof

Continue reading

cuda_spotlight

CUDA Spotlight: GPU-Accelerated Nanotechnology

bathe-mark

Our Spotlight is on Dr. Mark Bathe, Associate Professor of Biological Engineering at the Massachusetts Institute of Technology.

Mark’s lab focuses on in silico design and programming of synthetic nucleic acid scaffolds for engineering light-harvesting antennas, multi-enzyme cascades, cellular delivery vehicles, and fluorescent biomolecular probes, which he assays using innovative quantitative imaging techniques.

The following is an excerpt from our interview (read the complete Spotlight here).

_____________________________________

NVIDIA: Mark, tell us about your work with structural nucleic acids and DNA nanotechnology.
Mark: DNA is best known to us as the molecule of life: It stores our genetic information and transmits that information from generation to generation.

A lesser known, powerful alternative use for DNA is that of a programmable structural element for engineering molecular scaffolds of precise shape and size at the nanometer-scale.

This molecular engineering paradigm dates back to early work by Nadrian Seeman in the 1980s, when he demonstrated theoretically that DNA could be programmed to form large-scale synthetic assemblies due to its unique and highly specific basepairing properties.

nanometer-diameterSince that landmark work, the field of molecular engineering using nucleic acids has witnessed explosive growth. Unlike proteins, DNA is highly programmable structurally because it can be designed to robustly self-assemble into large-scale molecular architectures of precise nanometer-scale structural features, dimensions, and mechanical properties.

These assemblies can subsequently be functionalized chemically using lipids, dyes, and proteins for diverse applications in biomolecular science and technology.

The rapidly decreasing cost of synthetic DNA, together with rational computational design rules, now enable a plethora of structured nanoscale materials to be designed, with the ultimate aim of replicating the function of biological protein assemblies that have evolved over billions of years.
Continue reading

panda-detectors

A CUDA Dynamic Parallelism Case Study: PANDA

This post concludes an introductory series on CUDA Dynamic Parallelism. In my first post, I introduced Dynamic Parallelism by using it to compute images of the Mandelbrot set using recursive subdivision, resulting in large increases in performance and efficiency. The second post is an in-depth tutorial on the ins and outs of programming with Dynamic Parallelism, including synchronization, streams, memory consistency, and limits. In this post, I finish the series with a case study on an online track reconstruction algorithm for the high-energy physics PANDA experiment part of the (Facility for Antiproton and Ion Research in Europe (FAIR)). The PANDA work was carried out in the scope of the NVIDIA Application Lab at Jülich.

The PANDA Experiment

PANDA (= anti-Proton ANnihilation at DArmstadt) is a state-of-the-art hadron particle physics experiment currently under construction at FAIR (Facility for Anti-proton and Ion Research) at Darmstadt. It is scheduled to start operation in 2019.

Figure 1: The PANDA experiment and its detectors. Image courtesy of PANDA Collaboration.

Inside the PANDA experiment, accelerated antiprotons will collide with protons, forming intermediate and unstable particles (mesons, baryons etc.), which will decay in cascades into stable particles, like electrons and photons. The unstable particles are of particular interest for PANDA, as they give insight into the processes governing this physics regime (QCD). Reconstructing all involved constituent particles of an event lets the physicists form a picture of the process, eventually confirming established physics theories, probing new ones and potentially finding exciting and unexpected results.
Continue reading

CUDALibs[1]

Drop-in Acceleration of GNU Octave

cuBLAS is an implementation of the BLAS library that leverages the teraflops of performance provided by NVIDIA GPUs.  However, cuBLAS can not be used as a direct BLAS replacement for applications originally intended to run on the CPU. In order to use the cuBLAS API:

  • a CUDA context first needs to be created
  • a cuBLAS handle needs to be initialized
  • all relevant data needs to be copied to preallocated GPU memory, followed by deallocation after the computation

Such an API permits the fine tuning required to minimize redundant data copies to and from the GPU in arbitrarily complicated scenarios such that maximum performance is achieved.  But it is less convenient when just a few BLAS routines need to be accelerated (simple data copy) or when vast amounts of code need to be modified (large programmer effort).  In these cases it would be useful to have an API which managed the data transfer to and from the GPU automatically and could be used as a direct replacement for CPU BLAS libraries.

Additionally, there is the common case where the input matrices to the BLAS operations are too large to fit on the GPU.  While using the cuBLAS API to write a tiled BLAS implementation (which achieves even higher performance) is straightforward, a GPU BLAS library which implemented and managed such tiling in a near optimal way would certainly facilitate access to the computing power of the GPU.

To address these issues, CUDA 6 adds new Multi-GPU extensions, implemented for the most compute intensive BLAS Level 3 routines. They are called cuBLAS-XT and can work directly with host data, removing the need to manually allocate and copy data to the GPU’s memory. NVBLAS is a dynamic library built on top of these extensions which offers a transparent BLAS Level 3 acceleration with zero coding effort.  That is, CPU BLAS libraries can be directly replaced with NVBLAS.  As such, NVBLAS can be used to easily accelerate any application which uses level-3 BLAS routines.
Continue reading

cuda_pro_tip

CUDA Pro Tip: Minimize the Tail Effect

When I work on the optimization of CUDA kernels, I sometimes see a discrepancy between Achieved and Theoretical Occupancies. The Theoretical Occupancy is the ratio between the number of threads which may run on each multiprocessor (SM) and the maximum number of executable threads per SM (2048 on the Kepler architecture). This value is estimated from the size of the blocks and the amount of resources (registers and shared memory) used by those blocks for a particular GPU and is computed without running the kernel on the GPU. The Achieved Occupancy, on the other hand, is measured from the execution of the kernel (as the number of active warps divided by the number of active cycles compared to the maximum number of executable warps).

Recently, while working on a kernel for a finance benchmark, I could see an Achieved Occupancy of 41.52% whereas the Theoretical Occupancy was 50%. In NVIDIA Nsight Visual Studio Edition, the Instruction per Clock (IPC) showed a lot of load imbalance between the different SMs with respect to the number of executed instructions by the kernel (see the left graph in the figure below).

Instruction per Clock (IPC) with tail effect (Left) and without (Right)
Instruction per Clock (IPC) with tail effect (Left) and without (Right)

Continue reading

openacc-logo-thumb

Accelerating a C++ CFD code with OpenACC

Computational Fluid Dynamics (CFD) is a valuable tool to study the behavior of fluids. Today, many areas of engineering use CFD. For example, the automotive industry uses CFD to study airflow around cars, and to optimize the car body shapes to reduce drag and improve fuel efficiency. To get accurate results in fluid simulation it is necessary to capture complex phenomena such as turbulence, which requires very accurate models. These complex models result in very long computing times. In this post I describe how I used OpenACC to accelerate the ZFS C++ CFD solver with NVIDIA Tesla GPUs.

The ZFS flow solver

Engine
Figure 1: Using ZFS to study fluid flow within an internal combustion engine with moving pistons and valves.

The C++ flow solver ZFS (Zonal Flow Solver) is developed at the Institute of Aerodynamics at RWTH Aachen, Germany. ZFS solves the unsteady Navier-Stokes equations for compressible flows on automatically generated hierarchical Cartesian grids with a fully-conservative second-order-accurate finite-volume method [1, 2, 3]. To integrate the flow equations in time ZFS uses a 5-step Runge-Kutta method with dual time stepping [2]. It imposes boundary conditions using a ghost-cell method [4] that can handle multiple ghost cells [5, 6]. ZFS supports complex moving boundaries which are sharply discretized using a cut-cell type immersed-boundary method [1, 2, 7].

Among other topics, scientists have used ZFS to study the flow within an internal combustion engine with moving pistons and valves, as Figure 1 shows. Figure 2 shows how the Lattice-Boltzmann solver in ZFS was used to better understand airflow within the human nasal cavity.
Continue reading

nsight_esclipse_logo

NVIDIA Nsight Eclipse Edition for Jetson TK1

NVIDIA® Nsight™ Eclipse Edition is a full-featured, integrated development environment that lets you easily develop CUDA® applications for either your local (x86) system or a remote (x86 or ARM) target. In this post, I will walk you through the process of remote-developing CUDA applications for the NVIDIA Jetson TK1, an ARM-based development kit.

Nsight supports two remote development modes: cross-compilation and “synchronize projects” mode. Cross-compiling for ARM on your x86 host system requires that all of the ARM libraries with which you will link your application be present on your host system. In synchronize-projects mode, on the other hand, your source code is synchronized between host and target systems and compiled and linked directly on the remote target, which has the advantage that all your libraries get resolved on the target system and need not be present on the host. Neither of these remote development modes requires an NVIDIA GPU to be present in your host system.

Note: CUDA cross-compilation tools for ARM are available only in the Ubuntu 12.04 DEB package of the CUDA 6 Toolkit.  If your host system is running a Linux distribution other than Ubuntu 12.04, I recommend the synchronize-projects remote development mode, which I will cover in detail in a later blog post.

CUDA toolkit setup

The first step involved in cross-compilation is installing the CUDA 6 Toolkit on your host system. To get started, let’s download the required Ubuntu 12.04 DEB package from the CUDA download page. Installation instructions can be found in the Getting Started Guide for Linux, but I will summarize them below for CUDA 6.
Continue reading

cuda_spotlight

CUDA Spotlight: GPU-Accelerated Deep Learning

Ren-Wu-BaiduOur Spotlight is on Dr. Ren Wu, a distinguished scientist at Baidu’s Institute of Deep Learning (IDL).

He is known for his pioneering research in using GPUs to accelerate big data analytics and his contribution to large-scale clustering algorithms via the GPU. Ren was a speaker at GTC14 and was originally featured as a CUDA Spotlight in 2011 when he worked at HP Labs.

[Editor's note: On May 16, Baidu announced the hiring of Dr. Andrew Ng to lead Baidu's Silicon Valley Research Lab.]

The following is an excerpt from our interview (read the complete Spotlight here).
______________________________________________________________________________

NVIDIA: Ren, why is GPU computing important to your work?
Ren: A key factor in the progress we are making with deep learning is that we now have much greater computing resources in our hands.

Today one or two workstations with a few GPUs has the same computing power as the fastest supercomputer in the world 15 years ago, thanks to GPU computing and NVIDIA’s vision.
Continue reading

dynpar_thumb

CUDA Dynamic Parallelism API and Principles

This post is the second in a series on CUDA Dynamic Parallelism. In my first post, I introduced Dynamic Parallelism by using it to compute images of the Mandelbrot set using recursive subdivision, resulting in large increases in performance and efficiency. This post is an in-depth tutorial on the ins and outs of programming with Dynamic Parallelism, including synchronization, streams, memory consistency, and limits. My next post will finish the series with a case study on an online track reconstruction algorithm for the high-energy physics PANDA experiment (Facility for Antiproton and Ion Research in Europe (FAIR)).

Grid Nesting and Synchronization

In the CUDA programming model, a group of blocks of threads that are running a kernel is called a grid. In CUDA Dynamic Parallelism, a parent grid launches kernels called child grids. A child grid inherits from the parent grid certain attributes and limits, such as the L1 cache / shared memory configuration and stack size. Note that every thread that encounters a kernel launch executes it. Therefore, if the parent grid has 128 blocks with 64 threads each, and there is no control flow around a child kernel launch, then the grid will perform a total of 8192 kernel launches. If you want a kernel to only launch one child grid per thread block, you should launch the kernel from a single thread of each block as in the following code.

if(threadIdx.x == 0) {
  child_k <<< (n + bs - 1) / bs, bs >>> ();
}

Continue reading