Customize CUDA Fortran Profiling with NVTX

The NVIDIA Tools Extension (NVTX) library lets developers annotate custom events and ranges within the profiling timelines generated by tools such as the NVIDIA Visual Profiler (NVVP) and Nsight. In my own optimization work, I rely heavily on NVTX to better understand internal as well as customer codes and to spot opportunities for better interaction between the CPU and the GPU.

Two previous Pro Tip posts on Parallel Forall showed how to use NVTX in CUDA C++ and MPI codes. In this post, I’ll show how to use NVTX to annotate the profiles of Fortran codes (with either CUDA Fortran or OpenACC).

NVTX has a lot of features, but here I’ll focus on using it to annotate the profiler output with timeline markers using nvtxRangePush() and nvtxRangePop(). I’ll show you how to insert markers with custom labels and colors. Continue reading
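The Fortran annotations ultimately call into the NVTX C API; as a point of reference, here is a minimal C++ sketch of that API showing the same push/pop pattern the post uses, plus a labeled, colored range via nvtxRangePushEx(). The range names and color value are arbitrary examples, not taken from the post.

#include <nvToolsExt.h>   // NVTX header; link with -lnvToolsExt

void annotated_step()
{
  // Simple named range: everything between push and pop appears as
  // one bar on the profiler timeline.
  nvtxRangePushA("compute step");
  // ... work to be timed ...
  nvtxRangePop();

  // Range with a custom label and ARGB color.
  nvtxEventAttributes_t attrib = {};
  attrib.version       = NVTX_VERSION;
  attrib.size          = NVTX_EVENT_ATTRIB_STRUCT_SIZE;
  attrib.colorType     = NVTX_COLOR_ARGB;
  attrib.color         = 0xFF00FF00;                // example: green
  attrib.messageType   = NVTX_MESSAGE_TYPE_ASCII;
  attrib.message.ascii = "halo exchange";           // example label
  nvtxRangePushEx(&attrib);
  // ... work to be timed ...
  nvtxRangePop();
}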

CUDA 7.5

Simple, Portable Parallel C++ with Hemi 2 and CUDA 7.5

The last two releases of CUDA have added support for the powerful new features of C++. In the post The Power of C++11 in CUDA 7 I discussed the importance of C++11 for parallel programming on GPUs, and in the post New Features in CUDA 7.5 I introduced a new experimental feature in the NVCC CUDA C++ compiler: support for GPU Lambda expressions. Lambda expressions, introduced in C++11, provide concise syntax for anonymous functions (and closures) that can be defined in line with their use, can be passed as arguments, and can capture variables from surrounding scopes. GPU Lambdas bring that power and convenience to writing GPU functions, letting you launch parallel work on the GPU almost as easily as writing a for loop.

In this post, I want to show you how modern C++ features combine to enable a higher-level, more portable approach to parallel programming for GPUs. To do so, I’ll show you Hemi 2, the second release of a simple open-source C++ library that I developed to explore approaches to portable parallel C++ programming. I have written before about Hemi on Parallel Forall, but Hemi 2 is easier to use, more portable, and more powerful.

Introducing Hemi 2

Hemi simplifies writing portable CUDA C/C++ code. With Hemi,

  • you can write parallel kernels like you write for loops—in line in your CPU code—and run them on your GPU;
  • you can launch C++ Lambda functions as GPU kernels;
  • you can easily write code that compiles and runs either on the CPU or GPU;
  • kernel launch configuration is automatic: details like thread block size and grid size are optimization details, rather than requirements.

With Hemi, parallel code for the GPU can be as simple as the parallel_for loop in the following code, which can also be compiled and run on the CPU.

void saxpy(int n, float a, const float *x, float *y)
{
  hemi::parallel_for(0, n, [=] HEMI_LAMBDA (int i) {
    y[i] = a * x[i] + y[i];
  });
}

Hemi is BSD-licensed, open-source software, available on GitHub. Continue reading


Combine OpenACC and Unified Memory for Productivity and Performance

The post Getting Started with OpenACC covered four steps to progressively accelerate your code with OpenACC. It’s often necessary to use OpenACC directives to express both loop parallelism and data locality in order to get good performance with accelerators. After expressing available parallelism, excessive data movement generated by the compiler can be a bottleneck, and correcting this by adding data directives takes effort. Sometimes expressing proper data locality is more effort than expressing parallelism with loop directives.

Wouldn’t it be nice if programs could manage data locality automatically? Well, this is possible today with Unified Memory (on Kepler and newer GPU architectures). In this post I demonstrate how to combine OpenACC with Unified Memory to GPU-accelerate your existing applications with minimal effort. You can download the source code for the example in this post from the Parallel Forall GitHub repository.

Jacobi Iteration with Heap Memory

I’ll use the popular Jacobi iteration example code, which is representative of many real-world stencil computations. In contrast to the previous OpenACC post, I modified the array allocation to use heap memory instead of automatic stack-allocated arrays. This is a more common scenario for real applications, since real-world data arrays are often too large for stack memory. The change also makes this a more challenging case for OpenACC, because the compiler no longer knows the size of the arrays. The following excerpt shows the main loop of the Jacobi iteration with 2D index computation. Continue reading
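The post’s actual excerpt follows the link above; purely as a stand-in, here is a minimal sketch of what such a loop typically looks like with OpenACC directives and heap-allocated arrays (the array names, sizes, boundary condition, and collapse clause are my own illustrative choices, not the post’s code). When the code is built with a managed-memory target (for example, the PGI option -ta=tesla:managed), the heap allocations are served from Unified Memory, so the loop directives alone suffice and no explicit data clauses are needed.

#include <stdlib.h>
#include <math.h>

// Illustrative sizes and limits (not the values used in the post).
#define NN 4096
#define MM 4096

void jacobi_managed(void)
{
  // Heap allocation, the case the post targets: the compiler cannot see
  // these sizes, so without Unified Memory it would need explicit data
  // directives to manage the transfers.
  double *A    = (double *)malloc(NN * MM * sizeof(double));
  double *Anew = (double *)malloc(NN * MM * sizeof(double));

  for (int k = 0; k < NN * MM; k++) { A[k] = 0.0; Anew[k] = 0.0; }
  for (int i = 0; i < MM; i++) A[i] = 1.0;   // simple boundary condition

  double error = 1.0;
  int iter = 0;
  while (error > 1.0e-6 && iter < 1000) {
    error = 0.0;

    // Stencil update with 2D index computation on the flat arrays.
    #pragma acc parallel loop collapse(2) reduction(max:error)
    for (int j = 1; j < NN - 1; j++) {
      for (int i = 1; i < MM - 1; i++) {
        Anew[j * MM + i] = 0.25 * (A[j * MM + (i + 1)] + A[j * MM + (i - 1)]
                                 + A[(j - 1) * MM + i] + A[(j + 1) * MM + i]);
        error = fmax(error, fabs(Anew[j * MM + i] - A[j * MM + i]));
      }
    }

    // Copy the updated interior back for the next sweep.
    #pragma acc parallel loop collapse(2)
    for (int j = 1; j < NN - 1; j++)
      for (int i = 1; i < MM - 1; i++)
        A[j * MM + i] = Anew[j * MM + i];

    iter++;
  }

  free(A);
  free(Anew);
}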


Increasing the Luminosity of Beam Dynamics with GPUs

What is dark matter? We can neither see it nor detect it with any instrument. CERN is upgrading the LHC (Large Hadron Collider), the world’s largest and most powerful particle accelerator, to explore the new high-energy frontier.

The most technically challenging aspects of the upgrade cannot be done by CERN alone; they require collaboration and external expertise. There are 7,000 scientists from over 60 countries working to extend the LHC’s discovery potential, and the accelerator will need a major upgrade around 2020 to increase its luminosity by a factor of 10 beyond the original design value.

Ph.D. student Adrian Oeftiger attends EPFL (École Polytechnique Fédérale de Lausanne) in Switzerland, which is one of the High Luminosity LHC beneficiaries. His research group is working to parallelize its algorithms to create software that will enable new kinds of beam dynamics studies that have not been possible with current technology.

Brad: How is your research related to the upgrade of the LHC?

Adrian: My world is all about luminosity; increasing the luminosity of particle beams. It is all about making ultra-high-energy collisions of protons possible, and at the same time providing enough collisions to enable fundamental particle physics research. That means increasing the luminosity. I’m doing my Ph.D. in beam dynamics in the field of accelerator physics.

These days, high-energy particle accelerators are the tools of choice to analyze and understand the fundamental building blocks of our universe. The huge detectors at the Large Hadron Collider (LHC) at CERN, buried about a hundred meters underground in the countryside near Geneva, need ever-increasing collision rates (hence luminosity!): they gather statistics of collision events to explore new realms of physics, to detect extremely rare interaction combinations and the tiniest quantities of new particles, and to find explanations for some of the numerous wonders of the universe we live in. What is the dark matter which makes up 27% of our universe made of? Why is the symmetry between anti-matter and ordinary matter broken, and why do we find only the latter in the universe?

CERN is preparing for the High Luminosity LHC, a powerful upgrade of the present accelerator to increase the chances to answer some of these fundamental questions. Increasing the chances translates to: we need more collisions, so we need higher luminosity. Continue reading

CUDA 7.5

CUDA 7.5: Pinpoint Performance Problems with Instruction-Level Profiling

[Note: Thejaswi Rao also contributed to the code optimizations shown in this post.]

Today NVIDIA released CUDA 7.5, the latest release of the powerful CUDA Toolkit. One of the most exciting new features in CUDA 7.5 is new Instruction-Level Profiling support in the NVIDIA Visual Profiler. This powerful new feature, available on Maxwell (GM200) and later GPUs, helps pinpoint performance bottlenecks, letting you quickly identify the specific lines of source code (and assembly instructions) limiting the performance of GPU code, along with the underlying reason for execution stalls.

In this post, I demonstrate Instruction-Level Profiling by showing how it helped me understand and address the performance limitations of a CUDA kernel that implements the Iterative Closest Point algorithm (the original source code, by Thomas Whelan, is available on GitHub). I’ll show how instruction-level profiling makes it easier to apply advanced optimizations, helping speed up the example kernel by 2.7X on an NVIDIA Quadro M6000 GPU.

Profiling the kernel using the Guided Analysis feature of the Visual Profiler showed that the kernel performance was bound by instruction and memory latency. Latency issues indicate that the hardware resources are not used efficiently since most warps are stalled by a dependency on a data value from a previous math or memory instruction. Figure 1 shows that the compute units are only 40% utilized and memory units are around 25% utilized, so there is definitely room for improvement.

Figure 1 Kernel Performance Limiter (Bound by instruction and memory latency).

Stall Analysis in Previous Profiler Versions

Before CUDA 7.5, the Visual Profiler was only capable of pointing out performance issues at the application or CUDA kernel level. For stall latency analysis, the CUDA 7.0 Visual Profiler produces the pie chart in Figure 2 by collecting various stall reason events for the entire kernel.

Figure 2 Legacy (CUDA 7.0) pie chart for stall reasons (generated using events collected at the kernel level).

This pie chart shows that the two primary stall reasons in this kernel are synchronization and memory dependencies. But if I look into the kernel code, there are lots of memory accesses and __syncthreads() calls, so this high-level analysis doesn’t provide any specific insight into which instructions are potential bottlenecks. In general it can be very difficult to find exact bottleneck causes in complex kernels using kernel-level profiling analysis. This is where CUDA 7.5 can help, as you’ll see. Continue reading
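To see why the aggregate view is ambiguous, consider a generic pattern such as the hypothetical block reduction sketched below (this is not the ICP kernel from the post): every global or shared-memory load and every __syncthreads() call is a candidate contributor to the memory-dependency and synchronization slices of the pie chart, and kernel-level counters cannot say which of them dominates.

// Hypothetical kernel, launched with 256 threads per block.
__global__ void blockSum(const float *in, float *out, int n)
{
  __shared__ float s[256];
  int tid = threadIdx.x;
  int i   = blockIdx.x * blockDim.x + tid;

  s[tid] = (i < n) ? in[i] : 0.0f;   // global load: memory-dependency stall site
  __syncthreads();                   // synchronization stall site

  // Tree reduction in shared memory.
  for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
    if (tid < stride)
      s[tid] += s[tid + stride];     // shared loads: more candidate stall sites
    __syncthreads();                 // one more candidate per iteration
  }

  if (tid == 0)
    out[blockIdx.x] = s[0];
}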


Mocha.jl: Deep Learning for Julia

Deep learning is becoming extremely popular due to several breakthroughs in various well-known tasks in artificial intelligence. For example, at the ImageNet Large Scale Visual Recognition Challenge, the introduction of deep learning algorithms into the challenge reduced the top-5 error by 10% in 2012. Deep learning models have dominated the challenge every year since then, significantly reducing the top-5 error rate each time (see Figure 1). In 2015, researchers trained very deep networks (for example, the Google “inception” model has 27 layers) that surpass human performance.

Figure 1: The top-5 error rate in the ImageNet Large Scale Visual Recognition Challenge has fallen rapidly since the introduction of deep neural networks in 2012.

Moreover, at this year’s Computer Vision and Pattern Recognition (CVPR) conference, deep neural networks (DNNs) were being adapted to increasingly more complicated tasks. For example, in semantic segmentation, instead of predicting a single category for a whole image, a DNN is trained to classify each pixel in the image, essentially producing a semantic map indicating every object and its shape and location in the given image (see Figure 2).

Figure 2: An example of generating a semantic map by classifying each pixel in a source image. Source: Shuai Zheng et al. Conditional Random Fields as Recurrent Neural Networks.

Continue reading


ATCOM: Real-Time Enhancement of Long-Range Imagery

Imaging over long distances is important for many defense and commercial applications. High-end ground-to-ground, air-to-ground, and ground-to-air systems now routinely image objects several kilometers to several dozen kilometers away; however, this increased range comes at a price. In many scenarios, the limiting factor becomes not the quality of your camera but the atmosphere through which the light travels to the camera. Dynamic changes in the atmospheric density between the camera and object impart time-variant distortions resulting in loss of contrast and detail in the collected imagery (see Figure 1 and Figure 2).

Figure 1: Wavefront distortion by the atmosphere causes the image captured to be blurred. The effect becomes more pronounced when imaging over long distances or through more turbulent atmospheric conditions.

Several approaches have been developed to combat this effect; they can be roughly divided into two categories: hardware-based and signal-processing approaches. The primary hardware technique is adaptive optics (AO), an approach favored by the astronomical community to observe stellar phenomena. AO techniques generally employ a deformable mirror to correct the incoming wavefront before it is captured on the sensor. While this can correct for atmospheric distortions, the equipment required is fragile and expensive, and is therefore not suitable for many applications. In contrast, signal processing techniques are limited only by the computational hardware they run on. In our case, we have leveraged the processing power of GPUs to achieve the performance necessary for real-time processing of high-definition video. Thanks to modern GPUs, we are now able to process live 720p video streams at over 30 fps, as the video below shows.

Continue reading


Harnessing the Caffe Framework for Deep Visualization

Jeff Clune, Lab Director, Evolving Intelligence Laboratory at The University of Wyoming.

The need to train their deep neural networks as fast as possible led the Evolving Artificial Intelligence Laboratory at the University of Wyoming to harness the power of NVIDIA Tesla GPUs starting in 2012 to accelerate their research.

“The speedups GPUs provide for training deep neural networks are well-documented and allow us to train models in a week that would otherwise take months,” said Jeff Clune, Assistant Professor, Computer Science Department and Director of the Evolving Artificial Intelligence Laboratory. “And algorithms continuously improve. Recently, NVIDIA’s cuDNN technology allowed us to speed up our training time by an extra 20% or so.”

Clune’s lab, which studies evolving artificial intelligence with a major focus on large-scale, structurally organized neural networks, has garnered press from some of the largest media outlets, including the BBC, National Geographic, NBC News, and The Atlantic, and was featured on the cover of Nature in May 2015.

[The following video shows off work from the Evolving AI Lab on visualizing deep neural networks. Keep reading to learn more about this work!]

For this Spotlight interview, I had the opportunity to talk with Jeff Clune and two of his collaborators: Anh Nguyen, a Ph.D. student at the Evolving AI Lab, and Jason Yosinski, a Ph.D. candidate at Cornell University.

Brad: How are you using deep neural networks (DNNs)?

We have many research projects involving deep neural networks. Our Deep Learning publications to date involve better understanding DNNs. Our lab’s research covers: Continue reading


High-Performance MATLAB with GPU Acceleration

In this post, I will discuss techniques you can use to maximize the performance of your GPU-accelerated MATLAB® code. First I explain how to write MATLAB code that is inherently parallelizable. This technique, known as vectorization, benefits all your code whether or not it uses the GPU. Then I present a family of function wrappers (bsxfun, pagefun, and arrayfun) that take advantage of GPU hardware, yet require no specialist parallel programming skills. The most advanced function, arrayfun, allows you to write your own custom kernels in the MATLAB language.

If these techniques do not provide the performance or flexibility you were after, you can still write custom CUDA code in C or C++ that you can run from MATLAB, as discussed in our earlier Parallel Forall posts on MATLAB CUDA Kernels and MEX functions.

All of the features described here are available out of the box with MATLAB and Parallel Computing Toolbox™.

Mobile phone signal strength example

Throughout this post, I will use an example to help illustrate the techniques. A cellular phone network wants to map its coverage to help plan for new antenna installations. We imagine an idealized scenario with M = 25 cellphone masts, each H = 20 meters in height, evenly spaced on an undulating 10km x 10km terrain. Figure 1 shows what the map looks like.

Figure 1: 3D plot of the map data and antenna positions

On the GPU, in the following listing we define a number of variables including:

  • map: An N x 3 height field in a 10km x 10km grid (N = 10,000);
  • masts: An M x 3 array of antenna positions, at height H;
  • AntennaDirection: A 3 x M array of vectors representing the orientation of each antenna.

Continue reading


NVIDIA and IBM Cloud Support ImageNet Large Scale Visual Recognition Challenge

This year’s ImageNet Large Scale Visual Recognition Challenge (ILSVRC) is about to begin. Every year, organizers from the University of North Carolina at Chapel Hill, Stanford University, and the University of Michigan host the ILSVRC, an object detection and image classification competition, to advance the fields of machine learning and pattern recognition. Competitors are given more than 1.2 million images from 1,000 different object categories, with the objective of developing the most accurate object detection and classification approach. After the data set is released, teams have roughly three months to create, refine and test their approaches. This year, the data will be released on August 14, and the final result submission deadline is November 13. Read on to learn how competing teams can get free access to the latest GPU-accelerated cloud servers from NVIDIA and IBM Cloud.

Figure 1. Examples of ImageNet images demonstrating classification with localization.

About the Challenge

Teams compete annually to develop the most accurate recognition systems, and each year the sub-tasks are more complex and challenging. Since none of the algorithms developed to date are able to classify all images correctly 100% of the time, classification accuracy is measured in terms of error rates. In this way, the winning submission will exhibit the lowest overall percentage of incorrectly classified images. When ILSVRC started in 2010, teams had a single assignment: classify the contents of an image (identify objects) and provide a list of the top five classifications with their respective probabilities. Since then, the complexity of assignments has grown as the organizers have added sub-tasks for classification with localization and with detection (identifying objects and specifying their locations with bounding boxes). Specifying bounding boxes is important because most pictures contain multiple objects. To illustrate the requirements of this type of task, Figure 1 shows two sample ImageNet training images annotated with their object categories and bounding boxes. Continue reading