
CUDA 7.5: Pinpoint Performance Problems with Instruction-Level Profiling

[Note: Thejaswi Rao also contributed to the code optimizations shown in this post.]

Today NVIDIA released CUDA 7.5, the latest release of the powerful CUDA Toolkit. One of the most exciting new features in CUDA 7.5 is new Instruction-Level Profiling support in the NVIDIA Visual Profiler. This powerful new feature, available on Maxwell (GM200) and later GPUs, helps pinpoint performance bottlenecks, letting you quickly identify the specific lines of source code (and assembly instructions) limiting the performance of GPU code, along with the underlying reason for execution stalls.

In this post, I demonstrate Instruction-Level Profiling by showing how it helped me understand and address the performance limitations of a CUDA kernel that implements the Iterative Closest Point algorithm (the original source code, by Thomas Whelan, is available on GitHub). I’ll show how instruction-level profiling makes it easier to apply advanced optimizations, helping speed up the example kernel by 2.7X on an NVIDIA Quadro M6000 GPU.

Profiling the kernel using the Guided Analysis feature of the Visual Profiler showed that the kernel performance was bound by instruction and memory latency. Latency issues indicate that the hardware resources are not used efficiently since most warps are stalled by a dependency on a data value from a previous math or memory instruction. Figure 1 shows that the compute units are only 40% utilized and memory units are around 25% utilized, so there is definitely room for improvement.

Figure 1: Kernel Performance Limiter (bound by instruction and memory latency).

Stall Analysis in Previous Profiler Versions

Before CUDA 7.5, the Visual Profiler was only capable of pointing out performance issues at the application or CUDA kernel level. For stall latency analysis, the CUDA 7.0 Visual Profiler produces the pie chart in Figure 2 by collecting various stall reason events for the entire kernel.

Figure 2: Legacy (CUDA 7.0) pie chart for stall reasons (generated using events collected at the kernel level).

This pie chart shows that the two primary stall reasons in this kernel are synchronization and memory dependencies. But the kernel code contains many memory accesses and __syncthreads() calls, so this high-level analysis doesn’t provide any specific insight into which instructions are the actual bottlenecks. In general, it can be very difficult to find the exact causes of bottlenecks in complex kernels using kernel-level profiling analysis. This is where CUDA 7.5 can help, as you’ll see.
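To make the distinction concrete, here is a small illustrative CUDA fragment (a hypothetical sketch, not code from the ICP kernel) showing the two stall classes the kernel-level pie chart lumps together: a global load whose result is consumed immediately stalls the warp on a memory dependency, while a __syncthreads() after a shared-memory write stalls it on synchronization.

// Hypothetical sketch: assumes blocks of exactly 256 threads and an input
// size that is a multiple of 256.
__global__ void stallExample(const float *in, float *out)
{
    __shared__ float tile[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    float v = in[i];        // global load...
    float w = v * 2.0f;     // ...consumed immediately: the warp stalls on a memory dependency

    tile[threadIdx.x] = w;  // shared-memory write
    __syncthreads();        // every warp in the block must arrive: a synchronization stall

    out[i] = tile[(threadIdx.x + 1) % 256];  // read a neighbor's result from shared memory
}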


GPU Pro Tip: Track MPI Calls In The NVIDIA Visual Profiler

Often when profiling GPU-accelerated applications that run on clusters, one needs to visualize MPI (Message Passing Interface) calls on the GPU timeline in the profiler. While tools like Vampir and Tau will allow programmers to see a big picture view of how a parallel application performs, sometimes all you need is a look at how MPI is affecting GPU performance on a single node using a simple tool like the NVIDIA Visual Profiler. With the help of the NVIDIA Tools Extensions (NVTX) and the MPI standard itself, this is pretty easy to do.

The NVTX API lets you embed information within a GPU profile, such as marking events or annotating ranges in the timeline with details about application behavior during that time. Jiri Kraus wrote past posts about generating custom application timelines with NVTX, and about using it to label individual MPI ranks in MPI profiles. In this post I’ll show you how to use an NVTX range to annotate the time spent in MPI calls. To do this, we’ll use the MPI profiling interface (PMPI), which is a standard part of MPI. PMPI allows tools to intercept calls to the MPI library to perform actions before or after the MPI call is executed. This means that we can insert NVTX calls into our MPI library calls to mark MPI calls on the GPU timeline.
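To illustrate the interception mechanism, here is a hypothetical hand-written wrapper (a sketch of the idea, not the generated code used below): by defining our own MPI_Send, the wrapper is linked in place of the library’s entry point, opens an NVTX range, and forwards the call to the real implementation through PMPI_Send.

#include <mpi.h>
#include <nvToolsExt.h>

// Our MPI_Send intercepts the application's call, brackets it with an NVTX
// range, and forwards to the real MPI library via PMPI_Send.
// (Older MPI versions declare buf as void* rather than const void*.)
int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)
{
    nvtxRangePushA("MPI_Send");                                  // range shows up on the profiler timeline
    int ret = PMPI_Send(buf, count, datatype, dest, tag, comm);  // the actual MPI call
    nvtxRangePop();                                              // close the range as the call returns
    return ret;
}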

Wrapping every MPI routine in this way is a bit tedious, but fortunately there’s a tool to automate the process. We’ll use a wrapper-generator script to produce the PMPI wrappers for a number of commonly used MPI routines. The input file for this script is the following (also available as a GitHub gist):

#include <pthread.h>
#include <nvToolsExt.h>
#include <nvToolsExtCudaRt.h>
// Setup event category name
{{fn name MPI_Init}}
  nvtxNameCategoryA(999, "MPI");
  int rank;
  PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
  char name[256];
  sprintf( name, "MPI Rank %d", rank );
  nvtxNameOsThread(pthread_self(), name);
  nvtxNameCudaDeviceA(rank, name);
  {{callfn}}
{{endfn}}
// Wrap select MPI functions with NVTX ranges
{{fn name MPI_Send MPI_Recv MPI_Allreduce MPI_Reduce MPI_Wait MPI_Waitany
MPI_Waitall MPI_Waitsome MPI_Gather MPI_Gatherv MPI_Scatter MPI_Scatterv
MPI_Allgather MPI_Allgatherv MPI_Alltoall MPI_Alltoallv MPI_Alltoallw MPI_Bcast
MPI_Sendrecv MPI_Barrier MPI_Start MPI_Test MPI_Send_init MPI_Recv_init }}
  nvtxEventAttributes_t eventAttrib = {0};
  eventAttrib.version = NVTX_VERSION;
  eventAttrib.size = NVTX_EVENT_ATTRIB_STRUCT_SIZE;
  eventAttrib.messageType = NVTX_MESSAGE_TYPE_ASCII;
  eventAttrib.message.ascii = "{{name}}";
  eventAttrib.category = 999;
  nvtxRangePushEx(&eventAttrib);
  {{callfn}}
  nvtxRangePop();
{{endfn}}

So what’s happening in this file? First, it includes the NVTX header files. Then, for each of a series of common MPI functions, it inserts the beginning of an NVTX range on entry (nvtxRangePushEx) and ends the range as we leave the MPI routine (nvtxRangePop). For convenience, I’ve named each range after the MPI routine being called. All I need to do now is run the script to generate a C file with my PMPI wrappers, which I’ll then build with my MPI C compiler.

$ python wrap/ -g -o nvtx_pmpi.c nvtx.w
$ mpicc -c nvtx_pmpi.c

Now I just need to rerun my code with these wrappers. To do this I’ll relink my application with the object file I just built and the NVTX library (libnvToolsExt). As an example, I’ll use the simple Jacobi Iteration used in the GTC session Multi GPU Programming with MPI, which you can find on Github. Once I’ve built both the application and the wrappers generated above, I run the executable as follows.

$ mpicc -fast -ta=tesla -Minfo=all $HOME/nvtx_pmpi.o laplace2d.c -L$CUDA_HOME/lib64 -lnvToolsExt -o laplace2d
$ MV2_USE_CUDA=1 mpirun -np 2 nvprof -o laplace2d.%q{MV2_COMM_WORLD_RANK}.nvvp ./laplace2d

One word of caution: the linking order does matter when using tools such as PMPI, so if you run your code and are not seeing the expected results, the object file containing the wrappers may not appear early enough in the build command.

In the above commands I’m rebuilding my code with the necessary pieces. I’m also setting MV2_USE_CUDA at runtime to enable CUDA-awareness in my MVAPICH2 library. Additionally, I’m telling nvprof to generate a timeline file per MPI process by including the MV2_COMM_WORLD_RANK environment variable, which is set to the MPI rank of each process, in the output file name. Figure 1 is the result of importing one of the resulting nvprof output files into the Visual Profiler and zooming in to an area of interest.

Figure 1: NVIDIA Visual Profiler with MPI ranges.

Looking in the “Markers and Ranges” row of the GPU timeline for MPI Rank 0, we see three green boxes denoting two calls to MPI_Sendrecv and one to MPI_Allreduce. Furthermore, we can see that the MPI library is using a device-to-device memcpy operation to communicate between two GPUs on the same node. As you can see, the NVIDIA Visual Profiler, combined with PMPI and NVTX can give you interesting insights into how the MPI calls in your application interact with the GPU.


12 Things You Should Know about the Tesla Accelerated Computing Platform

You may already know NVIDIA Tesla as a line of GPU accelerator boards optimized for high-performance, general-purpose computing. They are used for parallel scientific, engineering, and technical computing, and they are designed for deployment in supercomputers, clusters, and workstations. But it’s not just the GPU boards that make Tesla a great computing solution. The combination of the world’s fastest GPU accelerators, the widely used CUDA parallel computing model, and a comprehensive ecosystem of software developers, software vendors, and data center system OEMs make Tesla the leading platform for accelerating data analytics and scientific computing.

The Tesla Accelerated Computing Platform provides advanced system management features and accelerated communication technology, and it is supported by popular infrastructure management software. These enable HPC professionals to easily deploy and manage Tesla accelerators in the data center. Tesla-accelerated applications are powered by CUDA, NVIDIA’s pervasive parallel computing platform and programming model, which provides application developers with a comprehensive suite of tools for productive, high-performance software development.

This post gives an overview of the broad range of technologies, tools, and components of the Tesla Accelerated Computing Platform that are available to application developers. Here’s what you need to know about the Tesla Platform.


Remote application development using NVIDIA® Nsight™ Eclipse Edition

NVIDIA® Nsight™ Eclipse Edition (NSEE) is a full-featured unified CPU+GPU integrated development environment (IDE) that lets you easily develop CUDA applications for either your local (x86_64) system or a remote (x86_64 or ARM) target system. In my last post on remote development of CUDA applications, I covered NSEE’s cross-compilation mode. In this post I will focus on using NSEE’s synchronized-project mode.

For remote development of CUDA applications using synchronized-project mode, you can edit code on the host system and synchronize it with the target system. In this scenario, the code is compiled natively on the target system as Figure 1 shows.

Figure 1: CUDA application development usage scenarios with Nsight Eclipse Edition.

In synchronized-project mode the host system does not need an ARM cross-compilation tool chain, so you have the flexibility to use Mac OS X or any of the CUDA-supported x86_64 Linux platforms as the host system. The remote target system can be a CUDA-supported x86_64 Linux target or an ARM-based platform like the Jetson TK1. I am using Mac OS X 10.8.5 on my host system (with Xcode 5.1.1 installed) and 64-bit Ubuntu 12.04 on my target system.


CUDA Pro Tip: Profiling MPI Applications

When I profile MPI+CUDA applications, sometimes performance issues only occur for certain MPI ranks. To fix these, it’s necessary to identify the MPI rank where the performance issue occurs. Before CUDA 6.5 this was hard to do because the CUDA profiler only shows the PID of each process, leaving the developer to figure out the mapping from PIDs to MPI ranks. Although the mapping can be done manually, for example via the OpenMPI command-line option --display-map, it’s tedious and error prone. A solution for the command-line output of nvprof is described here. In this post I describe how the new output file naming of nvprof, introduced with CUDA 6.5, can be used to conveniently analyze the performance of an MPI+CUDA application with nvprof and the NVIDIA Visual Profiler (nvvp).

Profiling MPI applications with nvprof and nvvp

Collecting data with nvprof

nvprof supports dumping the profile to a file which can later be imported into nvvp. To generate a profile for an MPI+CUDA application, I simply start nvprof with the MPI launcher. Up to CUDA 6 I used the string “%p” in the output file name, which nvprof automatically replaces with the PID, generating a separate file for each MPI rank. With CUDA 6.5, the string “%q{ENV}” can be used to name the output file of nvprof. This allows us to include the MPI rank in the output file name by using environment variables automatically set by the MPI launcher (mpirun or mpiexec). For example, OpenMPI sets OMPI_COMM_WORLD_RANK to the MPI rank of each launched process.

$ mpirun -np 2 nvprof -o simpleMPI.%q{OMPI_COMM_WORLD_RANK}.nvprof ./simpleMPI
Running on 2 nodes
==18811== NVPROF is profiling process 18811, command: ./simpleMPI
==18813== NVPROF is profiling process 18813, command: ./simpleMPI
Average of square roots is: 0.667279
==18813== Generated result file: simpleMPI.1.nvprof
==18811== Generated result file: simpleMPI.0.nvprof



NVIDIA Nsight Eclipse Edition for Jetson TK1

NVIDIA® Nsight™ Eclipse Edition is a full-featured, integrated development environment that lets you easily develop CUDA® applications for either your local (x86) system or a remote (x86 or ARM) target. In this post, I will walk you through the process of remote-developing CUDA applications for the NVIDIA Jetson TK1, an ARM-based development kit.

Nsight supports two remote development modes: cross-compilation and “synchronize projects” mode. Cross-compiling for ARM on your x86 host system requires that all of the ARM libraries with which you will link your application be present on your host system. In synchronize-projects mode, on the other hand, your source code is synchronized between host and target systems and compiled and linked directly on the remote target, which has the advantage that all your libraries get resolved on the target system and need not be present on the host. Neither of these remote development modes requires an NVIDIA GPU to be present in your host system.

Note: CUDA cross-compilation tools for ARM are available only in the Ubuntu 12.04 DEB package of the CUDA 6 Toolkit.  If your host system is running a Linux distribution other than Ubuntu 12.04, I recommend the synchronize-projects remote development mode, which I will cover in detail in a later blog post.

CUDA toolkit setup

The first step involved in cross-compilation is installing the CUDA 6 Toolkit on your host system. To get started, let’s download the required Ubuntu 12.04 DEB package from the CUDA download page. Installation instructions can be found in the Getting Started Guide for Linux, but I will summarize them below for CUDA 6.


CUDACasts Episode 19: CUDA 6 Guided Performance Analysis with the Visual Profiler

One of the main reasons for accelerating code on an NVIDIA GPU is for an increase in application performance. This is why it’s important to use the best tools available to help you get the performance you’re looking for. CUDA 6 includes great improvements to the guided analysis tool in the NVIDIA Visual Profiler. Watch today’s CUDACast to see how to use guided analysis to locate potential optimizations for your GPU code.

You can find the code used in this video in the CUDACasts GitHub repository.



5 Powerful New Features in CUDA 6

Today I’m excited to announce the release of CUDA 6, a new version of the CUDA Toolkit that includes some of the most significant new functionality in the history of CUDA. In this brief post I will share with you the most important new features in CUDA 6 and tell you where to get more information. You may also want to watch the recording of my talk “CUDA 6 and Beyond” from last month’s GPU Technology Conference, embedded below.

Without further ado, if you are ready to download the CUDA Toolkit version 6.0 now, by all means, go get it on CUDA Zone. The five most important new features of CUDA 6 are

  • support for Unified Memory (see the short sketch after this list);
  • CUDA on Tegra K1 mobile/embedded system-on-a-chip;
  • XT and Drop-In library interfaces;
  • remote development in Nsight Eclipse Edition;
  • many improvements to the CUDA developer tools.
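To give a flavor of the first item, here is a minimal Unified Memory sketch (my own illustrative example, not taken from the post): a single cudaMallocManaged allocation is accessed from both the CPU and the GPU with no explicit cudaMemcpy calls.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main()
{
    const int n = 1024;
    float *x;
    cudaMallocManaged(&x, n * sizeof(float));   // one allocation visible to CPU and GPU
    for (int i = 0; i < n; ++i) x[i] = 1.0f;    // initialize directly from the CPU
    scale<<<(n + 255) / 256, 256>>>(x, n);      // the kernel uses the very same pointer
    cudaDeviceSynchronize();                    // wait for the GPU before reading on the CPU
    printf("x[0] = %f\n", x[0]);                // prints 2.000000
    cudaFree(x);
    return 0;
}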



CUDACasts Episode 14: Racecheck Analysis with CUDA 5.5

The key to the power of GPUs is their thousands of parallel processors that execute threads. Anyone who has worked with even a handful of threads knows how easy it can be to introduce race conditions, and how difficult it can be to debug and fix these errors. Because a modern GPU can have thousands of simultaneously executing threads, NVIDIA engineers felt it was imperative to create an incredibly powerful tool for detecting and debugging race conditions.

This racecheck tool comes as part of the cuda-memcheck command-line utility. In CUDA 5.5, a new racecheck analysis mode presents much more human-readable analysis of your code, even reporting which source lines conflict with other lines. In this episode of CUDACasts we use a simple version of Conway’s Game of Life to show the new racecheck features of cuda-memcheck. We’ll start with a few race condition bugs, and then use the analysis tool to find and fix them.
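To give a feel for what racecheck reports, here is a small hypothetical kernel (not the Game of Life code from the episode) containing the kind of shared-memory hazard the tool flags. Running the binary under cuda-memcheck --tool racecheck --racecheck-report analysis reports the conflicting source lines.

// Hypothetical sketch: assumes a single block of 256 threads. Each thread
// writes its own shared-memory slot and then reads its neighbor's slot.
// Without a __syncthreads() between the write and the read, the read can
// race with another thread's write -- the hazard racecheck detects.
__global__ void shiftLeft(int *data)
{
    __shared__ int tile[256];
    int t = threadIdx.x;

    tile[t] = data[t];
    // Missing __syncthreads() here creates a read-after-write hazard.
    if (t < 255)
        data[t] = tile[t + 1];
}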



CUDACasts Episode 13: Clock, Power, and Thermal Profiling with Nsight Eclipse Edition

In the world of high-performance computing, it is important to understand how your code affects the operating characteristics of your hardware. For example, if your program executes inefficient code, it may cause the GPU to work harder than it needs to, leading to higher power consumption and a potential slowdown due to throttling.

A new profiling feature in CUDA 5.5 allows you to profile the clock, power, and thermal characteristics of the GPU as it executes your code. This feature is available in the NVIDIA Visual Profiler on Linux and 64-bit Windows 7/8, and in Nsight Eclipse Edition on Linux. Learn how to activate and use this feature by watching CUDACasts Episode 13.
