CUDA Pro Tip: nvprof is Your Handy Universal GPU Profiler

CUDA 5 added a powerful new tool to the CUDA Toolkit: nvprof. nvprof is a command-line profiler available for Linux, Windows, and OS X. At first glance, nvprof seems to be just a GUI-less version of the graphical profiling features available in the NVIDIA Visual Profiler and Nsight Eclipse Edition. But nvprof is much more than that; to me, nvprof is the lightweight profiler that reaches where other tools can’t.

Use nvprof for Quick Checks

I often find myself wondering if my CUDA application is running as I expect it to. Sometimes this is just a sanity check: is the app running kernels on the GPU at all? Is it performing excessive memory copies? By running my application with nvprof ./myApp, I can quickly see a summary of all the kernels and memory copies that it used, as shown in the following sample output.

    ==9261== Profiling application: ./tHogbomCleanHemi
    ==9261== Profiling result:
    Time(%)      Time     Calls       Avg       Min       Max  Name
     58.73%  737.97ms      1000  737.97us  424.77us  1.1405ms  subtractPSFLoop_kernel(float const *, int, float*, int, int, int, int, int, int, int, float, float)
     38.39%  482.31ms      1001  481.83us  475.74us  492.16us  findPeakLoop_kernel(MaxCandidate*, float const *, int)
      1.87%  23.450ms         2  11.725ms  11.721ms  11.728ms  [CUDA memcpy HtoD]
      1.01%  12.715ms      1002  12.689us  2.1760us  10.502ms  [CUDA memcpy DtoH]

In its default summary mode, nvprof presents an overview of the GPU kernels and memory copies in your application. The summary groups all calls to the same kernel together, presenting each kernel's total time and its percentage of the total profiled GPU time (kernels plus memory copies, which is why the Time(%) column adds up to 100%). In addition to summary mode, nvprof supports GPU-Trace and API-Trace modes that let you see a complete list of all kernel launches and memory copies, and in the case of API-Trace mode, all CUDA API calls.
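
If you want to act on the summary rather than just read it, the table is easy to post-process. The following is a minimal Python sketch, not part of nvprof, that parses the sample rows shown above and reports the fraction of profiled GPU time spent in memory copies. The function names (`to_seconds`, `parse_summary`, `memcpy_share`) are my own, and the parser assumes the column layout of this particular sample; other nvprof versions may format the summary differently.

```python
import re

# Header and rows copied from the sample summary output above.
SUMMARY = """\
Time(%)      Time     Calls       Avg       Min       Max  Name
 58.73%  737.97ms      1000  737.97us  424.77us  1.1405ms  subtractPSFLoop_kernel(float const *, int, float*, int, int, int, int, int, int, int, float, float)
 38.39%  482.31ms      1001  481.83us  475.74us  492.16us  findPeakLoop_kernel(MaxCandidate*, float const *, int)
  1.87%  23.450ms         2  11.725ms  11.721ms  11.728ms  [CUDA memcpy HtoD]
  1.01%  12.715ms      1002  12.689us  2.1760us  10.502ms  [CUDA memcpy DtoH]
"""

UNIT_SECONDS = {"ns": 1e-9, "us": 1e-6, "ms": 1e-3, "s": 1.0}


def to_seconds(text):
    """Convert a duration such as '737.97ms' or '2.1760us' to seconds."""
    match = re.fullmatch(r"([0-9.]+)(ns|us|ms|s)", text)
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    return float(match.group(1)) * UNIT_SECONDS[match.group(2)]


def parse_summary(text):
    """Return one dict per data row; header and '==pid==' lines are skipped."""
    rows = []
    for line in text.splitlines():
        # Six numeric columns, then the name (which may contain spaces).
        fields = line.split(None, 6)
        if len(fields) < 7 or not fields[0].endswith("%"):
            continue
        rows.append({
            "percent": float(fields[0].rstrip("%")),
            "total_s": to_seconds(fields[1]),
            "calls": int(fields[2]),
            "avg_s": to_seconds(fields[3]),
            "name": fields[6].strip(),
        })
    return rows


def memcpy_share(rows):
    """Fraction of the summed time that belongs to [CUDA memcpy ...] rows."""
    total = sum(row["total_s"] for row in rows)
    copies = sum(row["total_s"] for row in rows
                 if row["name"].startswith("[CUDA memcpy"))
    return copies / total


if __name__ == "__main__":
    rows = parse_summary(SUMMARY)
    print(f"{len(rows)} entries, memcpy share = {memcpy_share(rows):.2%}")
```

For this sample the copies account for just under 3% of the profiled time, which matches the sum of the two memcpy percentages in the table, so "excessive memory copies" is not the problem here.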

Following is an example of profiling the nbody sample application running on two GPUs on my PC, using nvprof --print-gpu-trace. We can see on which GPU each kernel ran, as well as the grid dimensions used for each launch. This is very useful when you want to verify that a multi-GPU application is running as you expect.

    nvprof --print-gpu-trace ./nbody --benchmark -numdevices=2 -i=1
    ...
    ==4125== Profiling application: ./nbody --benchmark -numdevices=2 -i=1
    ==4125== Profiling result:
       Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput           Device   Context    Stream  Name
    260.78ms     864ns                    -               -         -         -         -        4B  4.6296MB/s   Tesla K20c (0)         2         2  [CUDA memcpy HtoD]
    260.79ms     960ns                    -               -         -         -         -        4B  4.1667MB/s  GeForce GTX 680         1         2  [CUDA memcpy HtoD]
    260.93ms     896ns                    -               -         -         -         -        4B  4.4643MB/s   Tesla K20c (0)         2         2  [CUDA memcpy HtoD]
    260.94ms     672ns                    -               -         -         -         -        4B  5.9524MB/s  GeForce GTX 680         1         2  [CUDA memcpy HtoD]
    268.03ms  1.3120us                    -               -         -         -         -        8B  6.0976MB/s   Tesla K20c (0)         2         2  [CUDA memcpy HtoD]
    268.04ms     928ns                    -               -         -         -         -        8B  8.6207MB/s  GeForce GTX 680         1         2  [CUDA memcpy HtoD]
    268.19ms     864ns                    -               -         -         -         -        8B  9.2593MB/s   Tesla K20c (0)         2         2  [CUDA memcpy HtoD]
    268.19ms     800ns                    -               -         -         -         -        8B  10.000MB/s  GeForce GTX 680         1         2  [CUDA memcpy HtoD]
    274.59ms  2.2887ms             (52 1 1)       (256 1 1)        36        0B  4.0960KB         -           -   Tesla K20c (0)         2         2  void integrateBodies(vec4::Type*, vec4::Type*, vec4::Type*, unsigned int, unsigned int, float, float, int) [242]
    274.67ms  981.47us             (32 1 1)       (256 1 1)        36        0B  4.0960KB         -           -  GeForce GTX 680         1         2  void integrateBodies(vec4::Type*, vec4::Type*, vec4::Type*, unsigned int, unsigned int, float, float, int) [257]
    276.94ms  2.3146ms             (52 1 1)       (256 1 1)        36        0B  4.0960KB         -           -   Tesla K20c (0)         2         2  void integrateBodies(vec4::Type*, vec4::Type*, vec4::Type*, unsigned int, unsigned int, float, float, int) [275]
    276.99ms  979.36us             (32 1 1)       (256 1 1)        36        0B  4.0960KB         -           -  GeForce GTX 680         1         2  void integrateBodies(vec4::Type*, vec4::Type*, vec4::Type*, unsigned int, unsigned int, float, float, int) [290]

Regs: Number of registers used per CUDA thread.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
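
The Size and Throughput columns are filled in only for memory copies, and the sample rows are consistent with Throughput being simply Size divided by Duration, in decimal megabytes per second. That reading is my inference from the numbers above rather than something the output states, so treat the small Python check below as a sketch. The helper name `throughput_mb_per_s` is made up for illustration.

```python
def throughput_mb_per_s(size_bytes, duration_ns):
    """Bytes per second, expressed in units of 10**6 bytes (MB/s)."""
    return size_bytes / (duration_ns * 1e-9) / 1e6


# (size in bytes, duration in ns, Throughput value printed by nvprof),
# taken from four memcpy rows of the trace above.
ROWS = [
    (4, 864, 4.6296),
    (4, 960, 4.1667),
    (8, 1312, 6.0976),
    (8, 800, 10.000),
]

for size, duration, reported in ROWS:
    computed = throughput_mb_per_s(size, duration)
    print(f"{size}B in {duration}ns: computed {computed:.4f} MB/s, "
          f"nvprof reported {reported:.4f} MB/s")
```

The takeaway is that copies of a few bytes each take roughly a microsecond regardless of size, so their throughput figure says more about per-copy overhead than about bus bandwidth.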

Use nvprof to Profile Anything

nvprof knows how to profile CUDA kernels running on NVIDIA GPUs, no matter what language they are written in (as long as they are launched using the CUDA runtime API or driver API). This means that I can use nvprof to profile OpenACC programs (which have no explicit kernels), or even programs that generate PTX assembly kernels internally. Mark Ebersole showed a great example of this in his recent CUDACast (Episode #10) about CUDA Python, in which he used the NumbaPro compiler (from Continuum Analytics) to Just-In-Time compile a Python function and run it in parallel on the GPU.

During initial implementation of OpenACC or CUDA Python programs, it may not be obvious whether a function is running on the GPU or the CPU (especially if you aren’t timing it). In Mark’s example, he ran the Python interpreter inside of nvprof, capturing a trace of the application’s CUDA function calls and kernel launches. The trace showed that the kernel was indeed running on the GPU, and it also showed the cudaMemcpy calls used to transfer data from the CPU to the GPU. This is a great example of the “sanity check” ability of a lightweight command-line GPU profiler like nvprof.
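
Running an interpreter under nvprof works like running any other executable: nvprof comes first, followed by the interpreter and its arguments. Here is a minimal Python sketch of that idea. The script name `my_gpu_script.py` and both function names are hypothetical, and the sketch assumes that nvprof is on the PATH and that it writes its report to stderr, which is what I have seen by default.

```python
import shutil
import subprocess
import sys


def nvprof_trace_command(script, script_args=()):
    """Build an nvprof command line that traces a Python script.

    --print-api-trace lists the CUDA API calls and --print-gpu-trace lists
    the kernel launches and memory copies.
    """
    return ["nvprof", "--print-gpu-trace", "--print-api-trace",
            sys.executable, script, *script_args]


def run_under_nvprof(script, script_args=()):
    """Run the script under nvprof; return None if nvprof is not installed."""
    if shutil.which("nvprof") is None:
        return None
    return subprocess.run(nvprof_trace_command(script, script_args),
                          capture_output=True, text=True)


if __name__ == "__main__":
    # 'my_gpu_script.py' is a placeholder for your own CUDA Python program.
    result = run_under_nvprof("my_gpu_script.py")
    if result is None:
        print("nvprof not found on PATH")
    else:
        # Assumption: the profiling report is on stderr, not stdout.
        print(result.stderr)
```

If the trace contains no kernel launches at all, the function you expected to run on the GPU most likely ran on the CPU instead.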

Use nvprof for Remote Profiling

Sometimes the system that you are deploying on is not your desktop system. For example, you might use a GPU cluster or a cloud system such as Amazon EC2 where you only have terminal access to the machine. This is another great use for nvprof. Simply connect to the remote machine (using ssh, for example), and run your application under nvprof.

By using the --output-profile command-line option, you can output a data file for later import into either nvprof or the NVIDIA Visual Profiler. This means that you can capture a profile on a remote machine, and then visualize and analyze the results on your desktop in the Visual Profiler (see “Remote Profiling” for more details).

nvprof provides a handy option (--analysis-metrics) to capture all of the GPU metrics that the Visual Profiler needs for its “guided analysis” mode. The screenshot below shows the Visual Profiler being used to determine the bottleneck of a kernel. The data for this analysis were captured using the command line below.

    nvprof --analysis-metrics -o nbody-analysis.nvprof ./nbody --benchmark -numdevices=2 -i=1

Figure: A screenshot of the NVIDIA Visual Profiler (nvvp) analyzing data imported from the nvprof command-line profiler.
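
The capture-remotely, analyze-locally workflow boils down to two commands: one ssh invocation that runs nvprof with -o on the remote machine, and one scp that copies the resulting file back. The Python sketch below only builds those two command lines. The host name `gpu-node` and the function name are hypothetical, and it assumes that ssh and scp are set up for that host and that the application path is valid relative to your remote home directory.

```python
import shlex


def remote_profile_commands(host, app, profile="app.nvprof", analysis=False):
    """Return (capture, fetch) argument lists for a remote nvprof run.

    capture: run the application under nvprof on `host`, writing `profile`.
    fetch:   copy `profile` from the remote home directory to the local one.
    """
    nvprof = ["nvprof"]
    if analysis:
        # Collect the metrics used by the Visual Profiler's guided analysis.
        nvprof.append("--analysis-metrics")
    nvprof += ["-o", profile, *app]
    # ssh passes a single string to the remote shell, so quote each argument.
    remote = " ".join(shlex.quote(arg) for arg in nvprof)
    capture = ["ssh", host, remote]
    fetch = ["scp", f"{host}:{profile}", "."]
    return capture, fetch


if __name__ == "__main__":
    capture, fetch = remote_profile_commands(
        "gpu-node",  # hypothetical host name
        ["./nbody", "--benchmark", "-numdevices=2", "-i=1"],
        profile="nbody-analysis.nvprof",
        analysis=True,
    )
    print(" ".join(capture))
    print(" ".join(fetch))
    # To actually run them: subprocess.run(capture, check=True), then fetch.
```

Once the file is on your desktop, you can import it into the Visual Profiler as described above.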

A Very Handy Tool

If you are a fan of command-line tools, I think you will love using nvprof. There is a lot more that nvprof can do that I haven’t even touched on here, such as collecting profiling metrics for analysis in the NVIDIA Visual Profiler. Check out the nvprof documentation for full details.

I hope that after reading this post, you’ll find yourself using nvprof every day, like a handy pocket knife that you carry with you.
