
CUDA Pro Tip: nvprof is Your Handy Universal GPU Profiler

CUDA 5 added a powerful new tool to the CUDA Toolkit: nvprof. nvprof is a command-line profiler available for Linux, Windows, and OS X. At first glance, nvprof seems to be just a GUI-less version of the graphical profiling features available in the NVIDIA Visual Profiler and Nsight Eclipse Edition. But nvprof is much more than that; to me, nvprof is the lightweight profiler that reaches where other tools can't.

Use nvprof for Quick Checks

I often find myself wondering if my CUDA application is running as I expect it to. Sometimes this is just a sanity check: is the app running kernels on the GPU at all? Is it performing excessive memory copies? By running my application with nvprof ./myApp, I can quickly see a summary of all the kernels and memory copies that it used, as shown in the following sample output.

    ==9261== Profiling application: ./tHogbomCleanHemi
    ==9261== Profiling result:
    Time(%)      Time     Calls       Avg       Min       Max  Name
     58.73%  737.97ms      1000  737.97us  424.77us  1.1405ms  subtractPSFLoop_kernel(float const *, int, float*, int, int, int, int, int, int, int, float, float)
     38.39%  482.31ms      1001  481.83us  475.74us  492.16us  findPeakLoop_kernel(MaxCandidate*, float const *, int)
      1.87%  23.450ms         2  11.725ms  11.721ms  11.728ms  [CUDA memcpy HtoD]
      1.01%  12.715ms      1002  12.689us  2.1760us  10.502ms  [CUDA memcpy DtoH]

In its default summary mode, nvprof presents an overview of the GPU kernels and memory copies in your application. The summary groups all calls to the same kernel together, presenting the total time and percentage of the total application time for each kernel. In addition to summary mode, nvprof supports GPU-Trace and API-Trace modes that let you see a complete list of all kernel launches and memory copies, and in the case of API-Trace mode, all CUDA API calls.
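
For example, to see a chronological list of all CUDA API calls in addition to the kernel launches and memory copies, run the application in API-Trace mode (./myApp here stands in for your own executable):

nvprof --print-api-trace ./myApp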

Following is an example of profiling the nbody sample application running on two GPUs on my PC, using nvprof --print-gpu-trace. We can see on which GPU each kernel ran, as well as the grid dimensions used for each launch. This is very useful when you want to verify that a multi-GPU application is running as you expect.

nvprof --print-gpu-trace ./nbody --benchmark -numdevices=2 -i=1
...
==4125== Profiling application: ./nbody --benchmark -numdevices=2 -i=1
==4125== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput           Device   Context    Stream  Name
260.78ms     864ns                    -               -         -         -         -        4B  4.6296MB/s   Tesla K20c (0)         2         2  [CUDA memcpy HtoD]
260.79ms     960ns                    -               -         -         -         -        4B  4.1667MB/s  GeForce GTX 680         1         2  [CUDA memcpy HtoD]
260.93ms     896ns                    -               -         -         -         -        4B  4.4643MB/s   Tesla K20c (0)         2         2  [CUDA memcpy HtoD]
260.94ms     672ns                    -               -         -         -         -        4B  5.9524MB/s  GeForce GTX 680         1         2  [CUDA memcpy HtoD]
268.03ms  1.3120us                    -               -         -         -         -        8B  6.0976MB/s   Tesla K20c (0)         2         2  [CUDA memcpy HtoD]
268.04ms     928ns                    -               -         -         -         -        8B  8.6207MB/s  GeForce GTX 680         1         2  [CUDA memcpy HtoD]
268.19ms     864ns                    -               -         -         -         -        8B  9.2593MB/s   Tesla K20c (0)         2         2  [CUDA memcpy HtoD]
268.19ms     800ns                    -               -         -         -         -        8B  10.000MB/s  GeForce GTX 680         1         2  [CUDA memcpy HtoD]
274.59ms  2.2887ms             (52 1 1)       (256 1 1)        36        0B  4.0960KB         -           -   Tesla K20c (0)         2         2  void integrateBodies(vec4::Type*, vec4::Type*, vec4::Type*, unsigned int, unsigned int, float, float, int) [242]
274.67ms  981.47us             (32 1 1)       (256 1 1)        36        0B  4.0960KB         -           -  GeForce GTX 680         1         2  void integrateBodies(vec4::Type*, vec4::Type*, vec4::Type*, unsigned int, unsigned int, float, float, int) [257]
276.94ms  2.3146ms             (52 1 1)       (256 1 1)        36        0B  4.0960KB         -           -   Tesla K20c (0)         2         2  void integrateBodies(vec4::Type*, vec4::Type*, vec4::Type*, unsigned int, unsigned int, float, float, int) [275]
276.99ms  979.36us             (32 1 1)       (256 1 1)        36        0B  4.0960KB         -           -  GeForce GTX 680         1         2  void integrateBodies(vec4::Type*, vec4::Type*, vec4::Type*, unsigned int, unsigned int, float, float, int) [290]

Regs: Number of registers used per CUDA thread.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.

Use nvprof to Profile Anything

nvprof knows how to profile CUDA kernels running on NVIDIA GPUs, no matter what language they are written in (as long as they are launched using the CUDA runtime API or driver API). This means that I can use nvprof to profile OpenACC programs (which have no explicit kernels), or even programs that generate PTX assembly kernels internally. Mark Ebersole showed a great example of this in his recent CUDACast (Episode #10) about CUDA Python, in which he used the NumbaPro compiler (from Continuum Analytics) to Just-In-Time compile a Python function and run it in parallel on the GPU.

During the initial implementation of OpenACC or CUDA Python programs, it may not be obvious whether a function is running on the GPU or the CPU (especially if you aren't timing it). In Mark's example, he ran the Python interpreter inside nvprof, capturing a trace of the application's CUDA function calls and kernel launches. The trace showed that the kernel was indeed running on the GPU, as well as the cudaMemcpy calls used to transfer data from the CPU to the GPU. This is a great example of the "sanity check" ability of a lightweight command-line GPU profiler like nvprof.
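
To do the same kind of sanity check yourself, just launch the interpreter (or any other host process) under nvprof. For example, assuming a NumbaPro-accelerated script named monte_carlo.py (the script name is only a placeholder):

nvprof python monte_carlo.py

Any kernels that the JIT compiler launches through the CUDA APIs will show up in the profile, just as they do for a native CUDA application.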

Use nvprof for Remote Profiling

Sometimes the system that you are deploying on is not your desktop system; for example, you may use a GPU cluster or a cloud system such as Amazon EC2, where you only have terminal access to the machine. This is another great use for nvprof: simply connect to the remote machine (using ssh, for example) and run your application under nvprof.

By using the --output-profile command-line option, you can output a data file for later import into either nvprof or the NVIDIA Visual Profiler. This means that you can capture a profile on a remote machine, and then visualize and analyze the results on your desktop in the Visual Profiler (see “Remote Profiling” for more details).
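
A minimal recipe looks like this (the file and application names are only examples):

nvprof --output-profile remote-profile.nvprof ./myApp

Copy remote-profile.nvprof back to your workstation (with scp, for example) and import it into the Visual Profiler, or replay it locally with nvprof -i remote-profile.nvprof.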

nvprof provides a handy option (--analysis-metrics) to capture all of the GPU metrics that the Visual Profiler needs for its "guided analysis" mode. The screenshot below shows the Visual Profiler being used to determine the bottleneck of a kernel; the data for this analysis were captured using the command line below.

nvprof --analysis-metrics -o nbody-analysis.nvprof ./nbody --benchmark -numdevices=2 -i=1

A screenshot of the NVIDIA Visual Profiler (nvvp) analyzing data imported from the nvprof command-line profiler.

A Very Handy Tool

If you are a fan of command-line tools, I think you will love using nvprof.  There is a lot more that nvprof can do that I haven’t even touched on here, such as collecting profiling metrics for analysis in the NVIDIA Visual Profiler. Check out the nvprof documentation for full details.
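
As one example of what I mean, you can collect specific hardware metrics straight from the command line (the metric names below are just a couple of common ones; nvprof --query-metrics lists everything your GPU supports):

nvprof --metrics achieved_occupancy,ipc ./myApp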

I hope that after reading this post, you’ll find yourself using it every day, like a handy pocket knife that you carry with you.

∥∀

About Mark Harris

Mark is Chief Technologist for GPU Computing Software at NVIDIA. Mark has fifteen years of experience developing software for GPUs, ranging from graphics and games, to physically-based simulation, to parallel algorithms and high-performance computing. Mark has been using GPUs for general-purpose computing since before they even supported floating point arithmetic. While a Ph.D. student at UNC, he recognized this nascent trend, coined a name for it (GPGPU: General-Purpose computing on Graphics Processing Units), and started GPGPU.org to provide a forum for those working in the field to share and discuss their work. Follow @harrism on Twitter.
  • George

    Hello and thanks for the CUDA posts.

    I wanted to ask you: I used nvprof on the saxpy example (http://devblogs.nvidia.com/parallelforall/easy-introduction-cuda-c-and-c/) and it gives me 7 registers. When I use --ptxas-options=-v it gives me 3 registers.
    Can you explain that to me?

    Thank you!

    • http://www.markmark.net/ Mark Harris

      Hi George, thanks for your comment. I think I need more details to help. What exact command line are you using to compile? And what GPU / compute capability are you running on?

      • George

        Hello,

        I am running on 2.1 compute capability.

        I used the commands

        nvprof --print-gpu-trace ./run --benchmark -i=1

        and

        nvcc -o run test.cu --ptxas-options=-v

        • http://www.markmark.net/ Mark Harris

          You are using the default architecture, which is sm_10. On sm_10, the code uses 3 registers. But your binary also includes PTX, which is JITed at load time to sm_21 when you run on your CC 2.1 GPU. See this pro tip: http://devblogs.nvidia.com/parallelforall/cuda-pro-tip-understand-fat-binaries-jit-caching/

          sm_21 requires more registers for the same code (but also has a larger register file).

          When I run this:

          nvcc -arch=sm_21 -o run saxpy.cu --ptxas-options=-v

          I see this output:

          c:\src\test>nvcc -arch=sm_21 -o run saxpy.cu --ptxas-options=-v
          ptxas : info : 0 bytes gmem
          ptxas : info : Compiling entry function '_Z5saxpyifPfS_' for 'sm_21'
          ptxas : info : Function properties for _Z5saxpyifPfS_
          0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
          ptxas : info : Used 6 registers, 56 bytes cmem[0]
          Creating library run.lib and object run.exp

          So 6 registers. However, running in nvprof still shows 7 registers. I’m not sure about the cause of this discrepancy but I will file a bug! Thanks!

          • George

            Ok! Same output here.
            So, I must always use sm_21 (for 2.1 compute capability).
            So, --ptxas-options=-v and nvprof should always give the same results?

            Thank you!

          • http://www.markmark.net/ Mark Harris

            You don’t have to explicitly specify the arch version (sm_21), but if you want full control over what code is generated you might want to. I recommend you read my post linked above about fat binaries and JIT linking.
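
            For instance (just an illustrative sketch, not the only way to do it), a fat binary carrying sm_21 machine code plus PTX for forward compatibility could be built like this:

            nvcc -gencode arch=compute_20,code=sm_21 -gencode arch=compute_20,code=compute_20 -o run saxpy.cu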

            As I wrote I think the profiler *should* match the ptxas output, so I have filed an issue internally to figure that out.

          • George

            Ok, thank you!

          • http://www.markmark.net/ Mark Harris

            I got the answer. To support profiling (for example of concurrent kernels), the profiler has to patch kernel code with some additional instructions, sometimes consuming extra registers. So in this case it uses an extra register. You can verify this by running

            nvprof --print-gpu-trace --concurrent-kernels-off ./run

            This disables profiling of concurrent kernels (not needed for this app), and you will see the register count drop to 6.

          • George

            Ok! Thanks for the tip!

  • Wolf Lin

    Hello, I have one question about the CSV file.

    When I run nvprof --csv my.exe on Windows,

    I can't find my CSV file in the tmp folder.

    Can I set the path for my CSV file?

    How can I do it?