5 Powerful New Features in CUDA 6

Today I’m excited to announce the release of CUDA 6, a new version of the CUDA Toolkit that includes some of the most significant new functionality in the history of CUDA. In this brief post I will share the most important new features in CUDA 6 and tell you where to get more information. You may also want to watch the recording of my talk “CUDA 6 and Beyond” from last month’s GPU Technology Conference.

Without further ado, if you are ready to download the CUDA Toolkit version 6.0 now, by all means, go get it on CUDA Zone. The five most important new features of CUDA 6 are

  • support for Unified Memory;
  • CUDA on Tegra K1 mobile/embedded system-on-a-chip;
  • XT and Drop-In library interfaces;
  • remote development in Nsight Eclipse Edition;
  • many improvements to the CUDA developer tools.

Continue reading

CUDACasts Episode 13: Clock, Power, and Thermal Profiling with Nsight Eclipse Edition

In the world of high-performance computing, it is important to understand how your code affects the operating characteristics of your hardware. For example, if your program executes inefficient code, it may cause the GPU to work harder than necessary, leading to higher power consumption and potential slow-downs due to thermal throttling.

A new profiling feature in CUDA 5.5 allows you to profile the clock, power, and thermal characteristics of the GPU as it executes your code. This feature is available in the NVIDIA Visual Profiler on Linux and 64-bit Windows 7/8, and in Nsight Eclipse Edition on Linux. Learn how to activate and use this feature by watching CUDACasts Episode 13.

Continue reading

CUDA Pro Tip: nvprof is Your Handy Universal GPU Profiler

CUDA 5 added a powerful new tool to the CUDA Toolkit: nvprof, a command-line profiler available for Linux, Windows, and OS X. At first glance, nvprof seems to be just a GUI-less version of the graphical profiling features available in the NVIDIA Visual Profiler and Nsight Eclipse Edition. But nvprof is much more than that; to me, nvprof is the lightweight profiler that reaches where other tools can’t.

Use nvprof for Quick Checks

I often find myself wondering if my CUDA application is running as I expect it to. Sometimes this is just a sanity check: is the app running kernels on the GPU at all? Is it performing excessive memory copies? By running my application with nvprof ./myApp, I can quickly see a summary of all the kernels and memory copies that it used, as shown in the following sample output.

    ==9261== Profiling application: ./tHogbomCleanHemi
    ==9261== Profiling result:
    Time(%)      Time     Calls       Avg       Min       Max  Name
     58.73%  737.97ms      1000  737.97us  424.77us  1.1405ms  subtractPSFLoop_kernel(float const *, int, float*, int, int, int, int, int, int, int, float, float)
     38.39%  482.31ms      1001  481.83us  475.74us  492.16us  findPeakLoop_kernel(MaxCandidate*, float const *, int)
      1.87%  23.450ms         2  11.725ms  11.721ms  11.728ms  [CUDA memcpy HtoD]
      1.01%  12.715ms      1002  12.689us  2.1760us  10.502ms  [CUDA memcpy DtoH]

In its default summary mode, nvprof presents an overview of the GPU kernels and memory copies in your application. The summary groups all calls to the same kernel together, presenting the total time and percentage of the total application time for each kernel. In addition to summary mode, nvprof supports GPU-Trace and API-Trace modes that let you see a complete list of all kernel launches and memory copies, and in the case of API-Trace mode, all CUDA API calls. Continue reading
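If the summary raises questions, the trace modes are just a flag away. The invocations below are a quick sketch of how I typically run them; ./myApp stands in for your own binary.

    # GPU-Trace mode: one line per kernel launch and memory copy, with start
    # times, durations, grid/block dimensions, and more.
    nvprof --print-gpu-trace ./myApp

    # API-Trace mode: one line per CUDA runtime/driver API call made by the host.
    nvprof --print-api-trace ./myApp

    # Save the profile to a file that the NVIDIA Visual Profiler can import.
    nvprof --output-profile myApp.nvprof ./myApp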

CUDA Pro Tip: Generate Custom Application Profile Timelines with NVTX

The last time you used the timeline feature in the NVIDIA Visual Profiler or Nsight to analyze a complex application, you might have wished to see more than just CUDA API calls and GPU kernels. Most applications do significant work on both the CPU and the GPU, so it would be helpful to see in more detail which CPU functions are taking time. This can help identify the sources of idle GPU time, for example.

In this post I will show you how to use the NVIDIA Tools Extension (NVTX) to annotate the timeline with useful information. I will demonstrate how to add time ranges by calling the NVTX API from your application or library. This can be a tedious task for complex applications with deeply nested call graphs, so I will also explain how to use compiler instrumentation to automate it.

What is the NVIDIA Tools Extension (NVTX)?

The NVIDIA Tools Extension (NVTX) is an application interface to the NVIDIA profiling tools, including the NVIDIA Visual Profiler, Nsight Eclipse Edition, and Nsight Visual Studio Edition. NVTX allows you to annotate the profiler timeline with events and ranges, to customize their appearance, and to assign names to resources such as CPU threads and devices.
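As a quick taste of the C API (a minimal sketch, not the example used in this post), wrapping a hypothetical CPU function in a named range looks roughly like this; it requires including nvToolsExt.h and linking against -lnvToolsExt.

#include <nvToolsExt.h>  // NVTX C API

void init_host_data(float *x, int n)  // hypothetical function for illustration
{
  nvtxRangePushA("init_host_data");   // open a named range on the current CPU thread
  for (int i = 0; i < n; ++i)
    x[i] = 1.0f;
  nvtxRangePop();                     // close the innermost open range
}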

Let’s use the following source code as the basis for our example. (This code is incomplete, but complete examples are available in the Parallel Forall Github repository.) Continue reading

CUDA Pro Tip: View Assembly Code Correlation in Nsight Visual Studio Edition

While high-level languages for GPU programming like CUDA C offer a useful level of abstraction, convenience, and maintainability, they inherently hide some of the details of the execution on the hardware. It is sometimes helpful to dig into the underlying assembly code that the hardware is executing to explore performance problems, or to make sure the compiler is generating the code you expect. Reading assembly language is tedious and challenging; thankfully Nsight Visual Studio Edition can help by showing you the correlation between lines in your high-level source code and the executed assembly instructions.

As Mark Harris explained in the previous CUDA Pro Tip, there are two compilation stages required before a kernel written in CUDA C can be executed on the GPU. The first stage compiles the high-level C code into the PTX virtual GPU ISA. The second stage compiles PTX into the actual ISA of the hardware, called SASS (details of SASS can be found in the cuobjdump.pdf installed in the doc folder of the CUDA Toolkit). The hardware ISA generally differs between GPU architectures. To allow forward compatibility, the second compilation stage can be done either as part of normal compilation with nvcc or at runtime by the JIT compiler integrated in the driver.
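As a rough sketch of how the two stages map onto nvcc options (the compute capability numbers below are arbitrary examples), you can embed device SASS, PTX for the driver’s JIT compiler, or both:

    # Embed SASS for sm_35 hardware only (no PTX, so no JIT fallback).
    nvcc -gencode arch=compute_35,code=sm_35 -o app app.cu

    # Embed PTX only; the driver JIT-compiles it for whatever GPU runs the app.
    nvcc -gencode arch=compute_35,code=compute_35 -o app app.cu

    # Embed both SASS and PTX for forward compatibility.
    nvcc -gencode arch=compute_35,code=sm_35 \
         -gencode arch=compute_35,code=compute_35 -o app app.cu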

It is possible to manually extract the PTX or SASS from a cubin or executable using the cuobjdump tool included with the CUDA Toolkit. Nsight Visual Studio Edition makes it easier by showing the correlation between lines of CUDA C, PTX, and SASS. Continue reading
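If you do want to extract the code by hand, a couple of cuobjdump invocations along these lines do the job (app stands for your own executable or cubin):

    # Dump the PTX embedded in an executable or cubin.
    cuobjdump -ptx app

    # Dump the SASS (the hardware assembly actually executed).
    cuobjdump -sass app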

CUDA Pro Tip: Clean Up After Yourself to Ensure Correct Profiling

NVIDIA’s profiling and tracing tools, including the NVIDIA Visual Profiler, the Nsight Eclipse and Visual Studio editions, cuda-memcheck, and the nvprof command-line profiler, are powerful tools that can give you deep insight into the performance and correctness of your GPU-accelerated applications. These tools gather data while your application is running, and use it to create profiles, application API traces, automatic optimization guidance, and, in the case of cuda-memcheck, memory leak and race checking.

To improve tracing performance and reduce overhead in the target application, these tools internally buffer the data they gather and flush it to disk at various points, including stream synchronization, context synchronization, context destruction, and when the internal buffer is full. For technical reasons, it is not always possible to flush the data automatically on application exit. Therefore, you should clean up your application’s CUDA objects properly to make sure that the profiler is able to store all gathered data. This means not only freeing memory allocated on the GPU, but also resetting the device context.

If your application uses the CUDA Runtime API, call cudaDeviceReset() just before exiting, or when the application finishes making CUDA calls and using device data. If your application uses the CUDA Driver API, call cuProfilerStop() on each context to flush the profiling buffers before destroying the context with cuCtxDestroy().
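For the Runtime API case, a minimal sketch of the clean-up looks like this (error checking omitted; d_data is a placeholder for your own allocations):

#include <cuda_runtime.h>

int main(void)
{
  float *d_data;                       // placeholder device allocation
  cudaMalloc(&d_data, 1024 * sizeof(float));

  // ... launch kernels and copy results back here ...

  cudaFree(d_data);                    // free device memory as usual

  // Reset the device last, so the profiler can flush all of its buffers.
  cudaDeviceReset();
  return 0;
}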

Without resetting the device, applications that don’t synchronize before they exit may produce incomplete profile traces. With this simple clean-up step, you can be sure you get an accurate profile.

How to Optimize Data Transfers in CUDA C/C++

In the previous three posts of this CUDA C & C++ series we laid the groundwork for the major thrust of the series: how to optimize CUDA C/C++ code. In this and the following post we begin our discussion of code optimization with how to efficiently transfer data between the host and device. The peak bandwidth between the device memory and the GPU is much higher (144 GB/s on the NVIDIA Tesla C2050, for example) than the peak bandwidth between host memory and device memory (8 GB/s on PCIe x16 Gen2). This disparity means that your implementation of data transfers between the host and GPU devices can make or break your overall application performance. Let’s start with a few general guidelines for host-device data transfers.

  • Minimize the amount of data transferred between host and device when possible, even if that means running kernels on the GPU that get little or no speed-up compared to running them on the host CPU.
  • Higher bandwidth is possible between the host and the device when using page-locked (or “pinned”) memory.
  • Batching many small transfers into one larger transfer performs much better because it eliminates most of the per-transfer overhead.
  • Data transfers between the host and device can sometimes be overlapped with kernel execution and other data transfers.

We investigate the first three guidelines above in this post, and we dedicate the next post to overlapping data transfers. First I want to talk about how to measure time spent in data transfers without modifying the source code. Continue reading
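As a minimal sketch of the second guideline above, allocating page-locked host memory with the runtime API looks roughly like this (the size and usage pattern are illustrative only):

#include <cuda_runtime.h>

int main(void)
{
  const int N = 1 << 20;
  float *h_pinned, *d_data;

  // Page-locked (pinned) host allocation: transfers from pinned memory can run
  // closer to peak PCIe bandwidth than transfers from pageable malloc'd memory.
  cudaMallocHost((void**)&h_pinned, N * sizeof(float));
  cudaMalloc(&d_data, N * sizeof(float));

  for (int i = 0; i < N; ++i) h_pinned[i] = 1.0f;

  cudaMemcpy(d_data, h_pinned, N * sizeof(float), cudaMemcpyHostToDevice);

  cudaFree(d_data);
  cudaFreeHost(h_pinned);   // pinned memory is released with cudaFreeHost, not free()
  return 0;
}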

How to Optimize Data Transfers in CUDA Fortran

In the previous three posts of this CUDA Fortran series we laid the groundwork for the major thrust of the series: how to optimize CUDA Fortran code. In this and the following post we begin our discussion of code optimization with how to efficiently transfer data between the host and device. The peak bandwidth between the device memory and the GPU is much higher (144 GB/s on the NVIDIA Tesla C2050, for example) than the peak bandwidth between host memory and device memory (8 GB/s on PCIe x16 Gen2). This disparity means that your implementation of data transfers between the host and GPU devices can make or break your overall application performance. Let’s start with a few general guidelines for host-device data transfers.

  • Minimize the amount of data transferred between host and device when possible, even if that means running kernels on the GPU that get little or no speed-up compared to running them on the host CPU.
  • Higher bandwidth is possible between the host and the device when using page-locked (or “pinned”) memory.
  • Batching many small transfers into one larger transfer performs much better because it eliminates most of the per-transfer overhead.
  • Data transfers between the host and device can sometimes be overlapped with kernel execution and other data transfers.

We investigate the first three guidelines above in this post, and we dedicate the next post to overlapping data transfers. First I want to talk about how to measure time spent in data transfers without modifying the source code. Continue reading

How to Implement Performance Metrics in CUDA C/C++

In the first post of this series we looked at the basic elements of CUDA C/C++ by examining a CUDA C/C++ implementation of SAXPY. In this second post we discuss how to analyze the performance of this and other CUDA C/C++ codes. We will rely on these performance measurement techniques in future posts where performance optimization will be increasingly important.

CUDA performance measurement is most commonly done from host code, and can be implemented using either CPU timers or CUDA-specific timers. Before we jump into these performance measurement techniques, we need to discuss how to synchronize execution between the host and device.

Host-Device Synchronization

Let’s take a look at the data transfers and kernel launch of the SAXPY host code from the previous post:

cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);

saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);

cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);

The data transfers between the host and device using cudaMemcpy() are synchronous (or blocking) transfers. Synchronous data transfers do not begin until all previously issued CUDA calls have completed, and subsequent CUDA calls cannot begin until the synchronous transfer has completed. Therefore the saxpy kernel launch on the third line will not issue until the transfer from y to d_y on the second line has completed. Kernel launches, on the other hand, are asynchronous. Once the kernel is launched on the third line, control returns immediately to the CPU and does not wait for the kernel to complete. While this might seem to set up a race condition for the device-to-host data transfer in the last line, the blocking nature of the data transfer ensures that the kernel completes before the transfer begins. Continue reading
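As a preview of the CUDA-specific timer approach, a sketch of timing the saxpy launch with CUDA events looks like this (it assumes the same N, d_x, and d_y as the snippet above):

cudaEvent_t start, stop;
float milliseconds = 0.0f;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);                            // enqueue the start marker
saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);
cudaEventRecord(stop);                             // enqueue the stop marker

cudaEventSynchronize(stop);                        // wait until stop has been recorded
cudaEventElapsedTime(&milliseconds, start, stop);  // GPU time between the two markers

cudaEventDestroy(start);
cudaEventDestroy(stop);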

How to Implement Performance Metrics in CUDA Fortran

In the first post of this series we looked at the basic elements of CUDA Fortran by examining a CUDA Fortran implementation of SAXPY. In this second post we discuss how to analyze the performance of this and other CUDA Fortran codes. We will rely on these performance measurement techniques in future posts where performance optimization will be increasingly important.

CUDA performance measurement is most commonly done from host code, and can be implemented using either CPU timers or CUDA-specific timers. Before we jump into these performance measurement techniques, we need to discuss how to synchronize execution between the host and device.

Host-Device Synchronization

Let’s take a look at the data transfers and kernel launch of the SAXPY host code from the previous post:

x_d = x
y_d = y
call saxpy<<<grid, tBlock>>>(x_d, y_d, a)
y = y_d

The data transfers between the host and device performed by assignment statements are synchronous (or blocking) transfers. Synchronous data transfers do not begin until all previously issued CUDA calls have completed, and subsequent CUDA calls cannot begin until the synchronous transfer has completed. Therefore the saxpy kernel launch on the third line will not be issued until the transfer y_d = y on the second line has completed. Kernel launches, on the other hand, are asynchronous. Once the kernel is launched on the third line, control returns immediately to the CPU and does not wait for the kernel to complete. While this might seem to set up a race condition for the device-to-host data transfer in the last line, the blocking nature of the data transfer ensures that the kernel completes before the transfer begins. Continue reading