How to Implement Performance Metrics in CUDA Fortran

In the first post of this series we looked at the basic elements of CUDA Fortran by examining a CUDA Fortran implementation of SAXPY. In this second post we discuss how to analyze the performance of this and other CUDA Fortran codes. We will rely on these performance measurement techniques in future posts where performance optimization will be increasingly important.

CUDA performance measurement is most commonly done from host code, and can be implemented using either CPU timers or CUDA-specific timers. Before we jump into these performance measurement techniques, we need to discuss how execution between the host and device are synchronized.

Host-Device Synchronization

Let’s take a look at the data transfers and kernel launch of the SAXPY host code from the previous post:

x_d = x 
y_d = y 
call saxpy<<<grid, <span="" class="hiddenSpellError" pre="">tBlock>>>(x_d, y_d, a) 
y = y_d

The data transfers between the host and device performed by assignment statements are synchronous (or blocking) transfers. Synchronous data transfers do not begin until all previously issued CUDA calls have completed, and subsequent CUDA calls cannot begin until the synchronous transfer has completed. Therefore the saxpy kernel launch on the third line will not be issued until the transfer y_d = y on the second line has completed. Kernel launches, on the other hand, are asynchronous. Once the kernel is launched on the third line, control returns immediately to the CPU and does not wait for the kernel to complete. While this might seem to set up a race condition for the device-to-host data transfer in the last line, the blocking nature of the data transfer ensures that the kernel completes before the transfer begins.

Timing Kernel Execution with CPU Timers

Now let’s take a look at how to time the kernel execution using a CPU timer.

x_d = x 
y_d = y 
t1 = myCPUTimer()
call saxpy<<<grid, tblock="">>>(x_d, y_d, a) 
istat = cudaDeviceSynchronize()
t2 = myCPUTimer()
y = y_d

In the above code listing, in addition to the two calls to the generic host time-stamp function myCPUTimer(), we use the explicit synchronization barrier cudaDeviceSynchronize() to block CPU execution until all previously issued commands on the device have completed. If this barrier were not used, this code segment would measure the kernel launch time rather than the kernel execution time.

Timing using CUDA Events

A problem with using host-device synchronization points, such as cudaDeviceSynchronize(), is that they stall the GPU pipeline. For this reason, CUDA offers a relatively light-weight alternative to CPU timers: the CUDA event API. The CUDA event API includes calls to create and destroy events, record events, and compute the elapsed time in milliseconds between two recorded events.

CUDA events make use of the concept of CUDA streams. A CUDA stream is simply a sequence of operations that are performed in order on the device. Operations in different streams can be interleaved and in some cases overlapped—a property that can be used to hide data transfers between the host and the device (we will discuss this in detail later). Up to now, all operations on the GPU have occurred in the default stream, or stream 0 (also called the “Null Stream”).

In the following listing we apply CUDA events to our SAXPY code.

type(cudaEvent) :: startEvent, stopEvent
real :: time
...
istat = cudaEventCreate(startEvent)
istat = cudaEventCreate(stopEvent)

x_d = x 
y_d = y 
istat = cudaEventRecord(startEvent,0)
call saxpy<<<grid, tblock="">>>(x_d, y_d, a) 
istat = cudaEventRecord(stopEvent,0)
istat = cudaEventSynchronize(stopEvent)
istat = cudaEventElapsedTime(time, startEvent, stopEvent)
y = y_d

istat = cudaEventDestroy(startEvent)
istat = cudaEventDestroy(stopEvent)

CUDA events are of type cudaEvent and are created and destroyed with cudaEventCreate() and cudaEventDestroy(). In the above code cudaEventRecord() is used to place the start and stop events into the default stream, stream 0. The device will record a time stamp for the event when it reaches that event in the stream. The function cudaEventSynchronize() blocks CPU execution until the specified event is recorded. The cudaEventElapsedTime() function returns in the first argument the number of milliseconds elapsed between the recording of startEvent and stopEvent. This value has a resolution of approximately one half microsecond.

Memory Bandwidth

Now that we have a means of accurately timing kernel execution, we will use it to calculate bandwidth. When evaluating bandwidth efficiency, we use both the theoretical peak bandwidth and the observed or effective memory bandwidth.

Theoretical Bandwidth

Theoretical bandwidth can be calculated using hardware specifications available in the product literature. For example, the NVIDIA Tesla C2050 GPU uses DDR (double data rate) RAM with a memory clock rate of 1,500 MHz and a 384-bit wide memory interface. Using these data items, the peak theoretical memory bandwidth of the NVIDIA Tesla C2050 is 144 GB/sec, as computed in the following.

BWTheoretical = 1500 * 106 * (384/8) * 2 / 109 = 144 GB/s

In this calculation, the memory clock rate is converted to Hz, multiplied by the interface width (divided by 8, to convert bits to bytes) and multiplied by 2 due to the double data rate. Finally, this product is divided by 109 to convert the result to GB/s.

Effective Bandwidth

Effective bandwidth is calculated by timing specific program activities and using our knowledge of how data is accessed by the program. To do so, we use the following equation.

BWEffective = (RB + WB) / (t * 109)

Here, BWEffective is the effective bandwidth in GB/s, RB is the number of bytes read per kernel, WB is the number of bytes written per kernel, and t is the elapsed time in seconds.

We can modify our SAXPY example to calculate the effective bandwidth.  The complete code is:

module mathOps
contains
  attributes(global) subroutine saxpy(x, y, a)
    implicit none
    real :: x(:), y(:)
    real, value :: a 
    integer :: i, n
    n = size(x)
    i = blockDim%x * (blockIdx%x - 1) + threadIdx%x
    if (i <= n) y(i) = y(i) + a*x(i)
  end subroutine saxpy
end module mathOps

program testSaxpy
  use mathOps
  use cudafor
  implicit none
  integer, parameter :: N = 20*1024*1024
  real :: x(N), y(N), a
  real, device :: x_d(N), y_d(N)
  type(dim3) :: grid, tBlock
  type (cudaEvent) :: startEvent, stopEvent
  real :: time
  integer :: istat

  tBlock = dim3(512,1,1)
  grid = dim3(ceiling(real(N)/tBlock%x),1,1)

  istat = cudaEventCreate(startEvent)
  istat = cudaEventCreate(stopEvent)

  x = 1.0; y = 2.0; a = 2.0
  x_d = x
  y_d = y
  istat = cudaEventRecord(startEvent, 0)
  call saxpy<<<grid, tblock="">>>(x_d, y_d, a)
  istat = cudaEventRecord(stopEvent, 0)
  istat = cudaEventSynchronize(stopEvent)
  istat = cudaEventElapsedTime(time, startEvent, stopEvent)

  y = y_d
  write(*,*) 'Max error: ', maxval(abs(y-4.0))
  write(*,*) 'Effective Bandwidth (GB/s): ', N*4*3/time/10**6

  istat = cudaEventDestroy(startEvent)
  istat = cudaEventDestroy(stopEvent)

end program testSaxpy

In the bandwidth calculation, N*4 is the number of bytes transferred per array read or write, and the factor of three represents the reading of x and the reading and writing of y. The elapsed time in the variable time is in milliseconds. Note that in addition to adding the functionality needed for the bandwidth calculation, we have also changed the array size and the thread-block size. Compiling and running this code on a Tesla C2050 we have:

$ pgf90 -o saxpyTimed saxpyTimed.cuf 
$ ./saxpyTimed 
 Max error:     0.000000    
 Effective Bandwidth (GB/s):     90.78056

Measuring Computational Throughput

We just demonstrated how to measure bandwidth, which is a measure of data throughput. Another metric very important to performance is computational throughput. A common measure of computational throughput is GFLOP/s, which stands for “Giga-FLoating-point OPerations per second”, where Giga is the prefix for 109. For our SAXPY computation, measuring effective throughput is simple: each SAXPY element does a multiply-add operation, which is typically measured as two FLOPs, so we have

GFLOP/s effective = 2N / (t * 109)

is the number of elements in our SAXPY operation, and t is the elapsed time in seconds. Like theoretical peak bandwidth, theoretical peak GFLOP/s can be gleaned from the product literature (but calculating it can be a bit tricky because it is very architecture-dependent). For example, the Tesla C2050 GPU has a theoretical peak single-precision floating point throughput of 1030 GFLOP/s, and a theoretical peak double-precision throughput of 515 GFLOP/s.

SAXPY reads 12 bytes per element computed, but performs only a single multiply-add instruction (2 FLOPs), so it’s pretty clear that it should be bandwidth bound, and so in this case (in fact in many cases), bandwidth is the most important metric to measure and optimize. In more sophisticated computations, measuring performance at the level of FLOPs can be very difficult. Therefore it’s more common to use profiling tools to get an idea of whether computational throughput is a bottleneck. Applications often provide throughput metrics that are problem-specific (rather than architecture specific) and therefore more useful to the user. For example, “Billion Interactions per Second” for astronomical n-body problems, or “nanoseconds per day” for molecular dynamics simulations.

Summary

In this post we described how you can time kernel execution using the CUDA event API. CUDA events use the GPU timer and therefore avoid the problems associated with host-device synchronization. We presented the effective bandwidth and computational throughput performance metrics, and we implemented effective bandwidth in the SAXPY kernel. A large percentage of kernels are memory bandwidth bound, so calculation of the effective bandwidth is a good first step in performance optimization. In a future post we will discuss how to determine which factor—bandwidth, instructions, or latency—is the limiting factor in performance.

CUDA events can also be used to determine the data transfer rate between host and device, by recording events on either side of the assignment statements that perform the transfers.

If you run the code from this post on a smaller GPU, you may get an error message regarding insufficient device memory unless you reduce the array sizes. In the next post, we look at how error handling is performed and how to query the present devices to determine their available resources.

∥∀

About Greg Ruetsch

Greg Ruetsch is a Senior Applied Engineer at NVIDIA, where he works on CUDA Fortran and performance optimization of HPC codes. He holds a Bachelor’s degree in mechanical and aerospace engineering from Rutgers University and a Ph.D. in applied mathematics from Brown University. Prior to joining NVIDIA he has held research positions at Stanford University’s Center for Turbulence Research and Sun Microsystems Laboratories.