In this third post of the CUDA Fortran series we discuss various characteristics of the wide range of CUDA-capable GPUs, how to query device properties from within a CUDA Fortran program, and how to handle errors.
Querying Device Properties
In our last post, about performance metrics, we discussed how to compute the theoretical peak bandwidth of a GPU. This calculation used the GPU’s memory clock rate and bus interface width, which we obtained from product literature. The following CUDA Fortran code demonstrates a more general approach, calculating the theoretical peak bandwidth by querying the attached device (or devices) for the needed information.
    program peakBandwidth
      use cudafor
      implicit none

      integer :: i, istat, nDevices
      type (cudaDeviceProp) :: prop

      istat = cudaGetDeviceCount(nDevices)
      do i = 0, nDevices-1
         istat = cudaGetDeviceProperties(prop, i)
         write(*,"(' Device Number: ',i0)") i
         write(*,"(' Device name: ',a)") trim(prop%name)
         write(*,"(' Memory Clock Rate (KHz): ', i0)") &
              prop%memoryClockRate
         write(*,"(' Memory Bus Width (bits): ', i0)") &
              prop%memoryBusWidth
         write(*,"(' Peak Memory Bandwidth (GB/s): ', f6.2)") &
              2.0*prop%memoryClockRate*(prop%memoryBusWidth/8)/10.0**6
         write(*,*)
      enddo
    end program peakBandwidth
This code uses the function cudaGetDeviceCount(), which returns in the argument nDevices the number of CUDA-capable devices attached to this system. Then in a loop we calculate the theoretical peak bandwidth for each device. CUDA enumerates devices starting from 0, hence the do loop from 0 to nDevices-1. The body of the loop uses cudaGetDeviceProperties() to populate the fields of the variable prop, which is an instance of the derived type cudaDeviceProp (defined in the cudafor module). The program uses only three of the many fields of the cudaDeviceProp type: name, memoryClockRate, and memoryBusWidth.
When I compile (using the 12.6 version of the PGI compilers or newer) and run this code on a machine with a single NVIDIA Tesla C2050, I get the following result:
    Device Number: 0
    Device name: Tesla C2050
    Memory Clock Rate (KHz): 1500000
    Memory Bus Width (bits): 384
    Peak Memory Bandwidth (GB/s): 144.00

This is the same value for theoretical peak bandwidth that we calculated in the previous post. When I compile and run the same code on my (rather old) laptop computer, I get the following output:
    Device Number: 0
    Device name: GeForce 8600M GT
    Memory Clock Rate (KHz): 635000
    Memory Bus Width (bits): 128
    Peak Memory Bandwidth (GB/s): 20.32
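Both reported values can be checked by hand from the expression in the code. The clock rate is reported in kHz and the bus width in bits, so dividing the width by 8 gives bytes per transfer, and dividing the product by 10^6 converts kB/s to GB/s. The factor of 2 comes straight from the code; I take it to account for double data rate memory, which is an assumption on my part rather than something the queried fields state.

```latex
\begin{align*}
BW_{\text{C2050}}    &= \frac{2 \times 1\,500\,000 \times (384/8)}{10^{6}} = 144.00\ \text{GB/s} \\
BW_{\text{8600M GT}} &= \frac{2 \times 635\,000 \times (128/8)}{10^{6}} = 20.32\ \text{GB/s}
\end{align*}
```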
There are many other fields in the cudaDeviceProp type which describe the amounts of various types of memory, limits on thread block sizes, and many other characteristics of the GPU. We could extend the above code to print out all such data, but the standalone utility pgaccelinfo provided with the PGI compilers already performs this.
We will discuss many of the device attributes contained in the cudaDeviceProp type in future posts of this series, but I want to mention two important fields here: major and minor. These describe the compute capability of the device, which is typically given in major.minor format and indicates the architecture generation. The first CUDA-capable device in the Tesla product line was the Tesla C870, which has a compute capability of 1.0. The first double-precision capable GPUs, such as Tesla C1060, have compute capability 1.3. GPUs of the Fermi architecture, such as the Tesla C2050 used above, have compute capabilities of 2.x, and GPUs of the Kepler architecture have compute capabilities of 3.x. Many limits related to the execution configuration vary with compute capability, as shown in the following table.
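|                              | Tesla C870 | Tesla C1060 | Tesla C2050 | Tesla K10 | Tesla K20 |
|------------------------------|------------|-------------|-------------|-----------|-----------|
| Max Threads per Thread Block | 512        | 512         | 1024        | 1024      | 1024      |
| Max Threads per SM           | 768        | 1024        | 1536        | 2048      | 2048      |
| Max Thread Blocks per SM     | 8          | 8           | 8           | 16        | 16        |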
In the first post of this series we mentioned that the grouping of threads into thread blocks mimics how thread processors are grouped on the GPU. This group of thread processors is called a streaming multiprocessor, denoted SM in the table above. The CUDA execution model issues thread blocks on multiprocessors, and once issued they do not migrate to other SMs. Multiple thread blocks can concurrently reside on a multiprocessor subject to available resources (on-chip registers and shared memory) and the limit shown in the last row of the table. The limits on threads and thread blocks in this table are associated with the compute capability and not just a particular device: all devices of the same compute capability have the same limits. There are other characteristics, however, such as the number of multiprocessors per device, that depend on the particular device and not the compute capability. All of these characteristics, whether defined by the particular device or its compute capability, can be obtained using the cudaDeviceProp type.
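As a minimal sketch of that last point, the program below prints the compute capability and a few limits for each attached device. The program name deviceLimits is hypothetical, and I am assuming that the fields major, minor, multiProcessorCount, maxThreadsPerBlock, and maxThreadsPerMultiProcessor are all present in the cudaDeviceProp type shipped with your compiler version; check the cudafor documentation if any of them fails to compile.

```fortran
program deviceLimits
  use cudafor
  implicit none

  integer :: i, istat, nDevices
  type (cudaDeviceProp) :: prop

  istat = cudaGetDeviceCount(nDevices)

  ! CUDA numbers devices from 0
  do i = 0, nDevices-1
     istat = cudaGetDeviceProperties(prop, i)
     write(*,"(' Device Number: ',i0)") i
     ! Compute capability in major.minor form
     write(*,"('   Compute Capability: ',i0,'.',i0)") &
          prop%major, prop%minor
     ! Depends on the particular device, not the compute capability
     write(*,"('   Number of Multiprocessors: ',i0)") &
          prop%multiProcessorCount
     ! These two correspond to the first two rows of the table
     write(*,"('   Max Threads per Thread Block: ',i0)") &
          prop%maxThreadsPerBlock
     write(*,"('   Max Threads per SM: ',i0)") &
          prop%maxThreadsPerMultiProcessor
     write(*,*)
  enddo
end program deviceLimits
```

The output depends entirely on the hardware present, so none is shown here.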
You can generate code for a specific compute capability by using the compiler option -Mcuda=ccXX, where XX indicates the compute capability (without the decimal point between major and minor). To see a list of compute capabilities for which a particular version of the compiler can generate code, along with other CUDA-related compiler options, issue the command pgf90 -Mcuda -help.
When you specify an execution configuration for a kernel, keep in mind (and query at run time) the limits in the table above. This is especially important for the second execution configuration parameter: the number of threads per thread block. If you specify too few threads per block, then the limit on thread blocks per multiprocessor will limit the amount of parallelism that can be achieved. If you specify too many threads per thread block, well, that brings us to the next section.
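One way to respect the block-size limit at run time is to clamp the requested number of threads per block to the value reported by the device before building the execution configuration. The following is only a sketch: the module launchConfig, the functions clampThreads and numBlocks, and the values of n and requested are all hypothetical names and numbers chosen for illustration.

```fortran
module launchConfig
  implicit none
contains
  ! Return the requested threads per block, reduced if necessary
  ! so that it does not exceed the device limit maxThreads.
  pure integer function clampThreads(requested, maxThreads) result(tpb)
    integer, intent(in) :: requested, maxThreads
    tpb = min(requested, maxThreads)
  end function clampThreads

  ! Number of thread blocks needed to cover n elements (rounds up).
  pure integer function numBlocks(n, tpb) result(nb)
    integer, intent(in) :: n, tpb
    nb = (n + tpb - 1) / tpb
  end function numBlocks
end module launchConfig

program chooseConfig
  use cudafor
  use launchConfig
  implicit none

  integer, parameter :: n = 1000000       ! hypothetical problem size
  integer, parameter :: requested = 2048  ! deliberately too large
  integer :: istat, tpb
  type (cudaDeviceProp) :: prop
  type (dim3) :: grid, tBlock

  ! Query device 0 for its limit on threads per thread block
  istat = cudaGetDeviceProperties(prop, 0)

  tpb    = clampThreads(requested, prop%maxThreadsPerBlock)
  tBlock = dim3(tpb, 1, 1)
  grid   = dim3(numBlocks(n, tpb), 1, 1)

  write(*,"(' Threads per block: ',i0)") tBlock%x
  write(*,"(' Thread blocks:     ',i0)") grid%x
end program chooseConfig
```

Clamping only guarantees a legal launch; it says nothing about whether the resulting block size performs well, which also depends on the per-multiprocessor limits in the table.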
Handling CUDA Errors
All CUDA Fortran API functions have an integer return value which you can use to check for errors that occur during execution. In the example above we can check for successful completion of cudaGetDeviceCount() like this:
    istat = cudaGetDeviceCount(nDevices)
    if (istat /= cudaSuccess) write(*,*) cudaGetErrorString(istat)
We check to make sure cudaGetDeviceCount() returns the value cudaSuccess, which is defined in the cudafor module. If there is an error, then we call the function cudaGetErrorString() to get a character string describing the error.
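Repeating this check after every API call gets tedious, so it can be convenient to wrap it in a small helper. The sketch below is one possible shape for such a helper, not part of the cudafor module: the module cudaCheck, the subroutine checkCuda, and the decision to stop the program on error are all my own assumptions.

```fortran
module cudaCheck
  use cudafor
  implicit none
contains
  ! Hypothetical helper: if istat is not cudaSuccess, print the
  ! label and the runtime's error string, then stop the program.
  subroutine checkCuda(istat, label)
    integer, intent(in) :: istat
    character(len=*), intent(in) :: label
    if (istat /= cudaSuccess) then
       write(*,*) trim(label), ': ', trim(cudaGetErrorString(istat))
       stop 1
    end if
  end subroutine checkCuda
end module cudaCheck

program checkedQuery
  use cudafor
  use cudaCheck
  implicit none

  integer :: istat, nDevices
  type (cudaDeviceProp) :: prop

  istat = cudaGetDeviceCount(nDevices)
  call checkCuda(istat, 'cudaGetDeviceCount')

  ! Only query device 0 if at least one device was found
  if (nDevices > 0) then
     istat = cudaGetDeviceProperties(prop, 0)
     call checkCuda(istat, 'cudaGetDeviceProperties')
     write(*,"(' Device name: ',a)") trim(prop%name)
  end if
end program checkedQuery
```

Stopping on the first error suits small test programs; a larger application might prefer to return the status to the caller instead.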
Handling kernel errors is a bit more complicated since kernels are subroutines and do not have a return value. In addition, kernels execute asynchronously with respect to the host. To aid in error checking kernel execution, as well as other asynchronous operations, the CUDA runtime maintains an error variable that is overwritten each time an error occurs. The function cudaPeekAtLastError() returns the current error state, while the function cudaGetLastError() returns the error state and resets it to cudaSuccess.
We can check for errors in the saxpy kernel used in the first post of this series as follows.
    call saxpy<<<grid, tBlock>>>(x_d, y_d, a)
    ierrSync = cudaGetLastError()
    ierrAsync = cudaDeviceSynchronize()
    if (ierrSync /= cudaSuccess) write(*,*) &
         'Sync kernel error:', cudaGetErrorString(ierrSync)
    if (ierrAsync /= cudaSuccess) write(*,*) &
         'Async kernel error:', cudaGetErrorString(ierrAsync)
This code checks for both synchronous and asynchronous errors. Invalid execution configuration parameters, e.g. too many threads per thread block, are reflected in the value of ierrSync returned by cudaGetLastError(). (The CUDA Fortran compiler catches many of the synchronous errors, but it is a good idea to explicitly check as well.) Asynchronous errors which occur on the device after control is returned to the host, such as out-of-bounds memory accesses, require a synchronization mechanism such as cudaDeviceSynchronize(), which blocks the host thread until all previously issued commands have completed. Any asynchronous error is returned by cudaDeviceSynchronize(). We can also check for asynchronous errors and reset the runtime error state by modifying the last statement to call cudaGetLastError().
    if (ierrAsync /= cudaSuccess) write(*,*) &
         'Async kernel error:', cudaGetErrorString(cudaGetLastError())
Device synchronization is expensive, because it causes the entire device to wait, destroying any potential for concurrency at that point in your program. So use it with care. Typically, I use preprocessor macros to insert asynchronous error checking only in debug builds of my code, and not in release builds.
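A minimal sketch of that approach follows, built around a saxpy kernel like the one from the first post. The macro name DEBUG is my own choice, and the example assumes the source file is run through the preprocessor, for instance by adding -Mpreprocess and -DDEBUG to the compile line; check your compiler documentation for the exact options it expects.

```fortran
module mathOps
contains
  attributes(global) subroutine saxpy(x, y, a)
    implicit none
    real :: x(:), y(:)
    real, value :: a
    integer :: i, n
    n = size(x)
    i = blockDim%x * (blockIdx%x - 1) + threadIdx%x
    if (i <= n) y(i) = y(i) + a*x(i)
  end subroutine saxpy
end module mathOps

program testSaxpy
  use mathOps
  use cudafor
  implicit none

  integer, parameter :: N = 40000
  real :: x(N), y(N), a
  real, device :: x_d(N), y_d(N)
  type (dim3) :: grid, tBlock
#ifdef DEBUG
  integer :: ierrSync, ierrAsync
#endif

  tBlock = dim3(256, 1, 1)
  grid   = dim3(ceiling(real(N)/tBlock%x), 1, 1)

  x = 1.0; y = 2.0; a = 2.0
  x_d = x
  y_d = y

  call saxpy<<<grid, tBlock>>>(x_d, y_d, a)

#ifdef DEBUG
  ! Debug builds only: the synchronization below blocks the host,
  ! so this whole section is compiled out of release builds.
  ierrSync  = cudaGetLastError()
  ierrAsync = cudaDeviceSynchronize()
  if (ierrSync /= cudaSuccess) write(*,*) &
       'Sync kernel error: ', cudaGetErrorString(ierrSync)
  if (ierrAsync /= cudaSuccess) write(*,*) &
       'Async kernel error: ', cudaGetErrorString(ierrAsync)
#endif

  ! The device-to-host copy runs in both builds
  y = y_d
  write(*,*) 'Max error: ', maxval(abs(y - 4.0))
end program testSaxpy
```

Without -DDEBUG the checking code disappears entirely, so the release build pays no synchronization cost beyond what the final data transfer already implies.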
Now you know how to query CUDA device properties and handle errors in CUDA Fortran programs. These are very important concepts for writing robust CUDA applications.
In the first three posts of this series we have covered some of the basics of writing CUDA Fortran programs, focusing on the basic programming model and the syntax of writing simple examples. We discussed timing code and performance metrics in the second post, but we have yet to use these tools in optimizing our code. We’ll begin optimizing in the next post, starting with data transfers between the host and device.