Pro Tip: Pinpointing Runtime Errors in CUDA Fortran

CUDA Fortran for Scientists and Engineers shows how high-performance application developers can leverage the power of GPUs using Fortran.
CUDA Fortran for Scientists and Engineers shows how high-performance application developers can leverage the power of GPUs using Fortran.

We’ve all been there. Your CUDA Fortran code is humming along and suddenly you get a runtime error: copyin, copyout, usually accompanied by FAILED in all caps.  In many cases, the error message gives you enough information to find where the problem is in your source code: you have a copyin runtime error and you only perform a few host-to-device transfers, or your code ran fine before you added that block of code earlier today—either way, you know where to look.

Then there are the cases where you run across a runtime error in a huge code with many device arrays, and many transfers between host and device occur when staging MPI transfers. How do you track down the line of source code that caused the error in this case?  Using cuda-gdb becomes unwieldy with large MPI codes. There is always the print statement approach, but do you really want to litter your code with print statements (only to forget to remove one when you’ve resolved the error)?

Fortunately there is a simple method to pinpoint the line of source code that causes a runtime error in CUDA Fortran that involves no code modification, only recompilation with -g if using optimization of -O2 or higher. We describe this method in this Pro Tip.

Example CUDA Fortran Code with Runtime Error

First we need a CUDA Fortran code that generates a runtime error, such as the following.

module m
Contains
  attributes(global) subroutine increment(a, b)
    implicit none
    integer :: a(:), b(:)
    integer :: i, n
    
    i = blockDim%x*(blockIdx%x-1) + threadIdx%x
    n = size(a)
    if (i <= n) a(i) = a(i)+b(i)

  end subroutine increment
end module m

program main
  use cudafor
  use m
  implicit none
  integer, parameter :: n = 1024*1024
  integer, allocatable :: a(:), b(:)
  integer, device, allocatable :: a_d(:), b_d(:)
  integer :: tPB = 256
  
  allocate(a(n), b(n), a_d(n))
  a = 1
  b = 3
  
  a_d = a
  b_d = b
  call increment<<<ceiling(real(n)/tPB),tPB>>>(a_d, b_d)
  a = a_d
  
  if (any(a /= 4)) then
    write(*,*) '**** Program Failed ****'
  else
    write(*,*) 'Program Passed'
  endif
  deallocate(a, b, a_d)
end program main

A quick inspection of this simple code reveals that the allocatable device array b_d is not allocated before use, and sure enough the code compiles fine but generate a runtime error.

% pgf90  -fast -g unalloc.cuf
% ./a.out
0: copyin Memcpy (dev=0x(nil), host=0x0x7f7e8986e230, size=5678900328) FAILED: 11(invalid argument)

We can track down the source line for this error in two simple steps: (1) generate a backtrace and (2) translate an address from the backtrace to a line number in a source code file.

Generating a Backtrace

You can generate a backtrace for code compiled with the PGI compilers by setting the environment variable PGI_TERM to trace before running the code. Doing this with the example code results in the following output.

% export PGI_TERM='trace'
% ./a.out
0: copyin Memcpy (dev=0x(nil), host=0x0x7f7406519230, size=18446744066755845160) FAILED: 11(invalid argument)
  /opt/pgi/linux86-64/17.10/lib/libpgf90_rpm1.so(__fort_abortx+0x17) [0x7f7408b6ea87]
  /opt/pgi/linux86-64/17.10/lib/libpgf90.so(__fort_abort+0x5f) [0x7f7408f4d64f]
  /opt/pgi/linux86-64/17.10/lib/libcudafor.so(+0x5dd8e) [0x7f741109fd8e]
  /opt/pgi/linux86-64/17.10/lib/libcudafor.so(pgf90_dev_copyin+0x53) [0x7f74110a0012]
  ./a.out() [0x403b1a]
  ./a.out() [0x4036d4]
  /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7f74074d7830]
  ./a.out() [0x403589]

The line just below the line with the copyin call from the libcudafor.so library is the statement from the executable a.out that generated the runtime error, which has the address 0x403b1a.  Now all you need to do is convert this address to the corresponding location in the source code.

Converting an Address to Line Information

The utility addr2line can be used to convert the address from the backtrace to the corresponding source file and line number. Just provide the executable and the address as follows.

% addr2line -e a.out 0x403b1a
/home/gruetsch/./unalloc.cuf:30

Here you can see that line 30 in the file unalloc.cuf is indeed the host-to-device transfer using the unallocated device array.

Summary

There are a number of tools available for debugging CUDA Fortran code, including cuda-gdb and cuda-memcheck, as well as this technique for locating the source runtime errors.  Which tool you use depends on your particular circumstances. The technique in this Pro Tip is attractive because it is simple and requires no source code modification. I hope you find it useful in resolving runtime errors encounter in your CUDA Fortran projects.

2 Comments
  • Xinsheng (Shawn) Qin

    I found that if you turn on optimization with -fast flag, those detailed error outputs are gone. What I got are:

    0: copyin Memcpy (dev=0x(nil), host=0x0x7f80c2e8b230, size=18446744073709550692) FAILED: 11(invalid argument)
    Error: segmentation violation, address not mapped to object

    • Greg Ruetsch

      Hi Xinsheng:

      Thanks for catching this. You are correct, in fact when using any optimization of -O2 or higher you also need to compile with -g to get the traceback. I’ve modified the text (including the example) to reflect this.

      Thanks again,

      Greg