
CUDA Pro Tip: Profiling MPI Applications

When I profile MPI+CUDA applications, performance issues sometimes occur only on certain MPI ranks. To fix them, I first need to identify the ranks that are affected. Before CUDA 6.5 this was hard to do, because the CUDA profiler shows only the PID of each process and leaves the developer to figure out the mapping from PIDs to MPI ranks. The mapping can be done manually, for example for OpenMPI via the command-line option --display-map, but this is tedious and error-prone. A solution for the command-line output of nvprof is described here: http://www.parallel-computing.pro/index.php/9-cuda/5-sorting-cuda-profiler-output-of-the-mpi-cuda-program . In this post I will describe how the new output file naming of nvprof, to be introduced with CUDA 6.5, can be used to conveniently analyze the performance of an MPI+CUDA application with nvprof and the NVIDIA Visual Profiler (nvvp).

Profiling MPI applications with nvprof and nvvp

Collecting data with nvprof

nvprof can dump a profile to a file that can later be imported into nvvp. To generate a profile for an MPI+CUDA application, I simply start nvprof with the MPI launcher. Up to CUDA 6, I used the string “%p” in the output file name; nvprof automatically replaces that string with the PID and so generates a separate file for each MPI process. With CUDA 6.5, the string “%q{ENV}” can also be used in the output file name; nvprof replaces it with the value of the environment variable ENV. This makes it possible to include the MPI rank in the output file name by using environment variables that the MPI launcher (mpirun or mpiexec) sets automatically. For example, OpenMPI sets OMPI_COMM_WORLD_RANK to the MPI rank of each launched process.

$ mpirun -np 2 nvprof -o simpleMPI.%q{OMPI_COMM_WORLD_RANK}.nvprof ./simpleMPI
Running on 2 nodes
==18811== NVPROF is profiling process 18811, command: ./simpleMPI
==18813== NVPROF is profiling process 18813, command: ./simpleMPI
Average of square roots is: 0.667279
PASSED
==18813== Generated result file: simpleMPI.1.nvprof
==18811== Generated result file: simpleMPI.0.nvprof
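
The variable to put inside %q{...} depends on the launcher. As a minimal sketch, the following C helper looks up the rank from a few environment variables that, to my knowledge, common launchers set. The helper name rank_from_env and the list itself are illustrative assumptions, not part of nvprof or of any MPI library, and the list is not exhaustive.

```c
#include <stdio.h>
#include <stdlib.h>

/* Environment variables that common launchers set to the global rank.
   Assumption: illustrative list, checked in this order; extend as needed. */
static const char *const rank_env_vars[] = {
    "OMPI_COMM_WORLD_RANK", /* Open MPI */
    "MV2_COMM_WORLD_RANK",  /* MVAPICH2 */
    "PMI_RANK",             /* MPICH and other Hydra/PMI based launchers */
    "SLURM_PROCID"          /* Slurm (srun) */
};

/* Hypothetical helper: return the rank found in the first variable that
   holds a non-negative integer, or -1 if none of them does. */
int rank_from_env(void)
{
    size_t count = sizeof(rank_env_vars) / sizeof(rank_env_vars[0]);
    for (size_t i = 0; i < count; ++i) {
        const char *value = getenv(rank_env_vars[i]);
        if (value != NULL && *value != '\0') {
            char *end = NULL;
            long rank = strtol(value, &end, 10);
            /* Accept the value only if the whole string was a number. */
            if (*end == '\0' && rank >= 0)
                return (int)rank;
        }
    }
    return -1;
}
```

Whichever of these variables your launcher sets is the one to use in the nvprof output file name. Inside the application, a helper like this also makes the rank available before MPI_Init has been called.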

Analyzing profiles with nvvp

The output files produced by nvprof can either be read by nvprof to analyze the profile one rank at a time (using --import-profile) or be imported into nvvp. Since CUDA 6 it’s possible to import multiple files into the same time line as described here. This significantly improves the usability of nvvp for MPI applications.

NVVP time line showing the GPU activity of two MPI processes.

Enhancing profiles with NVTX

The analysis process can be further improved by using NVTX to name the CPU threads and CUDA devices according to the MPI rank associated with them:

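/* Requires <stdio.h>, <pthread.h>, nvToolsExt.h and nvToolsExtCudaRt.h. */
char name[256];
snprintf( name, sizeof(name), "MPI Rank %d", rank );

/* Label the calling CPU thread and the CUDA device with the MPI rank. */
nvtxNameOsThread(pthread_self(), name);
/* Passing rank as the device index assumes that rank i uses CUDA device i. */
nvtxNameCudaDeviceA(rank, name);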
NVVP time line with named OS thread and CUDA device showing the GPU activity of two MPI processes.
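
The snippet above passes the MPI rank directly as the CUDA device index, which only matches when rank i really uses device i. As a hedged sketch of one way to handle other layouts, the two helpers below derive a device index from a node-local rank and build the label with a bounded write. The names device_for_local_rank and format_rank_label are hypothetical. I assume the node-local rank comes from the launcher (for example OMPI_COMM_WORLD_LOCAL_RANK with OpenMPI) and the device count from cudaGetDeviceCount. The CUDA and NVTX calls appear only in a comment so that the sketch compiles without the toolkit.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: choose a device index for this process from its
   node-local rank, wrapping around when ranks outnumber the GPUs. */
int device_for_local_rank(int local_rank, int device_count)
{
    assert(local_rank >= 0 && device_count > 0);
    return local_rank % device_count;
}

/* Hypothetical helper: build the label used for the thread and the device.
   Returns 0 on success and -1 if the label did not fit into buf. */
int format_rank_label(char *buf, size_t size, int rank)
{
    assert(buf != NULL && size > 0);
    int written = snprintf(buf, size, "MPI Rank %d", rank);
    return (written >= 0 && (size_t)written < size) ? 0 : -1;
}

/* Intended use inside the application (not compiled here):
 *
 *   int device = device_for_local_rank(local_rank, device_count);
 *   cudaSetDevice(device);
 *   format_rank_label(name, sizeof(name), rank);
 *   nvtxNameCudaDeviceA(device, name);
 */
```

The modulo mapping is only one possible policy. It keeps the device label consistent with the device the process actually selected, which is what matters for reading the time line.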

Instead of naming the CUDA devices it’s also possible to name the GPU context:

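/* Requires <stdio.h>, <pthread.h>, <cuda.h>, nvToolsExt.h and nvToolsExtCuda.h. */
char name[256];
snprintf( name, sizeof(name), "MPI Rank %d", rank );
nvtxNameOsThread(pthread_self(), name);

/* Look up the context that is current on this thread and label it. */
CUcontext ctx;
cuCtxGetCurrent( &ctx );
nvtxNameCuContextA( ctx, name );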
NVVP time line with named OS thread and CUDA context showing the GPU activity of two MPI processes.

To guarantee that cuCtxGetCurrent returns the right context, a CUDA Runtime call must be made between the calls to cudaSetDevice and cuCtxGetCurrent.

Other Tools

To collect application traces and analyze the performance of MPI applications, well-established and much more sophisticated tools such as Score-P, Vampir, or TAU exist. These tools use our profiling interface CUPTI to assess MPI+CUDA applications and also offer advanced support for detecting MPI- and CPU-related performance issues.

Conclusion

Following the above approach, many performance issues of MPI+CUDA applications can be identified with NVIDIA tools, and NVTX can be used to make these profiles easier to work with. Apart from the NVTX resource naming, everything described here works equally well with MPI+OpenACC applications.


About Jiri Kraus

Jiri Kraus is a developer in NVIDIA's European Developer Technology team. As a consultant for GPU HPC applications at the NVIDIA Jülich Applications Lab, Jiri collaborates with local developers and scientists at the Jülich Supercomputing Centre and the Forschungszentrum Jülich. Before joining NVIDIA Jiri worked on the parallelization and optimization of scientific and technical applications for clusters of multicore CPUs and GPUs at Fraunhofer SCAI in St. Augustin. He holds a Diploma in Mathematics from the University of Cologne, Germany.