Benchmarking GPUDirect RDMA on Modern Server Platforms

NVIDIA GPUDirect RDMA is a technology which enables a direct path for data exchange between the GPU and third-party peer devices using standard features of PCI Express. Examples of third-party devices include network interfaces, video acquisition devices, storage adapters, and medical equipment. Enabled on Tesla and Quadro-class GPUs, GPUDirect RDMA relies on the ability of NVIDIA GPUs to expose portions of device memory on a PCI Express Base Address Register region (BAR. See this white paper for more technical details).

Both Open MPI and MVAPICH2 now support GPUDirect RDMA, exposed via CUDA-aware MPI. Since January 2014 the Mellanox Infiniband software stack has supported GPUDirect RDMA on Mellanox ConnectX-3 and Connect-IB devices. See this post on the Mellanox blog for a nice introduction to the topic.

This post is a detailed look at the performance obtainable with available hardware platforms. The main audience for this post is designers and users of GPU-accelerated clusters employing CUDA-aware MPI, and architects and designers of GPU-accelerated low-latency systems, such as in healthcare, aviation, and high-energy physics. It is also complementary to a recent post (Exploring the PCIe Bus Routes) by Cirrascale.

Though the details may change in future hardware, this post suggests expected levels of performance and gives useful hints for performance verification. Continue reading

How to Build a GPU-Accelerated Research Cluster

Some of the fastest computers in the world are cluster computers. A cluster is a computer system comprising two or more computers (“nodes”) connected with a high-speed network. Cluster computers can achieve higher availability, reliability, and scalability than is possible with an individual computer. With the increasing adoption of GPUs in high performance computing (HPC), NVIDIA GPUs are becoming part of some of the world’s most powerful supercomputers and clusters. The most recent top 500 list of the worlds fastest supercomputers included nearly 50 supercomputers powered by NVIDIA GPUs, and the current world’s fastest supercomputer, Oak Ridge National Labs TITAN, utilizes more than 18,000 NVIDIA Kepler GPUs.

In this post I will take you step by step through the process of designing, deploying, and managing a small research prototype GPU cluster for HPC. I will describe all the components needed for a GPU cluster as well as the complete cluster management software stack. The goal is to build a research prototype GPU cluster using all open source and free software and with minimal hardware cost.

I gave a talk on this topic at GTC 2013 (session S3516 – Building Your Own GPU Research Cluster Using Open Source Software Stack). The slides and a recording are available at that link so please check it out!

There are multiple motivating reason for building a GPU-based research cluster. Continue reading

Benchmarking CUDA-Aware MPI

I introduced CUDA-aware MPI in my last post, with an introduction to MPI and a description of the functionality and benefits of CUDA-aware MPI. In this post I will demonstrate the performance of MPI through both synthetic and realistic benchmarks.

Synthetic MPI Benchmark results

Since you now know why CUDA-aware MPI is more efficient from a theoretical perspective, let’s take a look at the results of MPI bandwidth and latency benchmarks. These benchmarks measure the run time for sending messages of increasing size from a buffer associated with MPI rank 0 to a buffer associated with MPI rank 1. Using MVAPICH2-1.9b I have measured the following bandwidths and latencies between two Tesla K20 GPUs installed in two nodes connected with FDR infiniband. I have included host-to-host MPI bandwidth results as a reference. The measured latencies for 1 byte messages are 19 microseconds for regular MPI, 18 microseconds for CUDA-aware MPI with GPUDirect accelerated communication with network and storage devices, and 1 microsecond for host-to-host communication. The peak bandwidths for the 3 cases are 6.19 GB/s for host-to-host transfers, 4.18 GB/s for device-to-device transfers with MVAPICH2-1.9b and GPUDirect, and 1.89 GB/s for device-to-device transfers with staging through host memory.

MPIbandwidth Continue reading

An Introduction to CUDA-Aware MPI

MPI, the Message Passing Interface, is a standard API for communicating data via messages between distributed processes that is commonly used in HPC to build applications that can scale to multi-node computer clusters. As such, MPI is fully compatible with CUDA, which is designed for parallel computing on a single computer or node. There are many reasons for wanting to combine the two parallel programming approaches of MPI and CUDA. A common reason is to enable solving problems with a data size too large to fit into the memory of a single GPU, or that would require an unreasonably long compute time on a single node. Another reason is to accelerate an existing MPI application with GPUs or to enable an existing single-node multi-GPU application to scale across multiple nodes. With CUDA-aware MPI these goals can be achieved easily and efficiently. In this post I will explain how CUDA-aware MPI works, why it is efficient, and how you can use it.

I will be presenting a talk on CUDA-Aware MPI at the GPU Technology Conference next Wednesday at 4:00 pm in room 230C, so come check it out!

A Very Brief Introduction to MPI

Before I explain what CUDA-aware MPI is all about, let’s quickly introduce MPI for readers who are not familiar with it. The processes involved in an MPI program have private address spaces, which allows an MPI program to run on a system with a distributed memory space, such as a cluster. The MPI standard defines a message-passing API which covers point-to-point messages as well as collective operations like reductions. The example below shows the source code of a very simple MPI program in C which sends the message “Hello, there” from process 0 to process 1. Note that in MPI a process is usually called a “rank”, as indicated by the call to MPI_Comm_rank() below.

#include <stdio.h>
#include <string.h>
#include <mpi.h>

int main(int argc, char *argv[])
    char message[20];
    int myrank, tag=99;
    MPI_Status status;

    /* Initialize the MPI library */
    MPI_Init(&argc, &argv);
    /* Determine unique id of the calling process of all processes participating
       in this MPI program. This id is usually called MPI rank. */
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    if (myrank == 0) {
        strcpy(message, "Hello, there");
        /* Send the message "Hello, there" from the process with rank 0 to the
           process with rank 1. */
        MPI_Send(message, strlen(message)+1, MPI_CHAR, 1, tag, MPI_COMM_WORLD);
    } else {
        /* Receive a message with a maximum length of 20 characters from process
           with rank 0. */
        MPI_Recv(message, 20, MPI_CHAR, 0, tag, MPI_COMM_WORLD, &status);
        printf("received %s\n", message);

    /* Finalize the MPI library to free resources acquired by it. */
    return 0;

This program can be compiled and linked with the compiler wrappers provided by the MPI implementation. Continue reading