How NVLink Will Enable Faster, Easier Multi-GPU Computing

Accelerated systems have become the new standard for high performance computing (HPC) as GPUs continue to raise the bar for both performance and energy efficiency. In 2012, Oak Ridge National Laboratory announced what was to become the world’s fastest supercomputer, Titan, equipped with one NVIDIA® GPU per CPU – over 18 thousand GPU accelerators. Titan established records not only in absolute system performance but also in energy efficiency, with 90% of its peak performance being delivered by the GPU accelerators. This week, the U.S. Department of Energy (DoE) announced an award to IBM and NVIDIA to build two new flagship supercomputers, the Summit system at Oak Ridge National Laboratory and the Sierra system at Lawrence Livermore National Laboratory.

A new NVIDIA white paper explores key features of these new supercomputers and the technologies enabled by the Tesla® accelerated computing platform that will drive the U.S. DoE’s push toward exascale. Here’s a description of Summit and Sierra from the white paper.

The Oak Ridge National Laboratory (ORNL) Summit system will be a leadership computing platform for the Office of Science. Delivered in 2017, Summit is expected to reach between 150 and 300 petaFLOPS and is positioned as a precursor to the U.S. DoE’s exascale system.

As the lead federal agency supporting fundamental scientific research across numerous domains, the Office of Science is chartered to meet the insatiable need for computing resources by researchers and scientists. Summit will carry on the tradition set by Titan, ORNL’s current GPU-accelerated supercomputer, which is among the world’s fastest supercomputers today.

The Lawrence Livermore National Laboratory (LLNL) Sierra supercomputer will be the NNSA’s primary system for the management and security of the nation’s nuclear weapons, nuclear nonproliferation, and counterterrorism programs. In support of the complex mission of its Advanced Simulation and Computing program, LLNL has had numerous top-5 supercomputers, including most recently Sequoia, an IBM Blue Gene/Q system. Sierra will replace Sequoia and is expected to deliver more than 100 petaFLOPS, over 5x higher compute performance than its predecessor.

The Importance of Heterogeneous Nodes

The architectural emphasis on parallelism in GPUs leads to optimization for throughput, hiding rather than minimizing latency. Support for thousands of threads ensures a ready pool of work in the face of data dependencies in order to sustain performance at a high percent of peak. The memory hierarchy design and technology thoroughly reflect optimization for throughput performance at minimal energy per bit.

By contrast, latency-optimized CPU architecture drives completely different design decisions. Techniques designed to compress the execution of a single instruction thread into the smallest possible time demand a host of architectural features (like branch prediction, speculative execution, register renaming) that would cost far too much energy to be replicated for thousands of parallel GPU threads but that are entirely appropriate for CPUs.

The essence of the heterogeneous computing model is that one size does not fit all. Parallel and serial segments of the workload execute on the best-suited processor—latency-optimized CPU or throughput-optimized GPU—delivering faster overall performance, greater efficiency, and lower energy and cost per unit of computation.

ORNL and LLNL chose to build the Summit and Sierra pre-exascale systems around this powerful heterogeneous compute model using technologies from IBM and NVIDIA. IBM’s POWER CPUs are among the world’s fastest serial processors. NVIDIA GPU accelerators are the most efficient general purpose throughput-oriented processors on the planet. Coupling them together produces a highly efficient and optimized heterogeneous node capable of minimizing both serial and parallel sections of HPC codes.

The Growing Multi-GPU Trend

Since Titan, a trend has emerged toward heterogeneous node configurations with larger ratios of GPU accelerators per CPU socket, with two or more GPUs per CPU becoming common as developers increasingly expose and leverage the available parallelism in their applications. Although each of the new DoE systems is unique, they share the same fundamental multi-GPU node architecture.

While multi-GPU applications provide a vehicle for scaling single node performance, they can be constrained by interconnect performance between the GPUs. Developers must overlap data transfers with computation or carefully orchestrate GPU accesses over the PCIe interconnect to maximize performance. However, as GPUs get faster and GPU-to-CPU ratios climb, a higher performance node integration interconnect is warranted. Enter NVLink.

NVLink: High-Speed GPU Interconnect

NVLink is an energy-efficient, high-bandwidth path between the GPU and the CPU at data rates of at least 80 gigabytes per second, or at least 5 times that of the current PCIe Gen3 x16, delivering faster application performance. NVLink is the node integration interconnect for both the Summit and Sierra pre-exascale supercomputers commissioned by the U.S. Department of Energy, enabling NVIDIA GPUs and CPUs such as IBM POWER to access each other’s memory quickly and seamlessly.  NVLink will first be available with the next-generation NVIDIA Pascal™ GPU in 2016.
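To put the quoted bandwidths in perspective, the short Python sketch below estimates how long it would take to move a buffer at each rate. The 16 GB/s figure for PCIe Gen3 x16 is an assumption implied by the "at least 5 times" ratio above rather than a measured value, the 4 GB buffer size is arbitrary, and the model ignores latency and protocol overhead.

```python
GB = 1e9  # bytes per gigabyte (decimal)

# Assumed value: implied by the "at least 5 times" ratio in the text.
PCIE_GEN3_X16_GBPS = 16.0
# Lower bound quoted in the text for NVLink.
NVLINK_GBPS = 80.0


def transfer_ms(num_bytes, bandwidth_gbps):
    """Idealized transfer time in milliseconds (no latency, no overhead)."""
    return num_bytes / (bandwidth_gbps * GB) * 1e3


buffer_bytes = 4 * GB  # arbitrary example size
pcie_ms = transfer_ms(buffer_bytes, PCIE_GEN3_X16_GBPS)
nvlink_ms = transfer_ms(buffer_bytes, NVLINK_GBPS)

print(f"PCIe Gen3 x16: {pcie_ms:.0f} ms")
print(f"NVLink:        {nvlink_ms:.0f} ms")
print(f"Ratio:         {pcie_ms / nvlink_ms:.1f}x")
```

Real transfers will be slower than this idealized estimate on both links, so treat the ratio, not the absolute times, as the point of the sketch.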

In addition to speeding CPU-to-GPU communications for systems with an NVLink CPU connection, NVLink can have a significant performance benefit for GPU-to-GPU (peer-to-peer) communications as well. A second new NVIDIA white paper focuses on these peer-to-peer benefits from NVLink, showing how systems with next-generation NVLink-interconnected GPUs are projected to deliver considerable application speedup compared to systems with GPUs interconnected via PCIe.

The white paper analyzes the performance benefit of NVLink for several algorithms and applications by comparing model systems based on PCIe-interconnected next-gen GPUs to otherwise-identical systems with NVLink-interconnected GPUs. GPUs are connected to the CPU using existing PCIe connections, but the NVLink configurations augment this with interconnections among the GPUs for peer-to-peer communication.

The paper examines five multi-GPU algorithms and applications important to HPC: exchange and sort, FFT, AMBER Molecular Dynamics (PMEMD), ANSYS Fluent Computational Fluid Dynamics (CFD), and QUDA Lattice Quantum Chromodynamics (LQCD). Projected results for exchange, sort, and 3D FFT are shown in Figures 1, 2, and 3, respectively.

Figure 1: Projected multi-GPU exchange performance in 2-GPU and 4-GPU configurations, comparing NVLink-based systems to PCIe-based systems.
Figure 2: Projected multi-GPU sorting performance in 2-GPU and 4-GPU configurations, comparing NVLink-based systems to PCIe-based systems.
Figure 3: Projected 3D FFT performance in 2-GPU configurations. NVLink-connected GPUs deliver over 2x speedup.

The white paper analyzes the performance considerations for each application and projects that NVLink will deliver a significant performance boost – up to 2x in many applications – simply by replacing the PCIe interconnect for communication among peer GPUs. This clearly illustrates the growing challenge NVLink addresses: as the GPU computation rate grows, GPU interconnect speeds must scale up accordingly in order to see the full benefit of the faster GPU.
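The scaling argument can be illustrated with a deliberately simple Python model. This is a sketch, not the white paper's methodology: it assumes each step consists of computation followed by non-overlapped peer-to-peer communication, and the 8 ms and 2 ms timings, like the function and parameter names, are hypothetical values chosen only for illustration.

```python
def step_time(t_compute, t_comm, gpu_speedup=1.0, link_speedup=1.0):
    """Time for one step when computation and communication do not overlap."""
    return t_compute / gpu_speedup + t_comm / link_speedup


# Hypothetical baseline: 8 ms of computation, 2 ms of GPU-to-GPU traffic per step.
base = step_time(8.0, 2.0)
faster_gpu = step_time(8.0, 2.0, gpu_speedup=2.0)
faster_both = step_time(8.0, 2.0, gpu_speedup=2.0, link_speedup=5.0)

print(f"2x GPU, same link: {base / faster_gpu:.2f}x overall")
print(f"2x GPU, 5x link:   {base / faster_both:.2f}x overall")
```

With these made-up numbers, doubling the GPU speed alone yields only about 1.67x overall because the communication time is unchanged; speeding up the link as well recovers the rest of the benefit.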

NVLink is a flexible and scalable interconnect technology, enabling a rich set of design options for next-generation servers to include multiple GPUs with a variety of interconnect topologies and bandwidths, as Figure 4 shows.

Figure 4: NVLink will enable flexible configuration of multiple GPU accelerators in next-generation servers.

Faster, Easier Programming

Figure 5: The combination of Unified Memory and NVLink will enable faster, easier data sharing between CPU and GPU code.

One important result of the higher bandwidth between GPUs provided by NVLink will be that libraries such as cuFFT and cuBLAS can offer much better multi-GPU scalability, both scaling to a greater number of GPUs and strong-scaling smaller problems where communication is a significant bottleneck today.

Unified Memory and NVLink represent a powerful combination for CUDA® programmers. Unified Memory provides you with a single pointer to data and automatic migration of that data between the CPU and GPU. With 80 GB/s or higher bandwidth on machines with NVLink-connected CPUs and GPUs, GPU kernels will be able to access data in host system memory at the same bandwidth the CPU has to that memory, much faster than over PCIe. Host and device portions of applications will be able to share data much more efficiently and cooperatively operate on shared data structures, and supporting larger problem sizes will be easier than ever.

While Summit and Sierra are based on POWER CPUs, CUDA enables common programming approaches across a variety of heterogeneous systems, with support for x86, ARM, and POWER. You can program all of these systems with GPU-accelerated libraries, OpenACC compiler directives, or CUDA-accelerated programming languages. So get started now to get your applications GPU-accelerated and ready to take advantage of NVLink.

For more details on the Summit and Sierra supercomputers and their projected performance, visit For more details on NVLink and projected application speedups, visit

  • Gaurabh Singh Tomar

    When will NVLink come to normal consumer level computers? Or is it too expensive for that segment?


    Are we talking the release of this along with Pascal sometime in 2016?

    I would love to hear more on who and what are going to be “behind” this particular hardware. This is something that warrants a possible revamp of the DeskStar!!

  • drmaddogs

    So, Pascal card, it looks to be compatible with Pci-e slots (8? 16x?) but limited about 25% compare to using it in a NVLink. To say then, MBoards need updating and while it is stated ‘servers’ will best be accommodated, I’ve no idea what chipset design will be necessary to handle a true NVLink for the Pascal Vid card.


    Consumer level…..?……? When….???

    • At this time, NVLink is used only for GPU-to-GPU communication, with the sole exception of IBM POWER processors. At the consumer level you are unlikely to buy multiple Pascal GPUs per host, and x86 machines don’t “speak” NVLink, so you won’t have access to NVLink.


        Som’bitch…… Thank you though….

  • anonomous72663

    HBM2 memory living on a graphics card should be accessible by the CPU and even unified. At least, it should be possible with the new cards in 2016, what with their estimated 32GB of HBM2 memory, to dispense with traditional motherboard DDR and allow the processor to directly access the graphics card memory as its own with all the advantages of the significantly increased bandwidth, where perhaps DDR4 / 5 mobo memory acts as a swap out before hitting the disk swap file.

  • Scott Harrell

    The Summit and Sierra supercomputer announcement indicates that Intel’s Xeon Phi chips are duds. The Department of Energy surely evaluated the Xeon Phi family and found it inferior to Nvidia’s chips.

    Intel’s Skylake-EP processors (expected H1 2017) are supposed to have an “Omni-Path Interconnect” called Storm Lake. According to some reports, this is a low latency interconnect fabric acquired from Cray in 2012. I guess Storm Lake is Intel’s version of NVLink.

    I wish Nvidia would develop a workstation CPU so I wouldn’t have to pay an arm and a leg for Xeon CPUs. The socket for Nvidia’s CPU could provide 80 GByte/sec NVLink connections to each of two GPUs. The pins of the GPU used for PCIe could be repurposed for NVLink when the GPU is used with Nvidia’s CPU. I would prefer one or two fast CPU cores (> 6 GHz) instead of 22+ medium speed CPU cores (2.5 GHz). This CPU would be ideal for gamers, engineers and scientists. The GPUs would do the heavy lifting and the fast CPU would combat Amdahl’s law.

    • Martin_2016

      Storm Lake is for communication between nodes, so it competes with Infiniband. NVLink is for communication within a node. It would be helpful to both Nvidia and AMD if AMD includes NVLink on their CPUs and GPUs. Otherwise, AMD will have to develop AMDLink and there will be no way to connect an Nvidia GPU to an x86 CPU with this type of interconnect.

  • John Alexander

    Now is a good time for disruption in the personal computer market because a lot of people are unhappy with Windows 10. I hope Nvidia has some of their software engineers contributing to Linux desktop projects because that will increase the market for Nvidia’s CPU. In addition to that, I bet there is a startup developing a new commercial operating system for desktops and notebooks. Nvidia should help fund them.

    A CPU with NVLink will be great. Nvidia should license the Mill CPU or acquire the company Mill Computing Inc. The Mill CPU is a truly impressive new CPU design. When only one thread is being used, Intel’s Haswell-E runs at its turbo frequency of 3.6 GHz. A good target for Nvidia’s CPU would be at least 1.5x or 2x of Intel’s single-thread SPECint performance.