CUDACasts Episode 18: CUDA 6.0 Unified Memory

CUDA 6 introduces Unified Memory, which dramatically simplifies memory management for GPU computing. Now you can focus on writing parallel kernels when porting code to the GPU, and memory management becomes an optimization.
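
To give a feel for what this looks like in practice, here is a minimal sketch (not the episode's exact code) of the Unified Memory pattern: a single cudaMallocManaged allocation is initialized on the host, processed by a kernel, and read back on the host, with no explicit cudaMemcpy calls.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main()
{
    const int n = 1 << 20;
    float *data;

    // One allocation, visible to both host and device -- no explicit cudaMemcpy.
    cudaMallocManaged(&data, n * sizeof(float));

    for (int i = 0; i < n; ++i) data[i] = 1.0f;      // initialize on the host

    scale<<<(n + 255) / 256, 256>>>(data, 2.0f, n);  // operate on the GPU
    cudaDeviceSynchronize();                         // wait before touching data on the host again

    printf("data[0] = %f\n", data[0]);               // prints 2.000000
    cudaFree(data);
    return 0;
}
```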

The CUDA 6 Release Candidate is now publicly available. In today’s CUDACast, I’ll walk through a few simple examples that show how easy it is to accelerate code on the GPU using Unified Memory in CUDA 6, and how powerful Unified Memory is for sharing C++ data structures between host and device code. If you’re interested in looking at the code in detail, you can find it in the Parallel Forall repository on GitHub. You can also check out the great Unified Memory post by Mark Harris.
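
As a taste of the C++ example, here is a hedged sketch of the kind of pattern the episode demonstrates: a base class whose operator new allocates managed memory, so an object constructed on the host (including a string it owns) can be dereferenced directly inside a kernel. The class and kernel names below are illustrative, not the repository's exact code.

```cuda
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

// Any class derived from Managed is allocated in Unified Memory when created with new.
class Managed {
public:
    void *operator new(std::size_t len) {
        void *ptr;
        cudaMallocManaged(&ptr, len);
        cudaDeviceSynchronize();
        return ptr;
    }
    void operator delete(void *ptr) {
        cudaDeviceSynchronize();
        cudaFree(ptr);
    }
};

// A tiny string class whose character storage also lives in managed memory.
class String : public Managed {
    char *data;
public:
    String(const char *s) {
        cudaMallocManaged(&data, strlen(s) + 1);
        strcpy(data, s);
    }
    ~String() { cudaFree(data); }
    __host__ __device__ const char *c_str() const { return data; }
};

struct DataElement : public Managed {
    String name;
    int value;
    DataElement(const char *n, int v) : name(n), value(v) {}
};

__global__ void Kernel(DataElement *elem) {
    // The same pointer the host created is dereferenced here, string and all.
    printf("On device: name=%s, value=%d\n", elem->name.c_str(), elem->value);
    elem->value++;
}

int main() {
    DataElement *e = new DataElement("hello_unified_memory", 10);  // managed allocation
    Kernel<<<1, 1>>>(e);                                           // pass the object directly
    cudaDeviceSynchronize();                                       // required before host access
    printf("On host:   name=%s, value=%d\n", e->name.c_str(), e->value);
    delete e;
    return 0;
}
```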

To suggest a topic for a future episode of CUDACasts, or if you have any other feedback, please use the contact form or leave a comment below.


About Mark Ebersole

As CUDA Educator at NVIDIA, Mark Ebersole teaches developers and programmers about the NVIDIA CUDA parallel computing platform and programming model, and the benefits of GPU computing. With more than ten years of experience as a low-level systems programmer, Mark has spent much of his time at NVIDIA as a GPU systems diagnostics programmer, a role in which he developed a tool to test, debug, validate, and verify GPUs from pre-emulation through bringup and into production. Before joining NVIDIA, he worked for IBM developing Linux drivers for the IBM iSeries server. Mark holds a BS degree in math and computer science from St. Cloud State University. Follow @cudahamster on Twitter.

  • Amir

    Hi Mark, I read somewhere that Maxwell GPUs can directly access system main memory, but I couldn’t find how this access is performed or any benchmarks for it. Do you know of any documentation about it?

    • Mark Harris

      Fermi, Kepler, and Maxwell GPUs can all access host memory directly via what is known as “Zero Copy”. Zero copy maps a host pointer into the device address space, and the device then accesses that memory over PCIe, so its performance is always limited by PCIe throughput. This is different from Unified Memory, which is available on Kepler and later GPUs. There is a bit of discussion in my post on Unified Memory. You may also want to look at the “Simple Zero-Copy” sample included with the CUDA Toolkit package, and the documentation of page-locked host memory and mapped memory here:
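
      For reference, here is a minimal zero-copy sketch (not code from the post), assuming a device that reports canMapHostMemory: the host allocation is pinned and mapped, and the kernel reads and writes it directly over PCIe.

      ```cuda
      #include <cstdio>
      #include <cuda_runtime.h>

      __global__ void increment(int *data, int n)
      {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < n) data[i] += 1;
      }

      int main()
      {
          const int n = 256;
          int *h_data, *d_ptr;

          // Allow mapped pinned allocations; must precede other CUDA calls.
          cudaSetDeviceFlags(cudaDeviceMapHost);

          // Pinned, mapped host allocation.
          cudaHostAlloc(&h_data, n * sizeof(int), cudaHostAllocMapped);
          for (int i = 0; i < n; ++i) h_data[i] = i;

          // Device-side alias of the same physical host memory.
          cudaHostGetDevicePointer(&d_ptr, h_data, 0);

          increment<<<1, n>>>(d_ptr, n);   // the GPU reads/writes host memory over PCIe
          cudaDeviceSynchronize();

          printf("h_data[0] = %d\n", h_data[0]);   // prints 1, written by the GPU
          cudaFreeHost(h_data);
          return 0;
      }
      ```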