CUDA 6

Unified Memory in CUDA 6

With CUDA 6, we’re introducing one of the most dramatic programming model improvements in the history of the CUDA platform: Unified Memory. In a typical PC or cluster node today, the memories of the CPU and GPU are physically distinct and separated by the PCI-Express bus. Before CUDA 6, that is exactly how the programmer had to view things. Data that is shared between the CPU and GPU must be allocated in both memories and explicitly copied between them by the program. This adds a lot of complexity to CUDA programs.

Unified Memory creates a pool of managed memory that is shared between the CPU and GPU, bridging the CPU-GPU divide. Managed memory is accessible to both the CPU and GPU using a single pointer. The key is that the system automatically migrates data allocated in Unified Memory between host and device so that it looks like CPU memory to code running on the CPU, and like GPU memory to code running on the GPU.

In this post I’ll show you how Unified Memory dramatically simplifies memory management in GPU-accelerated applications. Consider a really simple example: two versions of a program that each load a file from disk, sort the bytes in it, and then use the sorted data on the CPU before freeing the memory. One version does everything on the CPU; the other runs the sort on the GPU using CUDA and Unified Memory. The only differences are that the GPU version launches a kernel (and synchronizes after launching it), and allocates space for the loaded file in Unified Memory using the new API cudaMallocManaged().
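Here is a minimal sketch of what the GPU version can look like; the single-threaded sort_bytes kernel is just a hypothetical stand-in for a real parallel sort, and “using” the data is reduced to printing a few bytes.

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for a real parallel sort: one GPU thread
// insertion-sorts the bytes in place.
__global__ void sort_bytes(char *data, int n) {
  for (int i = 1; i < n; ++i) {
    char key = data[i];
    int j = i - 1;
    while (j >= 0 && data[j] > key) {
      data[j + 1] = data[j];
      --j;
    }
    data[j + 1] = key;
  }
}

void sortfile(FILE *fp, int n) {
  char *data;
  cudaMallocManaged(&data, n);    // one allocation, one pointer, visible to CPU and GPU

  fread(data, 1, n, fp);          // the CPU reads the file directly into managed memory

  sort_bytes<<<1, 1>>>(data, n);  // the GPU kernel operates on the same pointer
  cudaDeviceSynchronize();        // wait for the GPU before the CPU touches the data

  for (int i = 0; i < 10 && i < n; ++i)
    printf("%d\n", data[i]);      // use the sorted data on the CPU

  cudaFree(data);
}

The CPU-only version is identical in shape: it allocates with malloc, sorts with a host function, uses the data, and calls free, with no synchronization needed.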

If you have programmed CUDA C/C++ before, you will no doubt be struck by the simplicity of the GPU version. Notice that we allocate memory once, and we have a single pointer to the data that is accessible from both the host and the device. We can read directly into the allocation from a file, and then we can pass the pointer directly to a CUDA kernel that runs on the device. Then, after waiting for the kernel to finish, we can access the data again from the CPU. The CUDA runtime hides all the complexity, automatically migrating data to the place where it is accessed.

What Unified Memory Delivers

There are two main ways that programmers benefit from Unified Memory.

Simpler Programming and Memory Model

Unified Memory lowers the bar of entry to parallel programming on the CUDA platform, by making device memory management an optimization, rather than a requirement. With Unified Memory, now programmers can get straight to developing parallel CUDA kernels without getting bogged down in details of allocating and copying device memory. This will make both learning to program for the CUDA platform and porting existing code to the GPU simpler. But it’s not just for beginners. My examples later in this post show how Unified Memory also makes complex data structures much easier to use with device code, and how powerful it is when combined with C++.

Performance Through Data Locality

By migrating data on demand between the CPU and GPU, Unified Memory can offer the performance of local data on the GPU, while providing the ease of use of globally shared data. The complexity of this functionality is kept under the covers of the CUDA driver and runtime, ensuring that application code is simpler to write. The point of migration is to achieve full bandwidth from each processor; the 250 GB/s of GDDR5 memory is vital to feeding the compute throughput of a Kepler GPU.

An important point is that a carefully tuned CUDA program that uses streams and cudaMemcpyAsync to efficiently overlap execution with data transfers may very well perform better than a CUDA program that only uses Unified Memory. Understandably so: the CUDA runtime never has as much information as the programmer does about where data is needed and when! CUDA programmers still have access to explicit device memory allocation and asynchronous memory copies to optimize data management and CPU-GPU concurrency. Unified Memory is first and foremost a productivity feature that provides a smoother on-ramp to parallel computing, without taking away any of CUDA’s features for power users.
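For example, a hand-tuned pipeline might look something like the following sketch, which stages data through pinned host memory and overlaps copies with kernel execution across streams; the process kernel and the sizes here are purely illustrative.

#include <cuda_runtime.h>

// Placeholder work so the pipeline has something to overlap with.
__global__ void process(float *data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= 2.0f;
}

int main() {
  const int nStreams = 4, chunkElems = 1 << 20;
  const size_t chunkBytes = chunkElems * sizeof(float);

  float *h_buf, *d_buf;
  cudaMallocHost(&h_buf, nStreams * chunkBytes);   // pinned host memory, required for async copies
  cudaMalloc(&d_buf, nStreams * chunkBytes);

  cudaStream_t streams[nStreams];
  for (int i = 0; i < nStreams; ++i) cudaStreamCreate(&streams[i]);

  // Copies for one chunk overlap with the kernel working on another chunk.
  for (int i = 0; i < nStreams; ++i) {
    float *h = h_buf + i * chunkElems;
    float *d = d_buf + i * chunkElems;
    cudaMemcpyAsync(d, h, chunkBytes, cudaMemcpyHostToDevice, streams[i]);
    process<<<(chunkElems + 255) / 256, 256, 0, streams[i]>>>(d, chunkElems);
    cudaMemcpyAsync(h, d, chunkBytes, cudaMemcpyDeviceToHost, streams[i]);
  }
  cudaDeviceSynchronize();

  for (int i = 0; i < nStreams; ++i) cudaStreamDestroy(streams[i]);
  cudaFreeHost(h_buf);
  cudaFree(d_buf);
  return 0;
}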

Unified Memory or Unified Virtual Addressing?

CUDA has supported Unified Virtual Addressing (UVA) since CUDA 4, and while Unified Memory depends on UVA, they are not the same thing. UVA provides a single virtual memory address space for all memory in the system, and enables pointers to be accessed from GPU code no matter where in the system they reside, whether it’s device memory (on the same or a different GPU), host memory, or on-chip shared memory. It also allows cudaMemcpy to be used without specifying where exactly the input and output parameters reside. UVA enables “Zero-Copy” memory, which is pinned host memory accessible by device code directly, over PCI-Express, without a memcpy. Zero-Copy provides some of the convenience of Unified Memory, but none of the performance, because it is always accessed with PCI-Express’s low bandwidth and high latency.
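To make the distinction concrete, here is a small sketch, assuming a 64-bit system where UVA is enabled so the pinned host pointer can be passed straight to a kernel. Both allocations are used through a single pointer on host and device, but the Zero-Copy allocation is always accessed over PCI-Express, while the managed allocation lives in device memory while the GPU is using it.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void increment(int *p) { *p += 1; }

int main() {
  cudaSetDeviceFlags(cudaDeviceMapHost);  // allow mapped (zero-copy) host allocations

  // Zero-Copy: pinned host memory that the GPU reads and writes over PCI-Express.
  int *zc;
  cudaHostAlloc((void **)&zc, sizeof(int), cudaHostAllocMapped);
  *zc = 41;
  increment<<<1, 1>>>(zc);          // with UVA the host pointer can be passed directly
  cudaDeviceSynchronize();
  printf("zero-copy: %d\n", *zc);   // prints 42

  // Unified Memory: the same single-pointer usage, but the data migrates to the GPU.
  int *um;
  cudaMallocManaged(&um, sizeof(int));
  *um = 41;
  increment<<<1, 1>>>(um);
  cudaDeviceSynchronize();
  printf("managed:   %d\n", *um);   // prints 42

  cudaFreeHost(zc);
  cudaFree(um);
  return 0;
}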

UVA does not automatically migrate data from one physical location to another, like Unified Memory does. Because Unified Memory is able to automatically migrate data at the level of individual pages between host and device memory, it required significant engineering to build, since it requires new functionality in the CUDA runtime, the device driver, and even in the OS kernel. The following examples aim to give you a taste of what this enables.

Example: Eliminate Deep Copies

A key benefit of Unified Memory is simplifying the heterogeneous computing memory model by eliminating the need for deep copies when accessing structured data in GPU kernels. Passing data structures containing pointers from the CPU to the GPU requires doing a “deep copy”, as the following example shows.

Take for example the following struct dataElem.

struct dataElem {
  int prop1;
  int prop2;
  char *name;
};

To use this structure on the device, we have to copy the struct itself with its data members, then copy all the data that the struct points to, and then update all the pointers in the copy of the struct. This results in the following complex code, just to pass a data element to a kernel function.

void launch(dataElem *elem) {
  dataElem *d_elem;
  char *d_name;

  int namelen = strlen(elem->name) + 1;

  // Allocate storage for struct and name
  cudaMalloc(&d_elem, sizeof(dataElem));
  cudaMalloc(&d_name, namelen);

  // Copy up each piece separately, including new “name” pointer value
  cudaMemcpy(d_elem, elem, sizeof(dataElem), cudaMemcpyHostToDevice);
  cudaMemcpy(d_name, elem->name, namelen, cudaMemcpyHostToDevice);
  cudaMemcpy(&(d_elem->name), &d_name, sizeof(char*), cudaMemcpyHostToDevice);

  // Finally we can launch our kernel, but CPU & GPU use different copies of “elem”
  Kernel<<< ... >>>(d_elem);
}

As you can imagine, the extra host-side code required to share complex data structures between CPU and GPU code has a significant impact on productivity. Allocating our dataElem structure in Unified Memory eliminates all the excess setup code, leaving us with just the kernel launch, which operates on the same pointer as the host code. That’s a big improvement!

void launch(dataElem *elem) {
  kernel<<< ... >>>(elem);
}
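This works because elem and the string it points to were allocated in Unified Memory. A sketch of what that allocation might look like (the make_elem helper is an illustration, not part of the original example):

dataElem *make_elem(const char *name) {
  dataElem *elem;
  cudaMallocManaged(&elem, sizeof(dataElem));        // the struct itself
  cudaMallocManaged(&elem->name, strlen(name) + 1);  // the data it points to
  strcpy(elem->name, name);
  elem->prop1 = 0;
  elem->prop2 = 0;
  return elem;  // a single pointer, valid on both the CPU and the GPU
}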

But this is not just a big improvement in the complexity of your code. Unified Memory also makes it possible to do things that were just unthinkable before. Let’s look at another example.

Example: CPU/GPU Shared Linked Lists

Linked lists are a very common data structure, but because they are essentially nested data structures made up of pointers, passing them between memory spaces is very complex. Without Unified Memory, sharing a linked list between the CPU and the GPU is unmanageable. The only option is to allocate the list in Zero-Copy memory (pinned host memory), which means that GPU accesses are limited to PCI-Express performance. By allocating linked list data in Unified Memory, device code can follow pointers normally on the GPU with the full performance of device memory. The program can maintain a single linked list, and list elements can be added and removed from either the host or the device.
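Here is a minimal sketch of the idea; the node layout, push_front helper, and sum_list kernel are illustrative, not taken from the original post.

// A singly linked list whose nodes live in Unified Memory.
struct ListNode {
  int value;
  ListNode *next;
};

// The host builds the list with ordinary pointer manipulation.
ListNode *push_front(ListNode *head, int value) {
  ListNode *node;
  cudaMallocManaged(&node, sizeof(ListNode));
  node->value = value;
  node->next = head;
  return node;
}

// The device follows the very same pointers, at device memory speed.
__global__ void sum_list(const ListNode *head, int *result) {
  int sum = 0;
  for (const ListNode *n = head; n != NULL; n = n->next)
    sum += n->value;
  *result = sum;
}

As long as result is also a managed allocation and the host synchronizes before reading it, the same list can be traversed or modified again on the CPU after the kernel completes.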

Porting code with existing complex data structures to the GPU used to be a daunting exercise, but Unified Memory makes this so much easier. I expect Unified Memory to bring a huge productivity boost to CUDA programmers.

Unified Memory with C++

Unified memory really shines with C++ data structures. C++ simplifies the deep copy problem by using classes with copy constructors. A copy constructor is a function that knows how to create an object of a class, allocate space for its members, and copy their values from another object. C++ also allows the new and delete memory management operators to be overloaded. This means that we can create a base class, which we’ll call Managed, which uses cudaMallocManaged() inside the overloaded new operator, as in the following code.

class Managed {
public:
  void *operator new(size_t len) {
    void *ptr;
    cudaMallocManaged(&ptr, len);
    return ptr;
  }

  void operator delete(void *ptr) {
    cudaFree(ptr);
  }
};

We can then have our String class inherit from the Managed class, and implement a copy constructor that allocates Unified Memory for a copied string.

// Deriving from “Managed” allows pass-by-reference
class String : public Managed {
  int length;
  char *data;

public:
  // Unified memory copy constructor allows pass-by-value
  String (const String &s) {
    length = s.length;
    cudaMallocManaged(&data, length);
    memcpy(data, s.data, length);
  }

  // ...
};

Likewise, we make our dataElem class inherit Managed.

// Note “managed” on this class, too.
// C++ now handles our deep copies
class dataElem : public Managed {
public:
  int prop1;
  int prop2;
  String name;
};

With these changes, the C++ classes allocate their storage in Unified Memory, and deep copies are handled automatically. We can allocate a dataElem in Unified Memory just like any C++ object.

dataElem *data = new dataElem;

Note that you need to make sure that every class in the tree inherits from Managed, otherwise you have a hole in your memory map. In effect, everything that you might need to share between the CPU and GPU should inherit Managed. You could overload new and delete globally if you prefer to simply use Unified Memory for everything, but this only makes sense if you have no CPU-only data, because otherwise data will migrate unnecessarily.
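A global overload of the approach just described would look something like this sketch (error handling omitted):

// Route every C++ heap allocation through Unified Memory.
// Only sensible if essentially all heap data is shared with the GPU.
void *operator new(size_t len) {
  void *ptr;
  cudaMallocManaged(&ptr, len);
  return ptr;
}

void operator delete(void *ptr) {
  cudaFree(ptr);
}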

Now we have a choice when we pass an object to a kernel function; as is normal in C++, we can pass by value or pass by reference, as shown in the following example code.

// Pass-by-reference version
__global__ void kernel_by_ref(dataElem &data) { ... }

// Pass-by-value version
__global__ void kernel_by_val(dataElem data) { ... }

int main(void) {
  dataElem *data = new dataElem;
  ...
  // pass data to kernel by reference
  kernel_by_ref<<< ... >>>(*data);

  // pass data to kernel by value -- this will create a copy
  kernel_by_val<<< ... >>>(*data);
}

Thanks to Unified Memory, the deep copies, pass by value and pass by reference all just work. This provides tremendous value in running C++ code on the GPU.

A Bright Future for Unified Memory

One of the most exciting things about Unified Memory in CUDA 6 is that it is just the beginning.  We have a long roadmap of improvements and features planned around Unified Memory. Our first release is aimed at making CUDA programming easier, especially for beginners. Starting with CUDA 6, cudaMemcpy() is no longer a requirement. By using cudaMallocManaged(), you have a single pointer to data, and you can share complex C/C++ data structures between the CPU and GPU. This makes it much easier to write CUDA programs, because you can go straight to writing kernels, rather than writing a lot of data management code and maintaining duplicate host and device copies of all data. You are still free to use cudaMemcpy() (and particularly cudaMemcpyAsync()) for performance, but rather than a requirement, it is now an optimization.

Future releases of CUDA are likely to increase the performance of applications that use Unified Memory, by adding data prefetching and migration hints. We will also be adding support for more operating systems. Our next-generation GPU architecture will bring a number of hardware improvements to further increase performance and flexibility.

Find Out More

In CUDA 6, Unified Memory is supported starting with the Kepler GPU architecture (Compute Capability 3.0 or higher), on 64-bit Windows 7, 8, and Linux operating systems (Kernel 2.6.18+). To get early access to Unified Memory in CUDA 6, become a CUDA Registered Developer to receive notification when the CUDA 6 Toolkit Release Candidate is available. If you are attending Supercomputing 2013 in Denver this week, come to the NVIDIA Booth #613 and check out the GPU Technology Theatre to see one of my presentations about CUDA 6 and Unified Memory (Tuesday at 1:00 pm MTN, Wednesday at 11:30 am, or Thursday at 1:30 pm. Schedule here).

About Mark Harris

Mark is Chief Technologist for GPU Computing Software at NVIDIA. Mark has fifteen years of experience developing software for GPUs, ranging from graphics and games, to physically-based simulation, to parallel algorithms and high-performance computing. Mark has been using GPUs for general-purpose computing since before they even supported floating point arithmetic. While a Ph.D. student at UNC he recognized this nascent trend and coined a name for it: GPGPU (General-Purpose computing on Graphics Processing Units), and started GPGPU.org to provide a forum for those working in the field to share and discuss their work. Follow @harrism on Twitter
  • http://www.theoreticalchemist.com Fortran

    As someone who works with CUDA Fortran, I am hoping the day comes soon when NVIDIA/PGI Fortran also includes a similar functionality. I’d really like to get rid of all those freaking cudaMemcpy’s in my code!

  • http://www.markmark.net/ Mark Harris

    We will be rolling out support for additional languages and platforms in future releases of CUDA (and CUDA Fortran).

  • terry spitz

    great functionality, definitely a move in the right direction for allowing porting existing code rather than rewriting. can we expect virtual function table rewiring for true C++ object copying to device? any support for STL on device (start with vector, shared_ptr) – even just read-only?

    • http://www.markmark.net/ Mark Harris

      The problem with virtual function tables is that AFAIK the C++ standard does not specify the format/layout/implementation of vftables. This makes it nearly impossible to support calling virtual functions on shared objects across all C++ host compilers supported by CUDA / nvcc. As for STL, that is something that we intend to look at, but nothing I can share here yet.

  • kl

    Very nice from CUDA 6. Really eager to get started with this.

    • http://www.markmark.net/ Mark Harris

      On current hardware, the latter — in a PC today the GPU and CPU memories are physically discrete. But the same programming model could be used for physically unified memory systems, such as Tegra SOCs.

  • AMRBlack

    Eager to get started with this version. And if I have a var like this “int *raw_ptr” with NxN size, can I have another var such as “int **ptrs” to point to the data of raw_ptr, ie “*ptrs[0]=raw_ptr[0];*ptrs[1]=raw_ptr[N-1]; ” ? Thanks a lot

    • http://www.markmark.net/ Mark Harris

      It’s all just memory, so yes. I didn’t mention in the above post, but there is also a `__managed__` declaration specifier, which allows you to declare global managed device pointers.

  • Adam MacDonald

    I’ve written a system for abstracting memory copies into my API, so the user can just use his data on the CPU and GPU seamlessly, using a checksum internally to determine if anything has changed and only transferring as late as necessary. Every part of the API is made more complex because of this. I’m really looking forward to just deleting all of that logic.

    • Det

      Yeah, but it’s still only supported from Kepler onwards.

      • mpeniak

        which is great, I wouldn’t want to use old cards seeing what the new post-fermi hw can do

  • Vincent Roberge

    Will CUDA Unified Memory be supported on GTX cards in Windows 7 and 8 or will it be limited to Tesla cards (due to requirement for TCC driver)? I am really looking forward to using Unified Memory in my CUDA applications, but do not want to limit my clients to using Tesla cards only.

    • http://www.markmark.net/ Mark Harris

      Unified Memory will be supported on Compute Capability 3.0 and later (sm_30 – so K10 or GTX 680 or later), on 64-bit Linux, Windows 7 and Windows 8 in CUDA 6.0. Support for other operating systems will come at a later date.

      • Peng Wang

        How about GTX 650 Ti or GT 650M? They are also listed as Compute Capability 3.0 and the Kepler architecture.

        • http://www.markmark.net/ Mark Harris

          Yes and Yes.

  • Sung shine Bang

    Does unified memory support overlapping execution with data transfers on the default stream? Or do I still need to split the operations with cudaMemcpyAsync and put them in separate streams for overlapping?

    • http://www.markmark.net/ Mark Harris

      You can always use cudaMemcpyAsync to explicitly copy data and overlap it with kernels in other streams. Unified Memory does not take away your ability to optimize.

      In CUDA 6, pages from managed allocations that were touched on the CPU are migrated back to the GPU just before any kernel launch — so there is no overlap with that kernel. However you can still get overlap between multiple kernels in separate streams.

      Also, not discussed in this post, is an API in CUDA 6 that allows you to attach a managed allocation to a specific stream, so that you can control which allocations are synchronized on specific kernel launches, and increase concurrency.

      Future CUDA releases will add more optimizations, such as prefetching.

  • Eugeny Kuznetsov

    What about constant memory? I would like to be able to allocate it, for example, like this: int* pint = CudaConstMalloc();

    And free it like that: CudaConstFree(pint);

    • http://www.markmark.net/ Mark Harris

      Unfortunately due to the implementation of constant banks in the hardware this is not possible at this time.

  • Alexander Agathos

    It’s a very nice article. A small note on C++: this is a has-a class and not an is-a class… so there is no need for inheritance. ;-)

    • http://www.markmark.net/ Mark Harris

      We want the class to satisfy “is a Managed class”, so I believe inheritance of Managed is warranted in this case. If you disagree, can you provide an example of how this would work with a has-a implementation?

      • Alexander Agathos

        Oh, so deeply sorry about this, I was very absent-minded; now I see it. Yes, of course, it is has-a String and is-a Managed. Apologies, it was too late at night. :-)

        • Alexander Agathos

          And I must say you have demonstrated how well you can instantiate classes as CUDA 6.0 managed objects. Thanks for putting this example together. :-)

  • visionctrl

    Hi Mark: Thanks for the exciting introduction to this important new feature. We are wondering if it is possible to pass the FPGA PCIe BAR address (we developed an FPGA PCIe board and GFDMA technology for DMA transfer between FPGA and GPU) to the GPU so that it can deep copy data from the FPGA to the GPU? Thank you!

    VisionCtrl Technology Co. , Ltd.

    • http://www.markmark.net/ Mark Harris

      This is not something that is possible with our current GPU architecture. Stay tuned.

      • Chad Augustine

        We have been DMA’ing data directly into the GPU’s memory since the Fermi devices. These are P2P transfers. What am I missing?

        • http://www.markmark.net/ Mark Harris

          Not sure I understand the question, but your question implies you haven’t read the blog. :) If so, you are missing a lot.

          • Chad Augustine

            @visionctrl:disqus wants to DMA data from GPU’s SDRAM via PCIe and into an FPGA and you mentioned that this is not possible.

            My point was that we have been doing the reverse of this since the Fermi days w/ CUDA 4.0 and UVA. We push data directly into the GPU’s SDRAM. Peer-to-Peer gives this ability, actually in both directions.

            The key is latency. By moving data directly into the GPU’s SDRAM, processing and then displaying it, we can completely bypass the CPU’s SDRAM.

            Agreed, Unified Memory makes things simpler from a programming point of view when you need both CPU and GPU memory.

            What @visionctrl wants to do should be a lot easier, eh? Just need to run a CUDA copy with the PCIe address of the FPGA PCIe BAR. The DMA engine within the GPU doesn’t know if this address is SDRAM or a FIFO implemented in an FPGA. It’s just PCIe.

            Same concept as P2P with two GPUs sharing data between each other. Except in this case, it’s between a GPU and an FPGA…

            Thoughts?

  • visionctrl

    Is there any update about GPUDirect RDMA technology in CUDA 6? We are looking for a similar solution for Windows. Thank you!

    • http://www.markmark.net/ Mark Harris

      You can see details about what’s new for GPUDirect RDMA in my SC13 talk on CUDA 6: http://bit.ly/1du71fi GPUDirect RDMA is not yet available on Windows.

  • Anderson

    Good evening,

    I wonder if there is a scheduled release date for CUDA 6 for registered developers yet. I’ve seen that my graphics card is compatible (GT 640M).

    • http://www.markmark.net/ Mark Harris

      The CUDA 6 Release Candidate is available to registered developers now!

  • Rob Farber

    Love it! Four questions: (1) It sounds like I can potentially get full GPU memory bandwidth when operating locally within a single page (meaning the page is cached on the GPU) – correct? (2) Is only one page cached at a time on the GPU or are multiple pages cached (perhaps there is an LRU mechanism)? (3) In programming for performance, what is a reasonable minimum page size to assume and how many pages can I assume will be cached at a time on the GPU (i.e. 1, 10, 100, …). (4) I really like mapped memory but from a performance point of view it was of limited use because accessing outside a single cache line caused a PCI transfer and performance would plummet. Have we substituted accessing a single page rather than a single cache line in that performance limitation?

    • http://www.markmark.net/ Mark Harris

      (1) Yes, but pages are just the granularity at which dirtiness is tracked. (2) Absolutely not — don’t think of unified memory as a “page cache”. You have access to the entire GPU memory (several GBs), not just a few pages in a cache! (3) The default page size is the same as the OS page size today. In the future, we may expose control over this. (4) PCI-express performance is unchanged by Unified Memory. Unified Memory is preferable to mapped host memory because when the data is in device memory (which is the default location for cudaMallocManaged), GPU threads access it at the same performance as any device memory. And when it’s in host memory, CPU threads access it at the same performance as any host memory. The thing unified memory is doing is copying only the pages that the CPU (GPU) touches back to the host (device), automatically. On current hardware, coherence is at kernel launch and device sync only. I think one of the biggest benefits is the complex data structure / C++ data sharing this enables between host and device. Don’t get hung up on pages.

  • Chunyang Ma

    I believe the revolution of Unified Memory will promote CPU and GPU unification in the future! As the author says, I was completely struck when I saw this unification!

  • Frank Winter

    You say: ‘Don’t think about unified memory as a page cache on the device.’ I think I have to disagree here. This is exactly what you should think if you’re referring to how things work under the hood; otherwise I would be very surprised. Let me make the point. The virtual address space available to the CPU is much bigger than the physical memory on the GPU. (Let’s forget for a moment that all address spaces are mapped into one UVA space.) Let’s make an example: suppose we are talking about a K20 with 6 GB and a total of 48 GB of CPU memory. Let’s further assume one manage-allocates 10 chunks of 1 GB each. First question here: you say the default memory location is the GPU. What happens when allocating the 7th chunk? Does the first chunk get copied to CPU memory? Do we have an ‘out of memory’ error? Or is there really a ‘first-touch allocation’ mechanism at work?

    Okay, let’s say the allocation of 10 chunks was successful. Now, suppose the user launches 10 kernels sequentially, each using a different memory chunk: the first kernel uses chunk 1, the second kernel uses chunk 2, etc. I understand that Unified Memory leaves the memory on the device after a kernel launch. Thus, before launching the 7th kernel we have 6 chunks used by the previous kernels lying around in GPU memory. The 7th chunk cannot be copied by the manager to the GPU due to insufficient available memory. There must be some ‘spilling algorithm’ at work which decides which chunk to copy to the CPU in order to free memory for the 7th chunk. LRU comes to mind.

    Can you tell us whether there is a caching mechanism at work or whether Unified Memory is limited to the GPU memory size?

    • http://www.markmark.net/ Mark Harris

      On current hardware, Unified Memory does not allow oversubscribing GPU memory. You are limited on managed allocations to the amount of memory available on the GPU (smallest memory if there are multiple GPUs). Future hardware may support GPU page faulting which will enable us to oversubscribe GPU memory, and to do what you describe. In your example, I believe the allocations should start failing after you exceed available GPU memory. Today, pages are not migrated out of GPU memory unless they are touched by the CPU.

  • Frank Winter

    Our application framework allows the user to access a given data portion from both the CPU and the GPU. In order to provide high performance in either case we employ different data layouts depending on the processor type that makes the access, e.g. AoS vs. SoA. Thus, we change the data layout on the fly when migrating data between GPU and CPU memory domains.

    Now, since Unified Memory does the data migration for you, I guess its job is done by just copying the data. Thus I assume it’s not possible to manage-allocate a chunk of memory and pass a user-defined data layout transformation function to the malloc call. I am talking about an optional software hook that would get called when data was migrated by the driver/manager into a ‘staging area’, and from there it could be processed and stored into its final destination by a user-defined function.

    Such a feature would be nice to have.

    • http://www.markmark.net/ Mark Harris

      Unified Memory migration is at the page level. It would be very difficult to generically handle user-defined memory transformations like you describe at that level. I don’t know of any CPU allocators that apply transformations, for example. If it requires explicit memcopies anyway, then Unified Memory doesn’t gain you much.

      As I pointed out in the article, it’s going to be difficult for an automatic system like this to outperform custom-tuned memory management frameworks like you describe — the programmer usually has more information than the runtime or compiler. Since you already have a framework, there is no reason you can’t keep using it.

  • Coiby

    Is “elem->name” a typo?

    dataElem is defined as:

    struct dataElem {
      int prop1;
      int prop2;
      char *text;
    }

    You should use elem->text instead.

    • Mark Ebersole

      Good catch Coiby! I’ll change the “char *text” in dataElem to “char *name”.