In the Trenches at GTC: CUDA 5 and Beyond

By Michael Wang, The University Of Melbourne, Australia (GTC 2012 Guest Blogger)

Following up the opening keynote by NVIDIA CEO and co-founder Jen Hsun-Huang, Mark Harris took the very same stage (albeit with a more intimate crowd) for his afternoon session entitled CUDA 5 And Beyond.

Mark walked us through the major features of the upcoming CUDA 5.0 release, and took some time to share a vision for the future of GPU and massively parallel programming generally.

The four aspects of CUDA 5 that Mark highlighted were:

Dynamic parallelism
GPU object linking
NVIDIA Nsight, Eclipse Edition
GPUDirect for clusters (RDMA)

First up was the concept of Dynamic Parallelism, made possible by the new Kepler architecture. What it means for developers is that we are no longer restricted to launching kernels from the host side only. In short, GPU Kernels Can Spawn More GPU Kernels.

As Mark says, “it’s just CUDA.” So you’ll be able to call any of the CUDA-accelerated library functions (which I covered in yesterday’s post) from within a kernel too. The big upside to all of this is the data stays on the GPU the whole time, which is a great saving for data-dependent kernel launches.

As Mark explained, dynamic parallelism was the secret in making the new n-body simulation possible (at 280k particles, this was a 10-fold increase over the previous generation demo). In the majority of our simulations the “interesting stuff” (and hence compute work) is non-uniformly distributed in the domain. With dynamic parallelism, the kernel can now detect regions where more “action” is taking place and devote more compute power (threads) to them. That makes me really happy!

Mark also went on to explain separate kernel compilation and linking within CUDA 5. Individual .cu files can now be compiled to .o objects and linked later on, rather than compiling a massive heap of .cu source files at once. This brings about support for static libraries as well as better support for closed-sources library distribution from 3^rd party developers.

GPUDirect RDMA in Kepler will allow one GPU to access another GPU’s memory across the network—an extension of the peer-to-peer mode of two devices on a single node we all know and love.

I’d like to finish by reflecting on Mark’s sentiments that CUDA has become more than a parallel extension to C, it is a parallel programming platform. And, its true success is not measured by the sales or adoption figures, but by the level of science that this technology has enabled in the past five years.

Going forward, Mark hinted at paradigms shifts on the parallel programming language front: from template-based C++, to directives such as OpenACC, to integration with domain specific languages such as MATLAB and R.

Be sure to watch the streamcast of this presentation : CUDA 5 and Beyond

About our Guest Blogger

Michael Wang is a Ph.D. candidate at The University of Melbourne, Australia, and an organizer of the Melbourne GPU Users Meetup.