It’s almost time for the next major release of the CUDA Toolkit, so I’m excited to tell you about the CUDA 7 Release Candidate, now available to all CUDA Registered Developers. The CUDA Toolkit version 7 expands the capabilities and improves the performance of the Tesla Accelerated Computing Platform and of accelerated computing on NVIDIA GPUs.
Recently NVIDIA released the CUDA Toolkit version 5.5 with support for the IBM POWER architecture. Starting with CUDA 7, all future CUDA Toolkit releases will support POWER CPUs.
CUDA 7 is a huge update to the CUDA platform; there are too many new features and improvements to describe in one blog post, so I’ll touch on some of the most significant ones today. Please refer to the CUDA 7 release notes and documentation for more information. We’ll be covering many of these features in greater detail in future Parallel Forall posts, so check back often!
Support for Powerful C++11 Features
C++11 is a major update to the popular C++ language standard. C++11 includes a long list of new features for simpler, more expressive C++ programming with fewer errors and higher performance. I think Bjarne Stroustrup, the creator of C++, put it best:
C++11 feels like a new language: The pieces just fit together better than they used to and I find a higher-level style of programming more natural than before and as efficient as ever.
OpenACC is a high-level programming model for accelerating applications with GPUs and other devices using compiler directives compiler directives to specify loops and regions of code in standard C, C++ and Fortran to offload from a host CPU to an attached accelerator. OpenACC simplifies accelerating applications with GPUs. An often-overlooked feature of OpenACC is its ability to interoperate with the broader parallel programming ecosystem. In this post I’ll teach you 3 powerful interoperability techniques for combining OpenACC and CUDA: the
host_data construct, the
deviceptr clause, and the
acc_map_data() API function.
I’ll demonstrate these techniques with several examples of mixing OpenACC with CUDA C++, CUDA Fortran, Thrust, and GPU-accelerated libraries. If you’d like to follow along at home, grab the source code for the examples from Github and try them out with your OpenACC compiler and the CUDA Toolkit. Don’t have an OpenACC compiler? You can download a free 30-day trial of the PGI accelerator compiler.
You may already be thinking to yourself, “If OpenACC is so great, why would I need to use it with CUDA?” OpenACC interoperability features open the door to the GPU-computing ecosystem, allowing you to leverage more than 10 years of code development. Need to multiply two matrices together? Don’t write your own function, just call the cuBLAS library, which has been heavily optimized for GPUs. Does your colleague already have a CUDA routine that you could use in your code? Use it! Interoperability means that you can always use the best tool for the job in any situation. Accelerate your application using OpenACC, but call an optimized library. Expand an existing CUDA application by adding OpenACC to unaccelerated routines. Your choice isn’t OpenACC or CUDA, it’s OpenACC and CUDA. Continue reading