CUDA 7 Release Candidate Feature Overview: C++11, New Libraries, and More

It’s almost time for the next major release of the CUDA Toolkit, so I’m excited to tell you about the CUDA 7 Release Candidate, now available to all CUDA Registered Developers. The CUDA Toolkit version 7 expands the capabilities and improves the performance of the Tesla Accelerated Computing Platform and of accelerated computing on NVIDIA GPUs.

NVIDIA previously announced plans to support the IBM POWER architecture. Starting with CUDA 7, all future CUDA Toolkit releases will support POWER CPUs.

CUDA 7 is a huge update to the CUDA platform; there are too many new features and improvements to describe in one blog post, so I’ll touch on some of the most significant ones today. Please refer to the CUDA 7 release notes and documentation for more information. We’ll be covering many of these features in greater detail in future Parallel Forall posts, so check back often!

Support for Powerful C++11 Features

C++11 is a major update to the popular C++ language standard. C++11 includes a long list of new features for simpler, more expressive C++ programming with fewer errors and higher performance. I think Bjarne Stroustrup, the creator of C++, put it best:

C++11 feels like a new language: The pieces just fit together better than they used to and I find a higher-level style of programming more natural than before and as efficient as ever.

CUDA 7 adds C++11 feature support to nvcc, the CUDA C++ compiler. This means that you can use C++11 features not only in your host code compiled with nvcc, but also in device code. In your device code, you can now use new C++ language features like auto, lambda, variadic templates, static_assert, rvalue references, range-based for loops, and more.

Here’s a little example I whipped up to demonstrate using C++ auto, lambdas, and range-based for loops in CUDA device code. This program defines a generic find routine and then uses it in a kernel with a lambda function that customizes it to count occurrences of any of a list of letters in a text.

#include <iostream>
#include <cstring>

// Generic parallel find routine. Threads search through the
// array in parallel. A thread returns the index of the
// first value it finds that satisfies predicate `p`, or -1.
template <typename T, typename Predicate>
__device__ int find(T *data, int n, Predicate p)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += blockDim.x * gridDim.x)
        if (p(data[i])) return i;
    return -1;
}

// Use find with a lambda function that searches for x, y, z
// or w. Note the use of a range-based for loop inside the
// lambda, and that auto means we don't have to know the type
// of the lambda or the array.
__global__ void xyzw_frequency(unsigned int *count, char *data, int n)
{
    auto match_xyzw = [](char c) {
        const char letters[] = { 'x', 'y', 'z', 'w' };
        for (const auto x : letters)
            if (c == x) return true;
        return false;
    };

    int i = find(data, n, match_xyzw);

    if (i >= 0) atomicAdd(count, 1);
}

int main(void)
{
    char text[] = "zebra xylophone wax";
    char *d_text;

    cudaMalloc(&d_text, sizeof(text));
    cudaMemcpy(d_text, text, sizeof(text), cudaMemcpyHostToDevice);

    unsigned int *d_count;
    cudaMalloc(&d_count, sizeof(unsigned int));
    cudaMemset(d_count, 0, sizeof(unsigned int));

    xyzw_frequency<<<1, 64>>>(d_count, d_text, strlen(text));

    unsigned int count;
    cudaMemcpy(&count, d_count, sizeof(unsigned int), cudaMemcpyDeviceToHost);

    std::cout << count << " instances of 'x', 'y', 'z', or 'w' in "
              << text << std::endl;

    cudaFree(d_text);
    cudaFree(d_count);

    return 0;
}
Here’s how I compiled and ran this code on my MacBook Pro (GeForce GT 750M), and the output.

$ nvcc --std=c++11 c++11_cuda.cu -o c++11_cuda
$ ./c++11_cuda 
5 instances of 'x', 'y', 'z', or 'w' in zebra xylophone wax

New Capabilities and Higher Performance for Thrust

CUDA 7 includes a brand-new release of Thrust, version 1.8. Modeled after the C++ Standard Template Library, the Thrust library brings a familiar abstraction layer to the realm of parallel computing, providing efficient and composable parallel algorithms that operate on vector containers.

Thrust 1.8 introduces support for algorithm invocation from CUDA __device__ code, support for CUDA streams, and algorithm performance improvements. Users may now invoke Thrust algorithms from CUDA __device__ code, providing a parallel algorithms library to CUDA programmers authoring custom kernels, as well as allowing Thrust programmers to nest their algorithm calls within functors. The thrust::seq execution policy allows you to enforce sequential algorithm execution in the calling thread and makes a sequential algorithms library available to individual CUDA threads. The .on(stream) syntax lets you specify a CUDA stream for kernels launched during algorithm execution.
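As a quick sketch of how these two capabilities look in code (the kernel name, slice layout, and sizes below are illustrative, not part of Thrust itself): a kernel in which each CUDA thread runs a sequential `thrust::reduce` over its own slice of an array via the `thrust::seq` policy, and a host-side `thrust::sort` whose kernels are placed on a user-created stream with `.on(stream)`.

```cuda
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/reduce.h>
#include <thrust/sort.h>
#include <thrust/system/cuda/execution_policy.h>

// Illustrative kernel: each thread sequentially reduces its own
// contiguous slice of `data` using the thrust::seq execution policy.
__global__ void per_thread_sum(const float *data, int slice, float *out)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    out[t] = thrust::reduce(thrust::seq,
                            data + t * slice,
                            data + (t + 1) * slice,
                            0.0f);
}

int main(void)
{
    thrust::device_vector<float> v(1 << 20, 1.0f);

    // Host-side algorithm call whose kernels run on a specific stream.
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    thrust::sort(thrust::cuda::par.on(stream), v.begin(), v.end());
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    return 0;
}
```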

Thrust 1.8 also includes new CUDA algorithm implementations with substantial performance improvements. Here are some example measured improvements on a Tesla K20c accelerator for large problem sizes (using the CUDA Thrust backend):

  • thrust::sort is 300% faster for user-defined types and 50% faster for primitive types;
  • thrust::merge is 200% faster;
  • thrust::reduce_by_key is 25% faster;
  • thrust::scan is 15% faster.

See the Thrust 1.8 CHANGELOG for a full list of improvements since v1.7.

cuSOLVER: A Powerful New Direct Linear Solver Library

In CUDA 7, a new library joins the growing suite of numerical libraries for accelerated computing. cuSOLVER provides dense and sparse direct linear solvers and eigenvalue solvers.

The intent of cuSOLVER is to provide useful LAPACK-like features, such as common matrix factorization and triangular solve routines for dense matrices, a sparse least-squares solver and an eigenvalue solver. In addition cuSOLVER provides a new refactorization library useful for solving sequences of matrices with a shared sparsity pattern.
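To give a flavor of the dense API, here is a minimal sketch of an LU-based solve of A x = b following the cuSOLVER getrf/getrs pattern; error checking is omitted for brevity, and the function name and calling convention are my own illustration.

```cuda
#include <cusolverDn.h>

// Sketch: solve A*x = b (n x n, double precision) via LU factorization.
// d_A and d_b are device pointers; on return d_b holds the solution x.
void solve_lu(int n, double *d_A, double *d_b)
{
    cusolverDnHandle_t handle;
    cusolverDnCreate(&handle);

    // Query workspace size for the factorization.
    int lwork = 0;
    cusolverDnDgetrf_bufferSize(handle, n, n, d_A, n, &lwork);

    double *d_work; int *d_ipiv, *d_info;
    cudaMalloc(&d_work, lwork * sizeof(double));
    cudaMalloc(&d_ipiv, n * sizeof(int));
    cudaMalloc(&d_info, sizeof(int));

    // Factor A = P*L*U, then solve for a single right-hand side.
    cusolverDnDgetrf(handle, n, n, d_A, n, d_work, d_ipiv, d_info);
    cusolverDnDgetrs(handle, CUBLAS_OP_N, n, 1, d_A, n, d_ipiv, d_b, n, d_info);

    cudaFree(d_work); cudaFree(d_ipiv); cudaFree(d_info);
    cusolverDnDestroy(handle);
}
```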

cuSOLVER running on a Tesla GPU can provide large speedups compared to running on a CPU, as you can see in Figures 1 and 2.

cuSOLVER DN speedups
Figure 1: cuSOLVER 7.0 dense speedups on a Tesla K40 GPU compared to MKL 11.0.4 on an Intel Xeon E5-2697 v3 CPU @ 3.60 GHz with 14 cores
cuSOLVER Sparse Speedups
Figure 2: cuSOLVER 7.0 sparse speedups on a Tesla K40 GPU compared to an Intel Xeon E5-2697 v3 CPU @ 3.60 GHz with 14 cores.

cuFFT Performance Improvements

cuFFT 7.0 improves FFT performance by up to 3.5x for sizes that are composite powers of 2, 3, 5, and 7. Figure 3 shows speedups of cuFFT 7.0 vs. cuFFT 6.5 for 1D FFTs, and Figure 4 shows speedups for 3D FFTs. (Experiments were performed on a Tesla K20c with ECC ON, batched transforms on 32M total elements, input and output data on device.)

cuFFT 7.0 1D Speedups
Figure 3: 1D FFT speedups in CUDA 7 vs. CUDA 6.5
cuFFT 7.0 3D Speedups
Figure 4: 3D FFT speedups in CUDA 7 vs. CUDA 6.5
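For reference, a batched 1D complex-to-complex transform like those benchmarked above takes only a few cuFFT calls. This is a sketch: `d_signal` is assumed to already hold `batch * n` cufftComplex elements on the device.

```cuda
#include <cufft.h>

// Sketch: run `batch` independent n-point C2C FFTs in one call,
// transforming d_signal in place.
void fft_batched(cufftComplex *d_signal, int n, int batch)
{
    cufftHandle plan;
    cufftPlan1d(&plan, n, CUFFT_C2C, batch);               // one plan covers all batches
    cufftExecC2C(plan, d_signal, d_signal, CUFFT_FORWARD); // in-place forward transform
    cufftDestroy(plan);
}
```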

Runtime Compilation

CUDA Runtime Compilation Flow
The new Runtime Compilation library (nvrtc) provides an API to compile CUDA-C++ device source code at run time. You can launch the resulting compiled PTX on a GPU using the CUDA Driver API. Runtime Compilation enables run-time code generation, and run-time specialization of CUDA kernel code, with much lower overhead compared to launching nvcc from your application at run time.

Many CUDA developers optimize CUDA kernels using template parameters, which allows them to generate multiple optimized versions of a kernel at compile time for use under different circumstances. Consider the following (simplified) example. Here we have a loop for which the number of iterations depends on the thread block size (a parallel reduction might use this pattern).

__device__ void foo(float *x) {
  for (int i = 1; i <= blockDim.x; i *= 2) {
    doSomething(x, i);
  }
}
If we know the block size at compile time, we can hard code the loop limit, which enables the compiler to unroll the loop. But we might need to support multiple block sizes, so it’s more flexible to use a template parameter.

template <int blocksize>
__device__ void foo(float *x) {
  #pragma unroll
  for (int i = 1; i <= blocksize; i *= 2) {
    doSomething(x, i);
  }
}
But template parameter values must be constant at compile time, so to use this code for multiple block sizes, we have to hard-code all of the block sizes we want to support in a switch or if/else block. That’s painful. But with Runtime Compilation, run time is compile time, so we can simply generate and compile the exact version of the kernel we need based on run-time values. This run-time code specialization can result in highly tuned code.
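As a sketch of what that specialization might look like with nvrtc (error handling omitted; the kernel source string, the stand-in `doSomething`, and the helper name are my own illustration, not from the toolkit samples): generate source with the run-time block size baked into the template argument, compile it to PTX, and the PTX can then be launched through the Driver API.

```cuda
#include <nvrtc.h>
#include <string>
#include <vector>

// Sketch: compile a kernel specialized for a block size known only at run time.
std::string compile_for_blocksize(int blocksize)
{
    // Build CUDA C++ source with the run-time value as the template argument.
    std::string src =
        "__device__ void doSomething(float *x, int i) { x[0] += i; }\n"
        "template <int blocksize>\n"
        "__device__ void foo(float *x) {\n"
        "  #pragma unroll\n"
        "  for (int i = 1; i <= blocksize; i *= 2) doSomething(x, i);\n"
        "}\n"
        "extern \"C\" __global__ void foo_kernel(float *x) {\n"
        "  foo<" + std::to_string(blocksize) + ">(x);\n"
        "}\n";

    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, src.c_str(), "foo.cu", 0, NULL, NULL);

    const char *opts[] = { "--gpu-architecture=compute_30" };
    nvrtcCompileProgram(prog, 1, opts);

    // Retrieve the generated PTX; load it with cuModuleLoadData and
    // launch "foo_kernel" with cuLaunchKernel via the CUDA Driver API.
    size_t ptx_size;
    nvrtcGetPTXSize(prog, &ptx_size);
    std::vector<char> ptx(ptx_size);
    nvrtcGetPTX(prog, ptx.data());
    nvrtcDestroyProgram(&prog);

    return std::string(ptx.begin(), ptx.end());
}
```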

(Note: Runtime Compilation is a preview feature in CUDA 7.0 and any or all parts of this specification are subject to change in the next CUDA release.)

Much More to Explore

This brief look at CUDA 7 should give you a feeling for how powerful this new release is, but we’ve barely scratched the surface. To mention a few other features, CUDA 7 supports GPU core dumps for easier remote and cluster debugging; new CUDA Memcheck tools for detecting uninitialized data and improper synchronization; multi-GPU support in the CUDA multi-process server; and support for new platforms and compilers.

So don’t wait: the CUDA Toolkit version 7 Release Candidate is available right now. It even has a great new network installer that only downloads and installs the CUDA Toolkit components that you need, saving time and bandwidth.

Download CUDA today at

Learn More in Upcoming Webinars and at GTC 2015

Join us for two CUDA 7 webinars:

Want to learn more about accelerated computing on the Tesla Platform and about GPU computing with CUDA? Come to the GPU Technology Conference, the world’s largest and most important GPU developer conference, March 17-20 at the San Jose Convention Center. Readers of Parallel Forall can use the discount code GM15PFAB to get 20% off any conference pass! Register by January 20 to take advantage of early bird pricing.

  • rnickb

    do you support any c++14 features yet? like auto-deduced return types?

    • CUDA 7 does not officially support C++14 features, and from my quick tests, features like auto-deduced return types, generic lambdas, etc., are not working yet.

      • Aditya Ramesh

        Is there an ETA for when CUDA will add support for C++14? It’s great that CUDA now supports the use of modern C++ paradigms. But I already make heavy use of C++14 features in some of my projects, and I am very eager to add support for CUDA to them. I’d love to hear about what the plans are regarding C++14 support.

        • It would be helpful to hear from you and others which C++14 features are most important to you, and how you would like to use them in __device__ code. Thanks!

          • Aditya Ramesh

            Thanks for the amazingly fast response! As an example, one codebase I work on is an nd array library similar to Eigen, except that it allows loop optimizations (e.g. parallelization, tiling, unrolling, permutation) to be applied when the RHS of an expression is assigned. I am _very_ interested in adding GPU “gridification” as an additional optimization to this library: this would allow the user to write concise expressions involving arrays (using C++14 equivalents of numpy’s syntax), and choose to evaluate the expression either on the CPU or the GPU, without rewriting any code. Generic lambdas are used internally in the library to decompose loop nests and apply the loop optimizations. Variable templates are used to allow concise shorthand, e.g. “cr” to express the constant range [0, 10] with stride 2. This lets me write things like “arr(cr) = 5”. This library is being used as the (CPU) backend for a new deep neural network library. With support for the appropriate C++14 features (auto return types, generic lambdas, variable templates), the DNN library would benefit from GPU acceleration without rewriting much code. Please let me know if you would like more details, code, etc.

          • Your project sounds interesting Aditya– is it open source? What do you mean by “C++14 equivalents of numpy’s syntax”?

          • Aditya Ramesh

            Here’s a link to a page in the Github repo describing the project: I haven’t advertised the project publicly yet, as I am still implementing the last few features and writing the documentation. The ETA until I publicly release the project is two weeks. Being able to generate GPU code in C++ using the terse syntax described in that page is why I was very excited to learn that CUDA is beginning to support modern C++ features. Please let me know if you would like any more information.


    I’ve already used some c++11 features in device code long before CUDA 7, with VS2010, does it only mean that it wasn’t officially supported until CUDA 7? Can I use lambda as device functor now?

    • You can use lambda in device code (as a functor or otherwise) as long as its definition is in device code. You can’t (yet) pass a lambda from host code to device code (i.e. as a kernel argument).

      I’m curious which C++11 features you were able to use in device code in the past? There was an undocumented option (–std=c++11) in CUDA 6.5, but not before that. But nvcc does use the EDG C++ front end, so it’s possible some features that require front-end compilation only may have worked if they were supported by the version of EDG used.

      In any case, CUDA 7 is the first version with official support. Note that not everything in C++11 is supported on the device at this stage. It’s mostly language features, not standard library features, like std::thread or STL. We plan to provide more detailed information in a future blog post.

      • TONGARI J

        All the features you mentioned, except range-based for loops and variadic templates (which aren’t supported by the host compiler, VC10), work fine in CUDA 6.5 with VS2010, no special compiler flags needed.

      • moops

        I guess the big question is “Does NVIDIA see a conflict with std::thread and std::atomic semantics and NVIDIA parallel constructs?”

        • Can you elaborate?

          • moops

            having code like

            void task();
            std::thread t(task);

            is an explicit fork-join task-parallel model. Would a program that exposes its parallelism using C++11 threading be mappable to an NVIDIA GPU by nvcc?

            I think you could support the use of std::atomic as pass-through code to CUDA atomics, or built from them while keeping GPU thread execution semantics. Probably std::mutex is buildable as well.

            Perhaps a subset of std::thread programs could be efficiently mapped to a SIMT hardware design but it would take a very clever compiler to even discover a person was trying to write SIMT code with std::thread. Also std::thread supports a whole bunch of parallel programming styles that perhaps will never map efficiently onto a GPU architecture.

            I just never hear anyone at NVIDIA even mention std::thread, even the ones that are very involved in the C++ standards development.

          • std::thread, like pthreads, is not currently compatible with GPU execution. These C++ features are something we consider a possibility for the future.

      • Since time has passed and CUDA 7.5 is out, let me correct my above comment. With CUDA 7.5 there is a new experimental feature “GPU Lambdas”, which allows you to define a lambda in host code with a __device__ annotation and pass it to a kernel. This effectively gives you the ability to “launch” a lambda. See the CUDA 7.5 features post:

    • This is because MSVC enables C++11 support without any flags / options specified. But CUDA 6.5 does not officially support C++11.

  • Vimal Thilak

    Mark: Happy days! Looks like I can use libc++ on OS X. This means that I no longer have to maintain other dependencies because of libstdc++ dependency!

  • dwash59

    are the 3D FFT improvements only on the K20 or other GPUs as well?

    • The cuFFT improvements are not limited to K20 (I fixed the confusing wording in the post). Also, they are not limited to 3D FFTs! I’ve added a graph showing speedups for 1D FFTs.

  • Axel Huebl

    Once again: that’s a fantastic feature set in this release! We are really looking into the constexpr support on the device side (and to throw out a huge amount of self-written auto, lambda features).

    One unrelated quick question: was the support for the PGI compiler on the host-side added (#439486)?

  • SwAY

    cuSOLVER is great news for the signal processing community! Is it possible to stream cuSOLVER functions in order to use them in batch mode (to compute many medium-size matrices)? It would be interesting to compare cuSOLVER with the batched solver sample code available on the registered developer website.

    • Yes, cuSOLVER supports CUDA streams. Also, cuSolver contains some batched operations: batched sparse QR and batched refactorization. The cuSOLVER PDF documentation included with the CUDA Toolkit v7 RC download provides full details.

  • Jürgen Bohl

    Does someone happen to know whether this new release of cuFFT does support callbacks (cufftXTSetCallback etc.) on Windows?

    • Ujval Kapasi

      No, cufftXTSetCallback is not supported on Windows in this release.

      • Jürgen Bohl

        Well, it’s a pity. Nevertheless, thanks a lot for your reply!

  • Peter V./Vienna/Austria/Europe

    apparently I am too blind and cannot find the setting in CUDA NSight 7.0 which enables c++11 standard. Can you please help me and tell me where to enable this option? I am using NSight to compile/link the project.

  • Will CUDA 7 support 32-bit Windows?

    The release notes document is rather unclear (says that the CUDA Toolkit would be 64-bit only)…

  • Alexander Y. Chen

    I’m extremely interested in the nvrtc runtime compilation library. However in the samples there are only 3 applications under 4_Finance using this library and none of them makes use of templates. It’s not apparent how you would pass in template parameters, especially the parameters deduced by host compiler. Is there a sample where runtime compilation of templated device code is used? Or we should wait for the “libnvrtc user guide” which was mentioned in the release notes?

  • pudtaan

    xyzw_frequency<<>>(d_count, d_text, strlen(text));

    When using one dimension the result is not correct. How can I solve it?

    • This is just a trivial example meant to demonstrate the language features, not to handle all configurations. Specifically, as written, it requires there to be at least as many threads running as there are matches. I realize this is kind of a silly limitation, but again, I whipped it up quickly just to demonstrate language features. I’m sure there are better ways to parallelize this and make it more general.

  • dancy

    are there any cuSolverDN examples available? There is no sample in the path CUDA Samples\v7.0, and it seems there is only a mathematical step description without exact code in Appendix C of the cuSOLVER PDF document. Moreover, what’s the meaning of trsm mentioned in step 3?

  • John Bray

    I pasted your example code into a file, and got

    nvcc -V
    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2014 NVIDIA Corporation
    Built on Tue_Dec__9_18:10:46_CST_2014
    Cuda compilation tools, release 7.0, V7.0.17

    nvcc --std=c++11 -o c++11_cuda
    internal error: assertion failed: remove_from_variables_list: not found (/dvs/p4/build/sw/rel/gpu_drv/r346/r346_00/drivers/compiler/edg/EDG_4.9/src/il.c, line 13467)

    1 catastrophic error detected in the compilation of “/tmp/tmpxft_0000ad8c_00000000-9_c++11_cuda.cpp1.ii”.
    Compilation aborted.
    nvcc error : ‘cudafe’ died due to signal 6
    nvcc error : ‘cudafe’ core dumped

    Is there anything I can do about this?

    • Sorry about this John — this is a known bug in nvcc in the CUDA 7.0 RC, and it’s already fixed in the version that will be in the CUDA 7.0 official release (and note it doesn’t affect mac/clang). The workaround is to declare the list of letters in its own variable — in fact I’ve updated the code in the post to use the workaround so it doesn’t bite anyone else.

      • John Bray

        Revised version compiles. Thanks

  • pSz

    Let me thank here too Mark for the recent updates to this article which swapped the previous highly questionable performance benchmarks (CUDA 7.0 on GPU vs MKL on CPU) with relevant and much more reasonable comparisons.

    Kudos for the prompt action and keep up the good work!

    Reminder: the cuSOLVER home page still needs fixing.

  • Arne Kreutzmann

    why is the Intel compiler supported, and C++11, but not the combination?

    nvcc warning : The -c++11 flag is not supported with the configured host compiler. Flag will be ignored.


  • bahar

    I want to compile a CUDA-based program with C++11 features in Nsight Eclipse. The CUDA version is 7.5. Although I enabled the -std=c++11 flag in the Nsight Eclipse settings, I still get this error:
    “nvcc warning : The -c++11 flag is not supported with the configured host compiler. Flag will be ignored”. Do you know how I can solve this problem?

    • What is your host compiler? What version? If gcc, try `gcc --version`