CUDA 7 Release Candidate Feature Overview: C++11, New Libraries, and More

It’s almost time for the next major release of the CUDA Toolkit, so I’m excited to tell you about the CUDA 7 Release Candidate, now available to all CUDA Registered Developers. The CUDA Toolkit version 7 expands the capabilities and improves the performance of the Tesla Accelerated Computing Platform and of accelerated computing on NVIDIA GPUs.

Recently NVIDIA released the CUDA Toolkit version 5.5 with support for the IBM POWER architecture. Starting with CUDA 7, all future CUDA Toolkit releases will support POWER CPUs.

CUDA 7 is a huge update to the CUDA platform; there are too many new features and improvements to describe in one blog post, so I’ll touch on some of the most significant ones today. Please refer to the CUDA 7 release notes and documentation for more information. We’ll be covering many of these features in greater detail in future Parallel Forall posts, so check back often!

Support for Powerful C++11 Features

C++11 is a major update to the popular C++ language standard. C++11 includes a long list of new features for simpler, more expressive C++ programming with fewer errors and higher performance. I think Bjarne Stroustrup, the creator of C++, put it best:

C++11 feels like a new language: The pieces just fit together better than they used to and I find a higher-level style of programming more natural than before and as efficient as ever.
Continue reading


CUDA Spotlight: GPU-Accelerated Agent-Based Simulation of Complex Systems

Paul-RichmondThis week’s Spotlight is on Dr. Paul Richmond, a Vice Chancellor’s Research Fellow at the University of Sheffield (a CUDA Research Center). Paul’s research interests relate to the simulation of complex systems and to parallel computer hardware.

The following is an excerpt from our interview (read the complete Spotlight here).

NVIDIA: Paul, tell us about FLAME GPU.
Paul: Agent-Based Simulation is a powerful technique used to assess and predict group behavior from a number of simple interacting rules between communicating autonomous individuals (agents). Individuals typically represent some biological entity such as a molecule, cell or organism and can therefore be used to simulate systems at varying biological scales.

The Flexible Large-scale Agent Modelling Environment for the GPU (FLAME GPU) is a piece of software which enables high level descriptions communicating agents to be automatically translated to GPU hardware. With FLAME GPU, simulation performance is enormously increased over traditional agent-based modeling platforms and interactive visualization can easily be achieved. The GPU architecture and the underlying software algorithms are abstracted from users of the FLAME GPU software, ensuring accessibility to users in a wide range of domains and application areas.

NVIDIA: How does FLAME GPU leverage GPU computing?
Paul: Unlike other agent-based simulation frameworks, FLAME GPU is designed from the ground up with parallelism in mind. As such it is possible to ensure that agents and behavior are mapped to the GPU efficiently in a way which minimizes data transfer during simulation. Continue reading


CUDACasts Episode 16: Thrust Algorithms and Custom Operators

Continuing the Thrust mini-series (see Part 1), today’s episode of CUDACasts focuses on a few of the algorithms that make Thrust a flexible and powerful parallel programming library. You’ll also learn how to use functors, or C++ “function objects”, to customize how Thrust algorithms process data.

In the next CUDACast in this Thrust mini-series, we’ll take a look at how fancy iterators increase the flexibility Thrust has for expressing parallel algorithms in C++.

Continue reading


CUDACasts Episode 15: Introduction to Thrust

Whenever I hear about a developer interested in accelerating his or her C++ application on a GPU, I make sure to tell them about Thrust. Thrust is a parallel algorithms library loosely based on the C++ Standard Template Library. Thrust provides a number of building blocks, such as sort, scans, transforms, and reductions, to enable developers to quickly embrace the power of parallel computing.  In addition to targeting the massive parallelism of NVIDIA GPUs, Thrust supports multiple system back-ends such as OpenMP and Intel’s Threading Building Blocks. This means that it’s possible to compile your code for different parallel processors with a simple flick of a compiler switch.

For this first in a mini-series of screencasts about Thrust, we’ll write a simple sorting program and execute it on both a GPU and a multi-core CPU.  In upcoming episodes, we’ll explore more capabilities of Thrust which really show its flexibility and power. For more examples of using Thrust, read the post Expressive Algorithmic Programming with Thrust, and check out the Thrust Quick Start Guide.

Continue reading


CUDA Spotlight: GPU-Accelerated Genomics

This week’s Spotlight is on Dr. Knut Reinert. Knut is a professor at Freie Universität in Berlin, Germany, and chair of the Algorithms in Bioinformatics group in the Institute of Computer Science. Knut and his team focus on the development of novel algorithms and data structures for problems in the analysis of biomedical mass data. In particular, the group develops mathematical models for analyzing large genomic sequences and data derived from mass spectrometry experiments (for example, for detecting differential expression of proteins between normal and diseased samples). Previously, Knut was at Celera Genomics, where he worked on bioinformatics algorithms and software for the Human Genome Project, which assembled the very first human genome.

On Oct. 22, 2013, Knut will deliver a GTC Express Webinar presentation titled: Intro to SeqAn, an Open-Source C++ Template Library.

Following is an excerpt from our interview (you can read the complete Spotlight here).

Knut Reinert, Freie Univ. Berlin
Knut Reinert, Freie Univ. Berlin

NVIDIA: Knut, tell us about the SeqAn library.
Knut: Before setting up the Algorithmic Bioinformatics group at Freie Universität, I had been working for years at a U.S. company – Celera Genomics in Maryland – where I worked on the assembly of both the Drosophila (fruit fly) and human genomes. A central part of these projects was the development of large software packages containing algorithms for assembly and genome analysis developed by the Informatics Research team at Celera.

Although successful, the endeavor clearly showed the lack of available implementations in sequence analysis, even for standard tasks. Implementations of much needed algorithmic components were either not available, or hard to access in third-party, monolithic software products.

With this experience in mind, and being educated at the Max-Planck Institute for Computer Science in Saarbrücken (the home of very successful software libraries like LEDA and CGAL) I put the development of such a software library high on my research agenda. Continue reading

Copperhead: Data Parallel Python

Programming environments like C and Fortran allow complete and unrestricted access to computing hardware, but often require programmers to understand the low-level details of the hardware they target. Although these efficiency-oriented systems are essential to every computing platform, many programmers prefer to use higher level programming environments like Python or Ruby, focused on productivity rather than absolute performance. Productivity-focused programmers solving large or intensive problems do need high performance, and many seek to exploit parallel computing, but without the costs of understanding low-level hardware details or programming directly to a particular machine.

Copperhead is a project that aims to enable productivity-focused programmers to take advantage of parallel computing, without explicitly coding to any particular machine. Copperhead programs use familiar Python primitives such as map and reduce, and they execute in parallel on both CUDA-enabled GPUs as well as multicore

Parallel Hello World: axpy

Let’s start with an example: below find Copperhead code for axpy, the “hello world” of parallel programs. (axpy is the type-generic form of saxpy. See Six Ways to SAXPY for more.)

from copperhead import *
import numpy as np

def axpy(a, x, y):
    return [a * xi + yi for xi, yi in zip(x, y)]

n = 1000000
a = 2.0
x = np.random.rand(n)
y = np.random.rand(n)
with places.gpu0:
    gpu_result = axpy(a, x, y)
with places.openmp:
    cpu_result = axpy(a, x, y)

Continue reading

Six Ways to SAXPY

This post is a GPU program chrestomathy. What’s a Chrestomathy, you ask?

In computer programming, a program chrestomathy is a collection of similar programs written in various programming languages, for the purpose of demonstrating differences in syntax, semantics and idioms for each language. [Wikipedia]

There are several good examples of program chrestomathies on the web, including Rosetta Code andNBabel, which demonstrates gravitational N-body simulation in multiple programming languages. In this post I demonstrate six ways to implement a simple SAXPY computation on the CUDA platform. Why is this interesting? Because it demonstrates the breadth of options you have today for programming NVIDIA GPUs, and it covers the three main approaches to GPU computing: GPU-accelerated libraries, GPU compiler directives, and GPU programming languages.

SAXPY stands for “Single-Precision A·X Plus Y”.  It is a function in the standard Basic Linear Algebra Subroutines (BLAS)library. SAXPY is a combination of scalar multiplication and vector addition, and it’s very simple: it takes as input two vectors of 32-bit floats X and Y with N elements each, and a scalar value A. It multiplies each element X[i] by A and adds the result to Y[i]. A simple C implementation looks like this.

void saxpy(int n, float a, float *x, float *y)
  for (int i = 0; i < n; ++i)
      y[i] = a*x[i] + y[i];

// Perform SAXPY on 1M elements
saxpy(1<<20, 2.0, x, y);

Continue reading

Expressive Algorithmic Programming with Thrust

Thrust is a parallel algorithms library which resembles the C++ Standard Template Library (STL). Thrust’s High-Level interface greatly enhances programmer Productivity while enabling performance portability between GPUs and multicore CPUs. Interoperability with established technologies (such as CUDA, TBB, and OpenMP) facilitates integration with existing software. Develop High-Performance applications rapidly with Thrust!

This excerpt from the Thrust home page perfectly summarizes the benefits of the Thrust library. Thrust enables expressive algorithmic programming via a vocabulary of parallel building blocks that let you rapidly develop fast, portable parallel algorithms. If you are a C++ programmer, and especially if you use template libraries like the STL and Boost C++ libraries, then you will find Thrust familiar. Like the STL, Thrust helps you focus on algorithms, rather than on platform-specific implementation details. At the same time, Thrust’s modular design allows low-level customization and interoperation with custom platform-specific code such as CUDA kernels and libraries.

Thrust is High-Level

As described in the article “Thrust, a Productivity-Oriented Library for CUDA”, Thrust aims to solve two types of problems: problems that can be “implemented efficiently without a detailed mapping to the target architecture”, and problems that don’t merit or won’t receive (for whatever reason) significant optimization attention from the programmer. High-level primitives make it easier to capture programmer intent; developers describe what to compute, without dictating how to compute it. This allows the library to make informed decisions about how to implement the intended computation.

Thrust provides an STL-style vector container (with host_vector and device_vector implementations), and a suite of high-level algorithms including searchingsortingcopyingmergingtransformingreordering,reducingprefix sums, and set operations. Here is an oft-repeated complete example program from the Thrust home page, which generates random numbers serially and then transfers them to the GPU where they are sorted.


int main(void)
  // generate 32M random numbers serially
  thrust::host_vector h_vec(32 << 20);
  std::generate(h_vec.begin(), h_vec.end(), rand);

  // transfer data to the device
  thrust::device_vector d_vec = h_vec;

  // sort data on the device
  thrust::sort(d_vec.begin(), d_vec.end());

  // transfer data back to host
  thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());

  return 0;

Continue reading