GPU-Accelerated Cosmological Analysis on the Titan Supercomputer

Ever looked up in the sky and wondered where it all came from? Cosmologists are in the same boat, trying to understand how the Universe arrived at the structure we observe today. They use supercomputers to follow the fate of very small initial fluctuations in an otherwise uniform density. As time passes, gravity causes the small fluctuations to grow, eventually forming the complex structures that characterize the current Universe. The numerical simulations use tracer particles representing lumps of matter to carry out the calculations. The distribution of matter at early times is known only in a statistical sense so we can’t predict exactly where galaxies will show up in the sky. But quite accurate predictions can be made for how the galaxies are distributed in space, even with relatively simplified simulations.

Figure 1: Visualization of the Q Continuum simulation generated with the vl3 parallel volume rendering system using a point sprite technique. Image courtesy of Silvio Rizzi and Joe Insley, Argonne National Laboratory.
Figure 1: Visualization of the Q Continuum simulation generated with the vl3 parallel volume rendering system using a point sprite technique. Image courtesy of Silvio Rizzi and Joe Insley, Argonne National Laboratory.

As observations become increasingly detailed, the simulations need to become more detailed as well, requiring huge amounts of simulation particles. Today, top-notch simulation codes like the Hardware/Hybrid Accelerated Cosmology Code (HACC ) [1] can follow more than half a trillion particles in a volume of more than 1 Gpc3 (1 cubic Gigaparsec. That’s a cube with sides 3.26 billion light years long) on the largest scale GPU-accelerated supercomputers like Titan at the Oak Ridge National Laboratory. The “Q Continuum” simulation [2] that Figure 1 shows is an example.

While simulating the Universe at this resolution is by itself a challenge, it is only half the job. The analysis of the simulation results is equally challenging. It turns out that GPUs can help with accelerating both the simulation and the analysis. In this post we’ll demonstrate how we use Thrust and CUDA to accelerate cosmological simulation and analysis and visualization tasks, and how we’re generalizing this work into libraries and toolkits for scientific visualization.

An object of great interest to cosmologists is the so-called “halo”. A halo is a high-density region hosting galaxies and clusters of galaxies, depending on the halo mass. The task of finding billions of halos and determining their centers is computationally demanding. Continue reading


The Power of C++11 in CUDA 7

Today I’m excited to announce the official release of CUDA 7, the latest release of the popular CUDA Toolkit. Download the CUDA Toolkit version 7 now from CUDA Zone!

LambdaCUDA 7 has a huge number of improvements and new features, including C++11 support, the new cuSOLVER library, and support for Runtime Compilation. In a previous post I told you about the features of CUDA 7, so I won’t repeat myself here. Instead, I wanted to take a deeper look at C++11 support in device code.

CUDA 7 adds C++11 feature support to nvcc, the CUDA C++ compiler. This means that you can use C++11 features not only in your host code compiled with nvcc, but also in device code. New C++ language features include auto, lambda functions, variadic templates, static_assert, rvalue references, range-based for loops, and more. To enable C++11 support, pass the flag --std=c++11 to nvcc (this option is not required for Microsoft Visual Studio).

In my earlier CUDA 7 feature overview post, I presented a small example to show some C++11 features. Let’s dive into a somewhat expanded example to show the power of C++11 for CUDA programmers. This example will proceed top-down, covering a couple of layers of abstraction that allow us to write concise, reusable C++ code for the GPU, all enabled by C++11. The complete example is available on Github.

Let’s say we have a very specific (albeit contrived) goal: count the number of characters from a certain set within a text. (In parallel, of course!) Here’s a simple CUDA C++11 kernel that abstracts the mechanics of this a bit.

void xyzw_frequency(int *count, char *text, int n)
    const char letters[] { 'x','y','z','w' };

    count_if(count, text, n, [&](char c) {
        for (const auto x : letters) 
            if (c == x) return true;
        return false;

Continue reading


Accelerating Bioinformatics with NVBIO

NVBIO is an open-source C++ template library of high performance parallel algorithms and containers designed by NVIDIA to accelerate sequence analysis and bioinformatics applications. NVBIO has a threefold focus:

  1. Performance, providing a suite of state-of-the-art parallel algorithms that offer a significant leap in performance;
  2. Reusability, providing a suite of highly expressive and flexible template algorithms that can be easily configured and adjusted to the many different usage scenarios typical in bioinformatics;
  3. Portability, providing a completely cross-platform suite of tools, that can be easily switched from running on NVIDIA GPUs to multi-core CPUs by changing a single template parameter.

Exponential Parallelism

We built NVBIO because we believe only the exponentially increasing parallelism of many-core GPU architectures can provide the immense computational capability required by the exponentially increasing sequencing throughput.

There is a common misconception that GPUs only excel at highly regular, floating point intensive applications, but today’s GPUs are fully programmable parallel processors, offering superior memory bandwidth and latency hiding characteristics, and R&D efforts at NVIDIA and elsewhere have proved that they can be a perfect match even for branchy, integer-heavy bioinformatics applications. The caveat is that legacy applications need to be rethought for fine-grained parallelism.

Many CPU algorithms are designed to run on few cores and scale to a tiny number of threads. When the number of threads is measured in the thousands, rather than dozens—a fact that all applications inevitably must consider—applications must tackle fundamental problems related to load balancing, synchronization, and execution and memory divergence.

NVBIO does just that, providing both low-level primitives that can be used from either CPU/host or GPU/device threads, as well as novel, highly parallel high-level primitives designed to scale from the ground up. Continue reading


CUDA 7 Release Candidate Feature Overview: C++11, New Libraries, and More

It’s almost time for the next major release of the CUDA Toolkit, so I’m excited to tell you about the CUDA 7 Release Candidate, now available to all CUDA Registered Developers. The CUDA Toolkit version 7 expands the capabilities and improves the performance of the Tesla Accelerated Computing Platform and of accelerated computing on NVIDIA GPUs.

Recently NVIDIA released the CUDA Toolkit version 5.5 with support for the IBM POWER architecture. Starting with CUDA 7, all future CUDA Toolkit releases will support POWER CPUs.

CUDA 7 is a huge update to the CUDA platform; there are too many new features and improvements to describe in one blog post, so I’ll touch on some of the most significant ones today. Please refer to the CUDA 7 release notes and documentation for more information. We’ll be covering many of these features in greater detail in future Parallel Forall posts, so check back often!

Support for Powerful C++11 Features

C++11 is a major update to the popular C++ language standard. C++11 includes a long list of new features for simpler, more expressive C++ programming with fewer errors and higher performance. I think Bjarne Stroustrup, the creator of C++, put it best:

C++11 feels like a new language: The pieces just fit together better than they used to and I find a higher-level style of programming more natural than before and as efficient as ever.
Continue reading


CUDA Spotlight: GPU-Accelerated Agent-Based Simulation of Complex Systems

Paul-RichmondThis week’s Spotlight is on Dr. Paul Richmond, a Vice Chancellor’s Research Fellow at the University of Sheffield (a CUDA Research Center). Paul’s research interests relate to the simulation of complex systems and to parallel computer hardware.

NVIDIA: Paul, tell us about FLAME GPU.
Paul: Agent-Based Simulation is a powerful technique used to assess and predict group behavior from a number of simple interacting rules between communicating autonomous individuals (agents). Individuals typically represent some biological entity such as a molecule, cell or organism and can therefore be used to simulate systems at varying biological scales.

The Flexible Large-scale Agent Modelling Environment for the GPU (FLAME GPU) is a piece of software which enables high level descriptions communicating agents to be automatically translated to GPU hardware. With FLAME GPU, simulation performance is enormously increased over traditional agent-based modeling platforms and interactive visualization can easily be achieved. The GPU architecture and the underlying software algorithms are abstracted from users of the FLAME GPU software, ensuring accessibility to users in a wide range of domains and application areas.

NVIDIA: How does FLAME GPU leverage GPU computing?
Paul: Unlike other agent-based simulation frameworks, FLAME GPU is designed from the ground up with parallelism in mind. As such it is possible to ensure that agents and behavior are mapped to the GPU efficiently in a way which minimizes data transfer during simulation. Continue reading


CUDACasts Episode 16: Thrust Algorithms and Custom Operators

Continuing the Thrust mini-series (see Part 1), today’s episode of CUDACasts focuses on a few of the algorithms that make Thrust a flexible and powerful parallel programming library. You’ll also learn how to use functors, or C++ “function objects”, to customize how Thrust algorithms process data.

In the next CUDACast in this Thrust mini-series, we’ll take a look at how fancy iterators increase the flexibility Thrust has for expressing parallel algorithms in C++.

Continue reading


CUDACasts Episode 15: Introduction to Thrust

Whenever I hear about a developer interested in accelerating his or her C++ application on a GPU, I make sure to tell them about Thrust. Thrust is a parallel algorithms library loosely based on the C++ Standard Template Library. Thrust provides a number of building blocks, such as sort, scans, transforms, and reductions, to enable developers to quickly embrace the power of parallel computing.  In addition to targeting the massive parallelism of NVIDIA GPUs, Thrust supports multiple system back-ends such as OpenMP and Intel’s Threading Building Blocks. This means that it’s possible to compile your code for different parallel processors with a simple flick of a compiler switch.

For this first in a mini-series of screencasts about Thrust, we’ll write a simple sorting program and execute it on both a GPU and a multi-core CPU.  In upcoming episodes, we’ll explore more capabilities of Thrust which really show its flexibility and power. For more examples of using Thrust, read the post Expressive Algorithmic Programming with Thrust, and check out the Thrust Quick Start Guide.

Continue reading


CUDA Spotlight: GPU-Accelerated Genomics

This week’s Spotlight is on Dr. Knut Reinert. Knut is a professor at Freie Universität in Berlin, Germany, and chair of the Algorithms in Bioinformatics group in the Institute of Computer Science. Knut and his team focus on the development of novel algorithms and data structures for problems in the analysis of biomedical mass data. In particular, the group develops mathematical models for analyzing large genomic sequences and data derived from mass spectrometry experiments (for example, for detecting differential expression of proteins between normal and diseased samples). Previously, Knut was at Celera Genomics, where he worked on bioinformatics algorithms and software for the Human Genome Project, which assembled the very first human genome.

On Oct. 22, 2013, Knut will deliver a GTC Express Webinar presentation titled: Intro to SeqAn, an Open-Source C++ Template Library.

Knut Reinert, Freie Univ. Berlin
Knut Reinert, Freie Univ. Berlin

NVIDIA: Knut, tell us about the SeqAn library.
Knut: Before setting up the Algorithmic Bioinformatics group at Freie Universität, I had been working for years at a U.S. company – Celera Genomics in Maryland – where I worked on the assembly of both the Drosophila (fruit fly) and human genomes. A central part of these projects was the development of large software packages containing algorithms for assembly and genome analysis developed by the Informatics Research team at Celera.

Although successful, the endeavor clearly showed the lack of available implementations in sequence analysis, even for standard tasks. Implementations of much needed algorithmic components were either not available, or hard to access in third-party, monolithic software products.

With this experience in mind, and being educated at the Max-Planck Institute for Computer Science in Saarbrücken (the home of very successful software libraries like LEDA and CGAL) I put the development of such a software library high on my research agenda. Continue reading

Copperhead: Data Parallel Python

Programming environments like C and Fortran allow complete and unrestricted access to computing hardware, but often require programmers to understand the low-level details of the hardware they target. Although these efficiency-oriented systems are essential to every computing platform, many programmers prefer to use higher level programming environments like Python or Ruby, focused on productivity rather than absolute performance. Productivity-focused programmers solving large or intensive problems do need high performance, and many seek to exploit parallel computing, but without the costs of understanding low-level hardware details or programming directly to a particular machine.

Copperhead is a project that aims to enable productivity-focused programmers to take advantage of parallel computing, without explicitly coding to any particular machine. Copperhead programs use familiar Python primitives such as map and reduce, and they execute in parallel on both CUDA-enabled GPUs as well as multicore

Parallel Hello World: axpy

Let’s start with an example: below find Copperhead code for axpy, the “hello world” of parallel programs. (axpy is the type-generic form of saxpy. See Six Ways to SAXPY for more.)

from copperhead import *
import numpy as np

def axpy(a, x, y):
    return [a * xi + yi for xi, yi in zip(x, y)]

n = 1000000
a = 2.0
x = np.random.rand(n)
y = np.random.rand(n)
with places.gpu0:
    gpu_result = axpy(a, x, y)
with places.openmp:
    cpu_result = axpy(a, x, y)

Continue reading

Six Ways to SAXPY

This post is a GPU program chrestomathy. What’s a Chrestomathy, you ask?

In computer programming, a program chrestomathy is a collection of similar programs written in various programming languages, for the purpose of demonstrating differences in syntax, semantics and idioms for each language. [Wikipedia]

There are several good examples of program chrestomathies on the web, including Rosetta Code andNBabel, which demonstrates gravitational N-body simulation in multiple programming languages. In this post I demonstrate six ways to implement a simple SAXPY computation on the CUDA platform. Why is this interesting? Because it demonstrates the breadth of options you have today for programming NVIDIA GPUs, and it covers the three main approaches to GPU computing: GPU-accelerated libraries, GPU compiler directives, and GPU programming languages.

SAXPY stands for “Single-Precision A·X Plus Y”.  It is a function in the standard Basic Linear Algebra Subroutines (BLAS)library. SAXPY is a combination of scalar multiplication and vector addition, and it’s very simple: it takes as input two vectors of 32-bit floats X and Y with N elements each, and a scalar value A. It multiplies each element X[i] by A and adds the result to Y[i]. A simple C implementation looks like this.

void saxpy(int n, float a, float *x, float *y)
  for (int i = 0; i < n; ++i)
      y[i] = a*x[i] + y[i];

// Perform SAXPY on 1M elements
saxpy(1<<20, 2.0, x, y);

Continue reading