About Mark Harris

Mark is Chief Technologist for GPU Computing Software at NVIDIA. Mark has fifteen years of experience developing software for GPUs, ranging from graphics and games, to physically-based simulation, to parallel algorithms and high-performance computing. Mark has been using GPUs for general-purpose computing since before they even supported floating point arithmetic. While a Ph.D. student at UNC he recognized this nascent trend and coined a name for it: GPGPU (General-Purpose computing on Graphics Processing Units), and started GPGPU.org to provide a forum for those working in the field to share and discuss their work. Follow @harrism on Twitter
cloud gears

Accelerating Hyperscale Datacenter Applications with Tesla GPUs

The internet has changed how people consume media. Rather than just watching television and movies, the combination of ubiquitous mobile devices, massive computation, and available internet bandwidth has led to an explosion in user-created content: users are recreating the internet, producing exabytes of content every day.

Exabytes of content produced daily

Periscope, a mobile application that lets users broadcast video to followers has 10 million users who broadcast over 40 years of video per day. Twitch, a popular game broadcasting service, revealed last month that 1.7 million users have live-streamed 7.5 billion minutes of content. China’s biggest search engine, Baidu, processes 6 billion queries per day, and 10% of those queries use speech. About 300 hours of video is uploaded to YouTube every minute. And just last week, Mark Zuckerberg shared that Facebook users view 8 billion videos every day—a number that has grown by a factor of 8 in about a year.

This massive scale of content requires massive amounts of processing, and due to the volume of media content involved, datacenter workloads are changing. Increasing resources are spent on video and image processing, resizing, transcoding, filtering and enhancement. Likewise, large-scale machine learning and deep learning techniques apply trained models to what’s known as “inference”, which applies trained models to tasks such as image classification, object detection, machine translation, and speech recognition.

new_platform_new_workloads Continue reading


Performance Portability from GPUs to CPUs with OpenACC

OpenACC gives scientists and researchers a simple and powerful way to accelerate scientific computing applications incrementally. The OpenACC API describes a collection of compiler directives to specify loops and regions of code in standard C, C++, and Fortran to be offloaded from a host CPU to an attached accelerator. OpenACC is designed for portability across operating systems, host CPUs, and a wide range of accelerators, including APUs, GPUs, and many-core coprocessors.

Register for a free online OpenACC training course!

And starting today, with the PGI Compiler 15.10 release, OpenACC enables performance portability between accelerators and multicore CPUs. The new PGI Fortran, C and C++ compilers for the first time allow OpenACC-enabled source code to be compiled for parallel execution on either a multicore CPU or a GPU accelerator. This capability provides tremendous flexibility for programmers, enabling applications to take advantage of multiple system architectures with a single version of the source code.PGI 15.10

“Our goal is to enable HPC developers to easily port applications across all major CPU and accelerator platforms with uniformly high performance using a common source code base,” said Douglas Miles, director of PGI Compilers & Tools at NVIDIA. “This capability will be particularly important in the race towards exascale computing in which there will be a variety of system architectures requiring a more flexible application programming approach.”

OpenACC Portable PerformanceAs the chart above shows, performance on multicore CPUs for HPC apps using MPI + OpenACC is equivalent to MPI + OpenMP code. Compiling and running the same code on a Tesla K80 GPU can provide large speedups. Continue reading

CUDA 7.5

Simple, Portable Parallel C++ with Hemi 2 and CUDA 7.5

The last two releases of CUDA have added support for the powerful new features of C++. In the post The Power of C++11 in CUDA 7 I discussed the importance of C++11 for parallel programming on GPUs, and in the post New Features in CUDA 7.5 I introduced a new experimental feature in the NVCC CUDA C++ compiler: support for GPU Lambda expressions. Lambda expressions, introduced in C++11, provide concise syntax for anonymous functions (and closures) that can be defined in line with their use, can be passed as arguments, and can capture variables from surrounding scopes. GPU Lambdas bring that power and convenience to writing GPU functions, letting you launch parallel work on the GPU almost as easily as writing a for loop.

In this post, I want to show you how modern C++ features combine to enable a higher-level, more portable approach to parallel programming for GPUs. To do so, I’ll show you Hemi 2, the second release of a simple open-source C++ library that I developed to explore approaches to portable parallel C++ programming. I have written before about Hemi on Parallel Forall, but Hemi 2 is easier to use, more portable, and more powerful.

hemi-logo-blogIntroducing Hemi 2

Hemi simplifies writing portable CUDA C/C++ code. With Hemi,

  • you can write parallel kernels like you write for loops—in line in your CPU code—and run them on your GPU;
  • you can launch C++ Lambda functions as GPU kernels;
  • you can easily write code that compiles and runs either on the CPU or GPU;
  • kernel launch configuration is automatic: details like thread block size and grid size are optimization details, rather than requirements.

With Hemi, parallel code for the GPU can be as simple as the parallel_for loop in the following code, which can also be compiled and run on the CPU.

void saxpy(int n, float a, const float *x, float *y)
  hemi::parallel_for(0, n, [=] HEMI_LAMBDA (int i) {
    y[i] = a * x[i] + y[i];

Hemi is BSD-licensed, open-source software, available on Github. Continue reading

CUDA 7.5

New Features in CUDA 7.5

Today I’m happy to announce that the CUDA Toolkit 7.5 Release Candidate is now available. The CUDA Toolkit 7.5 adds support for FP16 storage for up to 2x larger data sets and reduced memory bandwidth, cuSPARSE GEMVI routines, instruction-level profiling and more. Read on for full details.

16-bit Floating Point (FP16) Data

CUDA 7.5 expands support for 16-bit floating point (FP16) data storage and arithmetic, adding new half and half2 datatypes and intrinsic functions for operating on them. 16-bit “half-precision” floating point types are useful in applications that can process larger datasets or gain performance by choosing to store and operate on lower-precision data. Some large neural network models, for example, may be constrained by available GPU memory; and some signal processing kernels (such as FFTs) are bound by memory bandwidth.

Many applications can benefit by storing data in half precision, and processing it in 32-bit (single) precision. At GTC 2015 in March, NVIDIA CEO Jen-Hsun Huang announced that future Pascal architecture GPUs will include full support for such “mixed precision” computation, with FP16 (half) computation at higher throughput than FP32 (single) or FP64 (double) .

With CUDA 7.5, applications can benefit by storing up to 2x larger models in GPU memory. Applications that are bottlenecked by memory bandwidth may get up to 2x speedup. And applications on Tegra X1 GPUs bottlenecked by FP32 computation may benefit from 2x faster computation on half2 data.

CUDA 7.5 provides 3 main FP16 features: Continue reading


GPU Pro Tip: Fast Great-Circle Distance Calculation in CUDA C++

This post demonstrates the practical utility of CUDA’s sinpi() and cospi() functions in the context of distance calculations on earth. With the advent of location-aware and geospatial applications and geographical information systems (GIS), these distance computations have become commonplace.

A great circle divides a sphere into two hemispheres.
A great circle divides a sphere into two hemispheres. Image: Jhbdel at en.wikipedia [CC BY-SA 3.0], via Wikimedia Commons
Wikipedia defines a great circle as

A great circle, also known as an orthodrome or Riemannian circle, of a sphere is the intersection of the sphere and a plane which passes through the center point of the sphere.

For almost any pair of points on the surface of a sphere, the shortest (surface) distance between these points is the path along the great circle between them. If you have ever flown from Europe to the west coast of North America and wondered why you passed over Greenland, your flight most likely followed a great circle path in order to conserve fuel. Continue reading


GPU Pro Tip: Lerp Faster in C++

Linear interpolation is a simple and fundamental numerical calculation prevalent in many fields. It’s so common in computer graphics that programmers often use the verb “lerp” to refer to linear interpolation, a function that’s built into all modern graphics hardware (often in multiple hardware units).

Linear Interpolation
Linear Interpolation (from Wikipedia)

You can enable linear interpolation (also known as linear filtering) on texture fetches in CUDA kernels. This hardware filtering uses a low-precision interpolant, so for this and other reasons it’s common to lerp in software.

The standard way to lerp is:

(1-t)*v0 + t*v1

Here’s a generic host/device function that performs a lerp:

template <typename T>
__host__ __device__
inline T lerp(T v0, T v1, T t) {
    return (1-t)*v0 + t*v1;

But we can do better. Continue reading


C++11 in CUDA: Variadic Templates

CUDA 7 adds C++11 feature support to nvcc, the CUDA C++ compiler. This means that you can use C++11 features not only in your host code compiled with nvcc, but also in device code. In my post “The Power of C++11 in CUDA 7” I covered some of the major new features of C++11, such as lambda functions, range-based for loops, and automatic type deduction (auto). In this post, I’ll cover variadic templates.

There are times when you need to write functions that take a variable number of arguments: variadic functions. To do this in a typesafe manner for polymorphic functions, you really need to take a variable number of types in a template. Before C++11, the only way to write variadic functions was with the ellipsis (...) syntax and the va_* facilities. These facilities did not enable type safety and can be difficult to use.

As an example, let’s say we want to abstract the launching of GPU kernels. In my case, I want to provide simpler launch semantics in the Hemi library. There are many cases where you don’t care to specify the number and size of thread blocks—you just want to run a kernel with “enough” threads to fully utilize the GPU, or to cover your data size. In that case we can let the library decide how to launch the kernel, simplifying our code. But to launch arbitrary kernels, we have to support arbitrary type signatures. Well, we can do that like this:

template <typename... Arguments>
void cudaLaunch(const ExecutionPolicy &p, 
                Arguments... args);

Here, Arguments... is a “type template parameter pack”. We can use it to refer to the type signature of our kernel function pointer f, and to the arguments of cudaLaunch. To do the same thing before C++11 (and CUDA 7) required providing multiple implementations of cudaLaunch, one for each number of arguments we wanted to support. That meant you had to limit the maximum number of arguments allowed, as well as the amount of code you had to maintain. In my experience this was prone to bugs. Here’s the implementation of cudaLaunch. Continue reading


The Power of C++11 in CUDA 7

Today I’m excited to announce the official release of CUDA 7, the latest release of the popular CUDA Toolkit. Download the CUDA Toolkit version 7 now from CUDA Zone!

LambdaCUDA 7 has a huge number of improvements and new features, including C++11 support, the new cuSOLVER library, and support for Runtime Compilation. In a previous post I told you about the features of CUDA 7, so I won’t repeat myself here. Instead, I wanted to take a deeper look at C++11 support in device code.

CUDA 7 adds C++11 feature support to nvcc, the CUDA C++ compiler. This means that you can use C++11 features not only in your host code compiled with nvcc, but also in device code. New C++ language features include auto, lambda functions, variadic templates, static_assert, rvalue references, range-based for loops, and more. To enable C++11 support, pass the flag --std=c++11 to nvcc (this option is not required for Microsoft Visual Studio).

In my earlier CUDA 7 feature overview post, I presented a small example to show some C++11 features. Let’s dive into a somewhat expanded example to show the power of C++11 for CUDA programmers. This example will proceed top-down, covering a couple of layers of abstraction that allow us to write concise, reusable C++ code for the GPU, all enabled by C++11. The complete example is available on Github.

Let’s say we have a very specific (albeit contrived) goal: count the number of characters from a certain set within a text. (In parallel, of course!) Here’s a simple CUDA C++11 kernel that abstracts the mechanics of this a bit.

void xyzw_frequency(int *count, char *text, int n)
    const char letters[] { 'x','y','z','w' };

    count_if(count, text, n, [&](char c) {
        for (const auto x : letters) 
            if (c == x) return true;
        return false;

Continue reading


GPU Pro Tip: CUDA 7 Streams Simplify Concurrency

Heterogeneous computing is about efficiently using all processors in the system, including CPUs and GPUs. To do this, applications must execute functions concurrently on multiple processors. CUDA Applications manage concurrency by executing asynchronous commands in streams, sequences of commands that execute in order. Different streams may execute their commands concurrently or out of order with respect to each other. [See the post How to Overlap Data Transfers in CUDA C/C++ for an example]

When you execute asynchronous CUDA commands without specifying a stream, the runtime uses the default stream. Before CUDA 7, the default stream is a special stream which implicitly synchronizes with all other streams on the device.

CUDA 7 introduces a ton of powerful new functionality, including a new option to use an independent default stream for every host thread, which avoids the serialization of the legacy default stream. In this post I’ll show you how this can simplify achieving concurrency between kernels and data copies in CUDA programs.
Continue reading


CUDA 7 Release Candidate Feature Overview: C++11, New Libraries, and More

It’s almost time for the next major release of the CUDA Toolkit, so I’m excited to tell you about the CUDA 7 Release Candidate, now available to all CUDA Registered Developers. The CUDA Toolkit version 7 expands the capabilities and improves the performance of the Tesla Accelerated Computing Platform and of accelerated computing on NVIDIA GPUs.

Recently NVIDIA released the CUDA Toolkit version 5.5 with support for the IBM POWER architecture. Starting with CUDA 7, all future CUDA Toolkit releases will support POWER CPUs.

CUDA 7 is a huge update to the CUDA platform; there are too many new features and improvements to describe in one blog post, so I’ll touch on some of the most significant ones today. Please refer to the CUDA 7 release notes and documentation for more information. We’ll be covering many of these features in greater detail in future Parallel Forall posts, so check back often!

Support for Powerful C++11 Features

C++11 is a major update to the popular C++ language standard. C++11 includes a long list of new features for simpler, more expressive C++ programming with fewer errors and higher performance. I think Bjarne Stroustrup, the creator of C++, put it best:

C++11 feels like a new language: The pieces just fit together better than they used to and I find a higher-level style of programming more natural than before and as efficient as ever.
Continue reading