
CUDA Pro Tip: Optimized Filtering with Warp-Aggregated Atomics

In this post, I’ll introduce warp-aggregated atomics, a useful technique to improve performance when many threads atomically add to a single counter. In warp aggregation, the threads of a warp first compute a total increment among themselves, and then elect a single thread to atomically add the increment to a global counter. This aggregation reduces the number of atomics performed by up to the number of threads in a warp (up to 32x on current GPUs), and can dramatically improve performance. Moreover, in many typical cases, you can implement warp aggregation as a drop-in replacement for standard atomic operations, so it is useful as a simple way to improve performance of complex applications.
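To make the idea concrete, here is a minimal sketch of a warp-aggregated increment built from warp intrinsics (the CUDA 9-style *_sync variants). The helper name atomicAggInc and the int counter type are assumptions for illustration, not code taken from this post.

// Warp-aggregated increment: one atomicAdd per warp instead of one per thread.
__device__ int atomicAggInc(int *counter) {
  unsigned mask = __activemask();            // lanes of the warp that are active here
  int lane = threadIdx.x & 31;               // this thread's lane index
  int leader = __ffs(mask) - 1;              // lowest active lane acts as the leader
  int base;
  if (lane == leader)
    base = atomicAdd(counter, __popc(mask)); // leader adds the warp's total increment
  base = __shfl_sync(mask, base, leader);    // broadcast the old counter value
  // each active lane claims a distinct slot: its rank among the lower active lanes
  return base + __popc(mask & ((1u << lane) - 1));
}

Each thread gets back a unique index, just as it would from a plain atomicAdd(counter, 1), which is what makes the technique usable as a drop-in replacement.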

Problem: Filtering by a Predicate

Let’s consider the following problem, which we call filtering: we have a source array, src, containing n elements, and a predicate, and we need to copy all elements of src satisfying the predicate into the destination array, dst. For the sake of simplicity, assume that dst has length of at least n and that the order of elements in the dst array does not matter. For our example, we assume that the array elements are integers, and the predicate is true if and only if the element is positive. Here is a sample CPU implementation of filtering.

int filter(int *dst, const int *src, int n) {
  int nres = 0;
  for (int i = 0; i < n; i++)
    if (src[i] > 0)
      dst[nres++] = src[i];
  // return the number of elements copied
  return nres;
}
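A straightforward CUDA port of this loop keeps a single counter in global memory and lets every thread that passes the predicate perform its own atomicAdd. This is the baseline that warp aggregation improves; the kernel below is an illustrative sketch, with names and launch details assumed rather than taken from the post.

__global__ void filter_naive(int *dst, int *nres, const int *src, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n && src[i] > 0)
    dst[atomicAdd(nres, 1)] = src[i];  // one contended atomic per passing element
}

With warp aggregation as described above, the body becomes dst[atomicAggInc(nres)] = src[i];, cutting the number of atomics on nres by up to the warp width.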

Filtering, also known as stream compaction, is a common operation, and it is a part of the standard libraries of many programming languages, where it goes under a variety of names, including grep, copy_if, select, FindAll and so on. It is also very often implemented simply as a loop, as it may be very tightly integrated with the surrounding code.
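As a hedged aside, in CUDA C++ the same operation is also available off the shelf through Thrust; the sketch below (the function and functor names are mine) uses thrust::copy_if to compact a device_vector on the GPU.

#include <thrust/copy.h>
#include <thrust/device_vector.h>

struct is_positive {
  __host__ __device__ bool operator()(int x) const { return x > 0; }
};

int thrust_filter(thrust::device_vector<int> &dst,
                  const thrust::device_vector<int> &src) {
  // copy_if writes the passing elements to dst and returns an iterator
  // one past the last element written, from which we recover the count.
  auto end = thrust::copy_if(src.begin(), src.end(), dst.begin(), is_positive());
  return static_cast<int>(end - dst.begin());
}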


Maxwell: The Most Advanced CUDA GPU Ever Made

Today NVIDIA introduced the new GM204 GPU, based on the Maxwell architecture. GM204 is the first GPU based on second-generation Maxwell, the full realization of the Maxwell architecture. The GeForce GTX 980 and 970 GPUs introduced today are the most advanced gaming and graphics GPUs ever made. But of course they also make fantastic CUDA development GPUs, with full support for CUDA 6.5 and all of the latest features of the CUDA platform, including Unified Memory and Dynamic Parallelism.

GM204's 16 SMs make it over 3 times faster than the first-generation GM107 GPU that I introduced earlier this year on Parallel Forall, and additional architectural improvements help GM204 pack an even bigger punch.

SMM: The Maxwell Multiprocessor

Figure 1: Maxwell’s Multiprocessor, SMM.

As I discussed in my earlier Maxwell post, the heart of Maxwell’s power-efficient performance is its Streaming Multiprocessor, known as SMM. Maxwell’s new datapath organization and improved instruction scheduler provide more than 40% higher delivered performance per CUDA core, and overall twice the efficiency of Kepler GK104. The new SMM, shown in Figure 1, includes all of the architectural benefits of its first-generation Maxwell predecessor, including improvements to control logic partitioning, workload balancing, clock-gating granularity, instruction scheduling, number of instructions issued per clock cycle, and more.

SMM uses a quadrant-based design with four 32-core processing blocks each with a dedicated warp scheduler capable of dispatching two instructions per clock. Each SMM provides eight texture units, one polymorph engine (geometry processing for graphics), and dedicated register file and shared memory.
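As a side note, several of these per-SM resources are visible to CUDA programs through the runtime's device properties; the hedged sketch below simply queries device 0 with cudaGetDeviceProperties and prints a few fields relevant to the SM configuration.

#include <cstdio>
#include <cuda_runtime.h>

int main() {
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);  // properties of device 0
  std::printf("%s: %d SMs, warp size %d, up to %d threads per SM\n",
              prop.name, prop.multiProcessorCount,
              prop.warpSize, prop.maxThreadsPerMultiProcessor);
  return 0;
}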


5 Things You Should Know About the New Maxwell GPU Architecture

[Be sure to check out Maxwell: The Most Advanced CUDA GPU Ever Made, a newer post about the second-generation Maxwell GPU architecture.]

The introduction this week of NVIDIA’s first-generation “Maxwell” GPUs is a very exciting moment for GPU computing. These first Maxwell products, such as the GeForce GTX 750 Ti, are based on the GM107 GPU and are designed for use in low-power environments such as notebooks and small form factor computers. What is exciting about this announcement for HPC and other GPU computing developers is the great leap in energy efficiency that Maxwell provides: nearly twice that of the Kepler GPU architecture.

This post will tell you five things that you need to know about Maxwell as a GPU computing programmer, including high-level benefits of the architecture, specifics of the new Maxwell multiprocessor, guidance on tuning, and pointers to more resources.

Figure: Maxwell vs. Kepler power efficiency.

1. The Heart of Maxwell: More Efficient Multiprocessors

Maxwell introduces an all-new design for the Streaming Multiprocessor (SM) that dramatically improves power efficiency. Although the Kepler SMX design was extremely efficient for its generation, through its development NVIDIA’s GPU architects saw an opportunity for another big leap forward in architectural efficiency; the Maxwell SM is the realization of that vision. Improvements to control logic partitioning, workload balancing, clock-gating granularity, instruction scheduling, number of instructions issued per clock cycle, and many other enhancements allow the Maxwell SM (also called “SMM”) to far exceed Kepler SMX efficiency. The new Maxwell SM architecture enabled us to increase the number of SMs to five in GM107, compared to two in GK107, with only a 25% increase in die area.

Improved Instruction Scheduling

The number of CUDA cores per SM has been reduced to a power of two; however, with Maxwell’s improved execution efficiency, performance per SM is usually within 10% of Kepler performance, and the improved area efficiency of the SM means that CUDA cores per GPU will be substantially higher than in comparable Fermi or Kepler chips. The Maxwell SM retains the same number of instruction issue slots per clock and reduces arithmetic latencies compared to the Kepler design.
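For kernel tuning on SMM, one simple, hedged starting point (the kernel here is a stand-in, not code from this post) is to let the CUDA 6.5 occupancy API propose a block size instead of hard-coding one.

__global__ void scale(float *data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= 2.0f;  // placeholder work
}

void launch_scale(float *data, int n) {
  int minGridSize = 0, blockSize = 0;
  // Suggests a block size that maximizes theoretical occupancy for this kernel.
  cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, scale, 0, 0);
  int gridSize = (n + blockSize - 1) / blockSize;
  scale<<<gridSize, blockSize>>>(data, n);
}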