5 Things You Should Know About the New Maxwell GPU Architecture

The introduction this week of NVIDIA’s first-generation “Maxwell” GPUs is a very exciting moment for GPU computing. These first Maxwell products, such as the GeForce GTX 750 Ti, are based on the GM107 GPU and are designed for use in low-power environments such as notebooks and small form factor computers. What is exciting about this announcement for HPC and other GPU computing developers is the great leap in energy efficiency that Maxwell provides: nearly twice that of the Kepler GPU architecture. That makes Maxwell a great architecture for future NVIDIA Tesla products, so stay tuned.

This post will tell you five things that you need to know about Maxwell as a GPU computing programmer, including high-level benefits of the architecture, specifics of the new Maxwell multiprocessor, guidance on tuning and pointers to more resources.


1. The Heart of Maxwell: More Efficient Multiprocessors

Maxwell introduces an all-new design for the Streaming Multiprocessor (SM) that dramatically improves power efficiency. Although the Kepler SMX design was extremely efficient for its generation, through its development NVIDIA’s GPU architects saw an opportunity for another big leap forward in architectural efficiency; the Maxwell SM is the realization of that vision. Improvements to control logic partitioning, workload balancing, clock-gating granularity, instruction scheduling, number of instructions issued per clock cycle, and many other enhancements allow the Maxwell SM (also called “SMM”) to far exceed Kepler SMX efficiency. The new Maxwell SM architecture enabled us to increase the number of SMs to five in GM107, compared to two in GK107, with only a 25% increase in die area.

Improved Instruction Scheduling

The number of CUDA Cores per SM has been reduced to a power of two, however with Maxwell’s improved execution efficiency, performance per SM is usually within 10% of Kepler performance, and the improved area efficiency of the SM means CUDA cores per GPU will be substantially higher versus comparable Fermi or Kepler chips. The Maxwell SM retains the same number of instruction issue slots per clock and reduces arithmetic latencies compared to the Kepler design. Continue reading


Faster Parallel Reductions on Kepler

Parallel reduction is a common building block for many parallel algorithms. A presentation from 2007 by Mark Harris provided a detailed strategy for implementing parallel reductions on GPUs, but this 6-year old document bears updating. In this post I will show you some features of the Kepler GPU architecture which make reductions even faster: the shuffle (SHFL) instruction and fast device memory atomic operations.

The source code for this post is available on Github.

Shuffle On Down

Efficient parallel reductions exchange data between threads within the same thread block. On earlier hardware this meant using shared memory, which involves writing data to shared memory, synchronizing, and then reading the data back from shared memory. Kepler’s shuffle instruction (SHFL) enables a thread to directly read a register from another thread in the same warp (32 threads). This allows threads in a warp to collectively exchange or broadcast data. As described in the post “Do the Kepler Shuffle”, there are four shuffle intrinsics: __shlf(), __shfl_down(), __shfl_up(), and __shfl_xor(), but in this post we only use __shfl_down(), defined as follows: (You can find a complete description of the other shuffle functions in the CUDA C Programming Guide.)

int __shfl_down(int var, unsigned int delta, int width=warpSize);

__shfl_down() calculates a source lane ID by adding delta to the caller’s lane ID (the lane ID is a thread’s index within its warp, from 0 to 31). The value of var held by the resulting lane ID is returned: this has the effect of shifting var down the warp by delta lanes. If the source lane ID is out of range or the source thread has exited, the calling thread’s own var is returned. The ID number of the source lane will not wrap around the value of width and so the upper delta lanes will remain unchanged. Note that width must be one of (2, 4, 8, 16, 32). For brevity, the diagrams that follow show only 8 threads in a warp even though the warp size of all current CUDA GPUs is 32.

As an example, Figure 1 shows the effect of the following two lines of code, where we can see that values are shifted down by 2 threads.

int i = threadIdx.x % 32;
int j = __shfl_down(i, 2, 8);
Figure 1: The shuffle down instruction.
Figure 1: The shuffle down instruction.

There are three main advantages to using shuffle instead of shared memory: Continue reading

In the Trenches at GTC: Inside Kepler

By Tomasz Bednarz (GTC 2012 Guest Blogger)

I had been eagerly anticipating the “Inside Kepler” session since GTC 2012 opened. On Wednesday arvo, May 16th, two NVIDIA blokes, Stephen Jones (CUDA Model Lead) and Lars Nyland (Senior Architect), warmly introduced the new Kepler GK110 GPU, proudly announcing that it is all about “performance, efficiency and programmability.”

Lars and Stephen revealed that GK110 has 7.1B transistors (wow!), 15 SMX units, >1 TFLOP fp64, 1.5MB L2 cache, 384-bit GDDR5 and PCI-Express Gen 3. NVIDIA looked high and low to reduce power consumption and increase performance, and after many months of tough design and testing in their labs, the NVIDIA mates emerged with an awesome GPU that greatly exceeds Fermi’s compute horsepower, while consuming less power and generating less heat. Corker of a GPU! Continue reading