Optimizing the High Performance Conjugate Gradient Benchmark on GPUs

[This post was co-written by Everett Phillips and Massimiliano Fatica.]

The High Performance Conjugate Gradient Benchmark (HPCG) is a new benchmark intended to complement the High-Performance Linpack (HPL) benchmark currently used to rank supercomputers in the TOP500 list. This new benchmark solves a large sparse linear system using a multigrid preconditioned conjugate gradient (PCG) algorithm. The PCG algorithm better represents the computational and communication patterns prevalent in modern application workloads which rely more heavily on memory system and network performance than HPL.

GPU-accelerated supercomputers have proven to be very effective for accelerating compute-intensive applications like HPL, especially in terms of power efficiency. Obtaining good acceleration on the GPU for the HPCG benchmark is more challenging due to the limited parallelism and memory access patterns of the computational kernels involved. In this post we present the steps taken to obtain high performance of the HPCG benchmark on GPU-accelerated clusters, and demonstrate that our GPU-accelerated HPCG results are the fastest per-processor results reported to date.

The PCG Algorithm

The PCG algorithm solves a sparse linear system $\mathbf{A}\mathbf{x} = \mathbf{b}$ given an initial guess $\mathbf{x}_0$. The particular sparse linear system used in HPCG is a simple elliptic partial differential equation discretized with a 27-point stencil on a regular 3D grid. Rows in the sparse matrix $\mathbf{A}$ represent points in the grid. Each processor is responsible for a subset of rows corresponding to a local domain of size $N_{x} \times N_{y} \times N_{z}$, chosen by the user in the setup file. The number of processors is automatically detected at runtime, and decomposed into $P_{x} \times P_{y} \times P_{z}$, where $P=P_{x}P_{y}P_{z}$ is the total number of processors. This creates a global domain $G_{x} \times G_{y} \times G_{z}$, where $G_{x} = P_{x}N_{x}, G_{y} = P_{y}N_{y},$ and $G_{z} = P_{z}N_{z}$.  Although the matrix has a simple structure, it is only intended to facilitate the problem setup and validation of the solution. Implementations may not use assumptions about the matrix structure to optimize the solver; they must treat the matrix as a general sparse matrix.

Following is pseudocode for the PCG algorithm. Continue reading

CUDA Pro Tip: Fast and Robust Computation of Givens Rotations

A Givens rotation [1] represents a rotation in a plane represented by a matrix of the form

$G(i, j, \theta) = \begin{bmatrix} 1 & \cdots & 0 & \cdots & 0 & \cdots & 0 \\ \vdots & \ddots & \vdots & & \vdots & & \vdots \\ 0 & \cdots & c & \cdots & -s & \cdots & 0 \\ \vdots & & \vdots & \ddots & \vdots & & \vdots \\ 0 & \cdots & s & \cdots & c & \cdots & 0 \\ \vdots & & \vdots & & \vdots & \ddots & \vdots \\ 0 & \cdots & 0 & \cdots & 0 & \cdots & 1 \end{bmatrix}$,

where the intersections of the $i$th and $j$th columns contain the values $c=cos \theta$ and $s=sin \theta$. Multiplying a vector $x$ by a Givens rotation matrix $G(i, j, \theta)$ represents a rotation of the vector $x$ in the $(i, j)$ plane by $\theta$ radians.

According to Wikipedia, the main use of Givens rotations in numerical linear algebra is to introduce zeros in vectors or matrices. Importantly, that means Givens rotations can be used to compute the QR decomposition of a matrix. An important advantage over Householder transformations is that Givens rotations are easy to parallelize. Continue reading