CUDA Pro Tip: Fast and Robust Computation of Givens Rotations

A Givens rotation [1] represents a rotation in a plane represented by a matrix of the form

G(i, j, \theta) =  \begin{bmatrix} 1 & \cdots & 0 & \cdots & 0 & \cdots & 0 \\  \vdots & \ddots & \vdots & & \vdots & & \vdots \\  0 & \cdots & c & \cdots & -s & \cdots & 0 \\  \vdots & & \vdots & \ddots & \vdots & & \vdots \\  0 & \cdots & s & \cdots & c & \cdots & 0 \\  \vdots & & \vdots & & \vdots & \ddots & \vdots \\  0 & \cdots & 0 & \cdots & 0 & \cdots & 1  \end{bmatrix},

where the intersections of the ith and jth columns contain the values c=cos \theta and s=sin \theta. Multiplying a vector x by a Givens rotation matrix G(i, j, \theta) represents a rotation of the vector x in the (i, j) plane by \theta radians.

According to Wikipedia, the main use of Givens rotations in numerical linear algebra is to introduce zeros in vectors or matrices. Importantly, that means Givens rotations can be used to compute the QR decomposition of a matrix. An important advantage over Householder transformations is that Givens rotations are easy to parallelize.

Computing the rotation factors s and c is critical to both the performance and the accuracy of Givens rotation. For accuracy and robustness, one could use division by the hypotenuse (hypot()).

double t = hypot(x, y);
double c = x / t;
double s = y / t;

For performance, however, it would be better to use multiplication by the reciprocal square root. But this may lose robustness due to intermediate underflow or overflow in the computation of the sum of squares.

double t = rsqrt (x*x + y*y);
double c = x * t; 
double s = y * t;

Ideally, though, we want robustness, accuracy, and performance all at once. For this purpose, CUDA 6 introduces a new reciprocal hypotenuse function, rhypot(x,y), which computes 1 / \sqrt{x^2+y^2}. rhypot() is available in both single and double-precision versions.

Using rhypot, you can have the best of both worlds. You can rapidly compute Givens rotations without intermediate overflow or underflow, using the following code.

double t = rhypot(x, y);
double c = x * t;
double s = y * t;

rhypot() can also be used for fast and robust 2D vector normalization, as in the following code which uses a hypothetical vector class.

double t = rhypot(vec.x, vec.y);
vec.x *= t;
vec.y *= t;

Future releases of CUDA will further improve the accuracy and performance of rhypot(). Norbert Juffa contributed to this post.

[1]: Wallace Givens, Computation of Plane Unitary Rotations Transforming a General Matrix to Triangular Form, J. Soc. Indust. Appl. Math., Vol. 6, No. 1, March 1958, pp. 26-50


About Mark Harris

Mark is Chief Technologist for GPU Computing Software at NVIDIA. Mark has fifteen years of experience developing software for GPUs, ranging from graphics and games, to physically-based simulation, to parallel algorithms and high-performance computing. Mark has been using GPUs for general-purpose computing since before they even supported floating point arithmetic. While a Ph.D. student at UNC he recognized this nascent trend and coined a name for it: GPGPU (General-Purpose computing on Graphics Processing Units), and started to provide a forum for those working in the field to share and discuss their work. Follow @harrism on Twitter
  • Doug Enright

    One of the first investigations of the performance of a Fast Givens Rotation algorithm to calculate a QR factorization can be found in, “Benchmarking the NVIDIA 8800GTX with the CUDA Development Platform” (McGraw-Herdeg, et al, 2007 MIT/Lincoln Labs HPEC Workshop). QR factorization is part of the HPEC Challenge Benchmark Suite. Having a reciprocal hypotenuse function will be of great benefit to a lot of signal and image processing applications. The HPEC abstract can be found here:

  • Opperdienaar

    Cool stuff. Any plans to make a complex given rotation implementation?