CUDA Pro Tip: Fast and Robust Computation of Givens Rotations

A Givens rotation [1] represents a rotation in a plane represented by a matrix of the form

$G(i, j, \theta) = \begin{bmatrix} 1 & \cdots & 0 & \cdots & 0 & \cdots & 0 \\ \vdots & \ddots & \vdots & & \vdots & & \vdots \\ 0 & \cdots & c & \cdots & -s & \cdots & 0 \\ \vdots & & \vdots & \ddots & \vdots & & \vdots \\ 0 & \cdots & s & \cdots & c & \cdots & 0 \\ \vdots & & \vdots & & \vdots & \ddots & \vdots \\ 0 & \cdots & 0 & \cdots & 0 & \cdots & 1 \end{bmatrix}$ ,

where the intersections of the $i$ th and $j$ th columns contain the values $c=cos \theta$ and $s=sin \theta$ . Multiplying a vector $x$ by a Givens rotation matrix $G(i, j, \theta)$ represents a rotation of the vector $x$ in the $(i, j)$ plane by $\theta$ radians.

According to Wikipedia, the main use of Givens rotations in numerical linear algebra is to introduce zeros in vectors or matrices. Importantly, that means Givens rotations can be used to compute the QR decomposition of a matrix. An important advantage over Householder transformations is that Givens rotations are easy to parallelize.

Computing the rotation factors $s$ and $c$ is critical to both the performance and the accuracy of Givens rotation. For accuracy and robustness, one could use division by the hypotenuse (hypot()).

double t = hypot(x, y);
double c = x / t;
double s = y / t;

For performance, however, it would be better to use multiplication by the reciprocal square root. But this may lose robustness due to intermediate underflow or overflow in the computation of the sum of squares.

double t = rsqrt (x*x + y*y);
double c = x * t; 
double s = y * t;

Ideally, though, we want robustness, accuracy, and performance all at once. For this purpose, CUDA 6 introduces a new reciprocal hypotenuse function, rhypot(x,y), which computes $1 / \sqrt{x^2+y^2}$ . rhypot() is available in both single and double-precision versions.

Using rhypot, you can have the best of both worlds. You can rapidly compute Givens rotations without intermediate overflow or underflow, using the following code.

double t = rhypot(x, y);
double c = x * t;
double s = y * t;

rhypot() can also be used for fast and robust 2D vector normalization, as in the following code which uses a hypothetical vector class.

double t = rhypot(vec.x, vec.y);
vec.x *= t;
vec.y *= t;

Future releases of CUDA will further improve the accuracy and performance of rhypot(). Norbert Juffa contributed to this post.

[1]: Wallace Givens, Computation of Plane Unitary Rotations Transforming a General Matrix to Triangular Form, J. Soc. Indust. Appl. Math., Vol. 6, No. 1, March 1958, pp. 26-50