CUDA Pro Tip: Flush Denormals with Confidence

I want to keep this post fairly brief, so I will only give minimal background on floating point numbers. If you need a refresher on floating point representation, I recommend starting with the Wikipedia entry on floating point, and for more detail about NVIDIA GPU floating point, check out this excellent white paper. The Wikipedia entry on denormal numbers is a good start for this post, so I’ll begin by paraphrasing it.

What’s a denormal?

Normal floating point values have no leading zeros in the mantissa (or significand). The mantissa is normalized and any leading zeros are moved into the exponent. Subnormal numbers (or denormal numbers) are floating point numbers where this normalized representation would result in an exponent that is too small (not representable).  So unlike normal floating point numbers, subnormal numbers have leading zeros in the mantissa. Doing this loses significant digits, but not as much as if the mantissa is flushed to zero on underflow. This allows what is known as “gradual underflow” when a result is very small, and helps avoid catastrophic division-by-zero errors.

Denormal numbers can incur extra computational cost. The Wikipedia entry explains that some platforms implement denormal numbers in software, while others handle them in hardware. On NVIDIA GPUs starting with the Fermi architecture (Compute Capability 2.0 and higher), denormal numbers are handled in hardware for operations that go through the Fused Multiply-Add pipeline (including adds and multiplies), and so these operations don’t carry performance overhead. But multi-instruction sequences such as square root and (notably for my example later in this post) reciprocal square root, must do extra work and take a slower path for denormal values (including the cost of detecting them).

What if I don’t need denormals?

Some apps simply don’t encounter or generate denormal numbers. The Wikipedia entry provides one example.

For instance, in audio processing applications, denormal values usually represent a signal so quiet that it is out of the human hearing range. Because of this, a common measure to avoid denormals on processors where there would be a performance penalty is to cut the signal to zero once it reaches denormal levels or mix in an extremely quiet noise signal.

If you are mixing in an extremely quiet noise signal (larger than denormal magnitude), the signal magnitude is guaranteed to never be subnormal. So why pay the overhead of supporting denormal numbers?

Another example is one that is near and dear to my heart: astrophysical n-body simulation. In n-body simulation, at each simulation step we must calculate the distance between each pair of bodies (stars, for example) that we are simulating. The distance calculation in the inner loop is a significant computational bottleneck. Here is the distance calculation from the CUDA “nbody” code sample included with CUDA Toolkit.

r.x = bj.x - bi.x; // bi and bj are the 3D positions of bodies i and j
r.y = bj.y - bi.y;
r.z = bj.z - bi.z;

// distSqr = dot(r_ij, r_ij)
float distSqr = r.x * r.x + r.y * r.y + r.z * r.z;
distSqr += softeningSquared; // "softening" factor 

float invDist = rsqrtf(distSqr);
float invDistCube =  invDist * invDist * invDist;

There are two things to notice in this code.  First is the use of rsqrtf(), which is provided by the CUDA math library, and maps to a reciprocal square root hardware instruction. Second is the addition of the value softeningSquared to the squared distance before calling rsqrtf(). This small constant “softening factor” is the astrophysical analog of the “extremely quiet noise signal” from Wikipedia’s audio processing example. Numerically, it guarantees that the squared distance is never denormal or zero (important because it is zero when computing the distance between a body and itself (i == j)). Physically, it precludes collisions, and can affect accuracy of some astrophysical simulations (in the various physical cases where this accuracy matters, an if (i != j) branch around the force computation is typically used instead of employing a softening factor).

By default, rsqrtf() must generate not only a single hardware instruction, but also extra code to handle the special case of denormal numbers. If we are using softening in our simulation, then we can be confident that denormals never occur in the distance calculation. So again, why pay the overhead of supporting denormal numbers? The answer is, of course, that you should not.

Flush To Zero

nvcc (the CUDA C/C++ compiler) provides the command-line option “-ftz=true” which causes all denormalized numbers to be flushed to zero. (Note: more details about floating point performance-related compiler options can be found in the CUDA Best Practices Guide—Math Libraries Section, and in the white paper linked above). For adds, multiplies and multiply-adds, the compiler appends a modifier flag to the instruction, which has no performance effect. But for functions like rsqrtf() which map to hardware functionality that does not support denormal numbers, enabling flush-to-zero can have a significant performance effect. To demonstrate, I ran the n-body sample code compiled with and without “-ftz=true“.

Results

First, to show that this option has no effect on the accuracy of the simulation (due to the use of a softening factor), I compiled with and without “-ftz=true” and compared the maximum error compared to a sequential CPU implementation for a small test set.

$ ./nbody -qatest
...
> Compute 3.5 CUDA device: [Tesla K20c]
Maximum Error: 3.358126e-04

$ ./nbody-ftz -qatest
...
> Compute 3.5 CUDA device: [Tesla K20c]
Maximum Error: 3.358126e-04

As you can see, the results are identical. (The executable “nbody-ftz” is compiled with “-ftz=true“). To test performance, I ran nbody with the -benchmark option and 212,992 bodies (I chose this size to saturate performance on the GK110-based Tesla K20c GPU with 13 SMs).

$nbody -benchmark -numbodies=212992
...
> Compute 3.5 CUDA device: [Tesla K20c]
number of bodies = 212992
212992 bodies, total time for 10 iterations: 5888.944 ms
= 77.035 billion interactions per second
= 1540.704 single-precision GFLOP/s at 20 flops per interaction

$nbody-ftz -benchmark -numbodies=212992
...
> Compute 3.5 CUDA device: [Tesla K20c]
number of bodies = 212992
212992 bodies, total time for 10 iterations: 4941.944 ms
= 91.797 billion interactions per second
= 1835.941 single-precision GFLOP/s at 20 flops per interaction

For this application, flushing denormals to zero is very beneficial, providing nearly a 20% speedup, with no impact on accuracy. If your application uses operations such as sqrt() and it does not generate denormal floating point numbers, then you should try enabling “-ftz=true“.

It pays to understand the numerical characteristics of your code!