CUDA Pro Tip: Flush Denormals with Confidence

I want to keep this post fairly brief, so I will only give minimal background on floating point numbers. If you need a refresher on floating point representation, I recommend starting with the Wikipedia entry on floating point, and for more detail about NVIDIA GPU floating point, check out this excellent white paper. The Wikipedia entry on denormal numbers is a good start for this post, so I’ll begin by paraphrasing it.

What’s a denormal?

Normal floating point values have no leading zeros in the mantissa (or significand). The mantissa is normalized and any leading zeros are moved into the exponent. Subnormal numbers (or denormal numbers) are floating point numbers where this normalized representation would result in an exponent that is too small (not representable). So unlike normal floating point numbers, subnormal numbers have leading zeros in the mantissa. Doing this loses significant digits, but less abruptly than flushing the result to zero on underflow. This allows what is known as “gradual underflow” when a result is very small, and helps avoid catastrophic division-by-zero errors.
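To make the definition concrete, here is a minimal host-side C++ sketch (not GPU code) that classifies single-precision values. The helper names `is_subnormal` and `half_of_smallest_normal` are made up for illustration, and the sketch assumes the host is not itself running in a flush-to-zero mode.

```cpp
#include <cmath>
#include <limits>

// True when x is subnormal: nonzero, but smaller in magnitude than the
// smallest normal float.
bool is_subnormal(float x) {
    return std::fpclassify(x) == FP_SUBNORMAL;
}

// Halving the smallest normal float underflows gradually: the result is a
// nonzero subnormal value rather than zero (assuming no flush-to-zero mode).
float half_of_smallest_normal() {
    return std::numeric_limits<float>::min() / 2.0f;
}
```

With gradual underflow, `half_of_smallest_normal()` is still greater than zero, so dividing by it does not divide by zero.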

Denormal numbers can incur extra computational cost. The Wikipedia entry explains that some platforms implement denormal numbers in software, while others handle them in hardware. On NVIDIA GPUs starting with the Fermi architecture (Compute Capability 2.0 and higher), denormal numbers are handled in hardware for operations that go through the Fused Multiply-Add pipeline (including adds and multiplies), and so these operations don’t carry performance overhead. But multi-instruction sequences such as square root and (notably for my example later in this post) reciprocal square root, must do extra work and take a slower path for denormal values (including the cost of detecting them).
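On the GPU, flushing denormals is a compiler setting rather than something you write by hand (for example, `nvcc -ftz=true`). To show what flushing does to a result, the following host-side C++ sketch models it in software. The functions `flush_to_zero`, `rsqrt_gradual`, and `rsqrt_flushed` are hypothetical helpers, and the model (flush both input and output) is an assumption for illustration, not a description of the exact hardware path.

```cpp
#include <cmath>

// Software model of flush-to-zero: subnormals become a zero of the same sign,
// every other value passes through unchanged.
float flush_to_zero(float x) {
    return std::fpclassify(x) == FP_SUBNORMAL ? std::copysign(0.0f, x) : x;
}

// Reciprocal square root with gradual underflow: subnormal inputs are honored.
float rsqrt_gradual(float x) {
    return 1.0f / std::sqrt(x);
}

// Reciprocal square root with inputs and outputs flushed (assumed model).
float rsqrt_flushed(float x) {
    return flush_to_zero(1.0f / std::sqrt(flush_to_zero(x)));
}
```

For normal inputs the two versions agree. For a subnormal input, `rsqrt_gradual` returns a large finite value while `rsqrt_flushed` treats the input as zero and returns infinity, which is the accuracy you trade away for the faster path.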

Everything You Ever Wanted to Know About Floating Point but Were Afraid to Ask

This post was written by Nathan Whitehead

A few days ago a friend came to me with a question about floating point. Let me start by saying that my friend knows his stuff; he doesn’t ask stupid questions. So he had my attention. He was working on some biosciences simulation code, was getting answers of a different precision than he expected on the GPU, and wanted to know what was up.

Even expert CUDA programmers don’t always know all the intricacies of floating point.  It’s a tricky topic.  Even my friend, who is so cool he wears sunglasses indoors, needed some help.  If you look at the NVIDIA CUDA forums, questions and concerns about floating point come up regularly.  [1] [2] [3] [4] [5] [6] [7]  Getting a handle on how to effectively use floating point is obviously very important if you are doing numeric computations in CUDA.

In an attempt to help out, Alex and I have written a short whitepaper about floating point on NVIDIA GPUs.

In the paper we talk about various issues related to floating point in CUDA.  You will learn:

  • How the IEEE 754 standard fits in with NVIDIA GPUs
  • How fused multiply-add improves accuracy
  • That there is more than one way to compute a dot product (we present three)
  • How to make sense of different numerical results between CPU and GPU
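As a rough illustration of the dot product point, here is a host-side C++ sketch of three ways to accumulate the same sum: serially, serially with a fused multiply-add, and by pairwise splitting. The function names are invented for this example, and these three variants are an assumption about what such a comparison might look like, not necessarily the ones presented in the paper.

```cpp
#include <cmath>
#include <cstddef>

// Serial: round each product, then round each addition.
// Note: some compilers may contract the multiply and add into one FMA.
float dot_serial(const float* a, const float* b, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        float prod = a[i] * b[i];
        sum = sum + prod;
    }
    return sum;
}

// FMA: each step computes a[i] * b[i] + sum with a single rounding.
float dot_fma(const float* a, const float* b, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        sum = std::fma(a[i], b[i], sum);
    }
    return sum;
}

// Pairwise: split the range in half and add the two partial sums, which
// mirrors how a parallel reduction groups the additions.
float dot_pairwise(const float* a, const float* b, std::size_t n) {
    if (n == 0) return 0.0f;
    if (n == 1) return a[0] * b[0];
    std::size_t half = n / 2;
    return dot_pairwise(a, b, half) + dot_pairwise(a + half, b + half, n - half);
}
```

All three are mathematically the same dot product, but they round at different points, so on inputs that are not exactly representable they can return slightly different floats. That is one common source of small differences between CPU and GPU results.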