Numerical Stability and Floating Point Arithmetic Cheat Sheet

Updated 2026-05-21


Floating-point arithmetic is the lingua franca of scientific, financial, and machine learning computation, governed by the IEEE 754 standard, which defines how real numbers are approximated in binary. The gap between mathematical real numbers and their finite-bit representations introduces rounding errors that, left unmanaged, can render results meaningless: subtracting two nearly equal numbers can wipe out most or all significant digits, a phenomenon called catastrophic cancellation. The key mental model: every floating-point number has a neighborhood of representable values, and each arithmetic operation rounds the exact result to the nearest neighbor (under the default round-to-nearest mode), so numerical stability is really the art of keeping those rounding errors from compounding into errors that dwarf the answer you care about.
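As a rough illustration of the mental model above, the following Python sketch measures the gap between 1.0 and its nearest larger neighbor, then shows cancellation destroying accuracy in a subtraction. It assumes Python 3.9 or later (for `math.nextafter`) and a platform where Python's `float` is binary64; the variable names are illustrative only.

```python
import math

# Gap between 1.0 and the next representable double above it.
# On a binary64 platform this is 2**-52.
gap = math.nextafter(1.0, math.inf) - 1.0

# In real arithmetic, (1 + x) - 1 equals x exactly.
# In binary64, 1 + x is first rounded to the nearest neighbor of the
# exact sum, and the subtraction then exposes that rounding error.
x = 1e-15
naive = (1.0 + x) - 1.0

# Relative error of the computed result with respect to the intended x.
rel_error = abs(naive - x) / x

print(f"gap near 1.0   = {gap!r}")
print(f"intended x     = {x!r}")
print(f"(1 + x) - 1    = {naive!r}")
print(f"relative error = {rel_error:.3f}")
```

The addition commits an absolute error of at most half a gap, which is tiny relative to 1.0 but large relative to x. The subtraction itself is exact; it merely reveals the error already made.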

What This Cheat Sheet Covers

This cheat sheet spans 16 focused tables and 99 indexed concepts, from foundational concepts through advanced details. The table-by-table outline follows.

Table 1: IEEE 754 Format Families and Bit Layouts
Table 2: Special Values — NaN, Infinity, Signed Zero
Table 3: Subnormal (Denormal) Numbers
Table 4: Machine Epsilon, ULP, and Representational Limits
Table 5: Rounding Modes
Table 6: Catastrophic Cancellation and Loss of Significance
Table 7: Forward and Backward Error Analysis, Condition Number
Table 8: Comparing Floating-Point Numbers
Table 9: Compensated and Accurate Summation Algorithms
Table 10: Log-Sum-Exp Trick and Stable Softmax
Table 11: Welford's Algorithm and Stable Variance
Table 12: Fused Multiply-Add (FMA) and Error-Free Arithmetic
Table 13: Floating-Point Formats for Machine Learning and GPUs
Table 14: Fast Inverse Square Root and Bit-Level Tricks
Table 15: Interval Arithmetic Basics
Table 16: Common Pitfalls and Best Practices

Table 1: IEEE 754 Format Families and Bit Layouts

Understanding the bit layout of each IEEE 754 format is the prerequisite for everything else: how much range you get, how much precision, and what special encodings mean. The sign–exponent–significand decomposition is identical across all formats, but the field widths determine the trade-offs.

Format	Bit layout	Description
binary32 (single precision)	1 sign, 8 exponent, 23 stored significand bits (24 with the implicit leading bit) → 32 bits total	• Largest finite magnitude approx. $3.4 \times 10^{38}$, ~7 decimal digits of precision • bias = 127 • stored as `float` in C/Java
binary64 (double precision)	1 sign, 11 exponent, 52 stored significand bits (53 with the implicit leading bit) → 64 bits total	• Largest finite magnitude approx. $1.8 \times 10^{308}$, ~15–16 decimal digits of precision • bias = 1023 • stored as `double` in C/Java and the default float type in Python/NumPy
binary16 (half precision)	1 sign, 5 exponent, 10 stored significand bits (11 with the implicit leading bit) → 16 bits total	• Largest finite magnitude 65,504, ~3 decimal digits of precision • bias = 15 • used in GPU shader stages, ML inference, and texture compression
80-bit extended precision (x87)	1 sign, 15 exponent, 64 significand bits (explicit integer bit, no implicit bit) → 80 bits total	• Intel/AMD x87 FPU format • meets the IEEE 754 requirements for an extended format of binary64 • bias = 16383 • holds intermediate results on the x87 stack with extra precision
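To make the sign, exponent, and significand decomposition concrete, here is a minimal Python sketch that splits a binary64 value into its three fields and rebuilds it. The helper names `decompose_binary64` and `reconstruct_normal` are hypothetical, chosen for illustration; the sketch assumes Python's `float` is binary64 and handles only normal numbers in the reconstruction step.

```python
import struct

BIAS = 1023          # binary64 exponent bias
FRACTION_BITS = 52   # stored significand (fraction) bits in binary64

def decompose_binary64(x: float):
    """Return (sign, biased_exponent, fraction) for a binary64 value."""
    # Reinterpret the 8 bytes of the double as an unsigned 64-bit integer.
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63
    biased_exponent = (bits >> FRACTION_BITS) & 0x7FF
    fraction = bits & ((1 << FRACTION_BITS) - 1)
    return sign, biased_exponent, fraction

def reconstruct_normal(sign: int, biased_exponent: int, fraction: int) -> float:
    """Rebuild a normal number: (-1)^sign * 1.fraction * 2^(exponent - bias).

    Not valid for zeros, subnormals, infinities, or NaNs, which use the
    reserved exponent fields 0 and 2047.
    """
    if not 0 < biased_exponent < 0x7FF:
        raise ValueError("only normal numbers are handled in this sketch")
    significand = 1 + fraction / 2**FRACTION_BITS  # implicit leading 1
    return (-1.0) ** sign * significand * 2.0 ** (biased_exponent - BIAS)

# -6.5 = -1.625 * 2^2, so the biased exponent is 2 + 1023 = 1025.
fields = decompose_binary64(-6.5)
print(fields)
print(reconstruct_normal(*fields))
```

The same approach would apply to binary32 by packing with `">f"`, unpacking with `">I"`, and swapping in 8 exponent bits, 23 fraction bits, and a bias of 127.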
