Floating-point arithmetic is the universal lingua franca of scientific, financial, and machine learning computation, governed by the IEEE 754 standard that defines how real numbers are approximated in binary. The gap between mathematical real numbers and their finite-bit representations introduces rounding errors that, left unmanaged, can render results meaningless: a subtraction of two nearly equal numbers can annihilate every significant digit, a phenomenon called catastrophic cancellation. The key mental model: every floating-point number has a neighborhood of representable values, and each arithmetic operation rounds the exact result to the nearest neighbor (under the default round-to-nearest mode). Numerical stability is therefore the art of keeping those rounding errors from compounding into errors that dwarf the answer you care about.
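The rounding-to-nearest-neighbor model and catastrophic cancellation can be illustrated with a minimal sketch. It assumes Python's `float` is an IEEE 754 binary64 value (true on mainstream platforms); the variable names and the choice of `x = 1e-15` are illustrative only.

```python
import sys

# Gap between 1.0 and the next representable double (machine epsilon, 2**-52).
eps = sys.float_info.epsilon

# An exact result that falls between two neighbors is rounded to the nearest one:
# 1 + eps/4 lies closer to 1.0 than to 1.0 + eps, so the sum rounds back to 1.0.
rounds_back_to_one = (1.0 + eps / 4) == 1.0

# Catastrophic cancellation: x is tiny relative to 1.0, so 1.0 + x is rounded
# to a neighbor of 1.0, and the subtraction then exposes that rounding error.
x = 1e-15
naive = (1.0 + x) - 1.0          # mathematically equal to x
rel_error = abs(naive - x) / x   # relative error of the computed result

print(f"eps       = {eps!r}")
print(f"1+eps/4 == 1.0 ? {rounds_back_to_one}")
print(f"naive     = {naive!r}")
print(f"rel_error = {rel_error:.3f}")
```

Under the binary64 assumption, the single rounding in `1.0 + x` is tiny relative to 1.0, yet after the subtraction it amounts to a relative error of roughly 11% in the final answer, many orders of magnitude larger than `eps`.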
What This Cheat Sheet Covers
This topic spans 16 focused tables and 99 indexed concepts. Below is a complete table-by-table outline, from foundational concepts through advanced details.
Table 1: IEEE 754 Format Families and Bit Layouts
Understanding the bit layout of each IEEE 754 format is the prerequisite for everything else: how much range you get, how much precision, and what special encodings mean. The sign–exponent–significand decomposition is identical across all formats, but the field widths determine the trade-offs.
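As a rough illustration of the sign, exponent, and significand decomposition, the following sketch splits a 32-bit single-precision value into its three fields and rebuilds it. The helper names `decode_binary32` and `reconstruct_normal` are hypothetical, and the reconstruction formula shown handles only normal numbers (not zeros, subnormals, infinities, or NaNs).

```python
import struct

def decode_binary32(value):
    """Round `value` to single precision and return (sign, biased_exponent, fraction)."""
    # Pack as a big-endian 32-bit float, then reinterpret the same bytes as an integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31                      # 1 bit
    biased_exponent = (bits >> 23) & 0xFF  # 8 bits, stored with a bias of 127
    fraction = bits & 0x7FFFFF             # 23 stored significand bits
    return sign, biased_exponent, fraction

def reconstruct_normal(sign, biased_exponent, fraction):
    """Rebuild the value from its fields; valid only for normal numbers."""
    significand = 1 + fraction / 2**23     # implicit leading 1 for normal numbers
    return (-1.0) ** sign * significand * 2.0 ** (biased_exponent - 127)

fields = decode_binary32(-6.25)
print(fields)
print(reconstruct_normal(*fields))
```

The same shifts and masks apply to the other formats once the field widths and bias are swapped in, which is why the field widths alone determine the range and precision trade-offs.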
| Format | Example | Description |
|---|---|---|
| binary32 (single precision) | 1 sign, 8 exp, 23 significand bits (32 bits total) | • Approx. $\pm 3.4 \times 10^{38}$ range, ~7 decimal digits of precision • bias = 127 • stored as float in C/Java |
| binary64 (double precision) | 1 sign, 11 exp, 52 significand bits (64 bits total) | • Approx. $\pm 1.8 \times 10^{308}$ range, ~15–16 decimal digits • bias = 1023 • stored as double in C/Java and default in Python/NumPy |
| binary16 (half precision) | 1 sign, 5 exp, 10 significand bits (16 bits total) | • Approx. ±65,504 range, ~3 decimal digits • used in GPU shader stages, ML inference, and texture compression |
| x87 80-bit extended precision | 1 sign, 15 exp, 64 significand bits (explicit integer bit) | • Intel/AMD x87 FPU format • satisfies IEEE 754 "binary64 extended" requirements • bias = 16383 • stores results of intermediate calculations on x87 stack with extra precision |