Model quantization reduces the precision of neural network parameters from high-bit representations (typically 32-bit or 16-bit floating point) to lower-bit formats (such as 8-bit or 4-bit integers), enabling faster inference, reduced memory footprint, and lower computational costs while maintaining acceptable accuracy. This technique has become essential for deploying large language models (LLMs) and deep learning models on resource-constrained devices, from edge hardware to consumer GPUs. Understanding the trade-offs between quantization granularity (per-tensor, per-channel, per-group), calibration methods (min-max, entropy-based, percentile), and algorithmic approaches (post-training vs quantization-aware training) is critical for practitioners seeking to optimize model deployment without sacrificing performance beyond acceptable thresholds. Recent advances — including NVIDIA's NVFP4 format for Blackwell GPUs, rotation-based methods like QuaRot and SpinQuant for full W4A4KV4 quantization, and Microsoft's BitNet b1.58 enabling ternary-weight inference on CPUs — have significantly expanded the frontier of practical low-bit deployment.
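To make the granularity trade-off concrete, here is a minimal sketch of symmetric INT8 quantization with min-max calibration, comparing one scale for the whole tensor against one scale per row (per-channel). The function names (`symmetric_scale`, `quantize`, `dequantize`, `row_errors`) and the toy weight matrix are illustrative assumptions, not part of any library, and plain Python lists are used only to keep the example dependency-free.

```python
def symmetric_scale(values, num_bits=8):
    """Min-max calibration for a symmetric signed grid: max |v| maps to qmax."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    return max_abs / qmax if max_abs > 0 else 1.0


def quantize(values, scale, num_bits=8):
    """Round to the nearest integer code and clamp to [-qmax, qmax]."""
    qmax = 2 ** (num_bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


def dequantize(codes, scale):
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]


def row_errors(weights, per_channel):
    """Worst absolute round-trip error for each row (output channel)."""
    tensor_scale = symmetric_scale([w for row in weights for w in row])
    errors = []
    for row in weights:
        # Per-channel: each row gets its own scale. Per-tensor: one shared scale.
        scale = symmetric_scale(row) if per_channel else tensor_scale
        restored = dequantize(quantize(row, scale), scale)
        errors.append(max(abs(a - b) for a, b in zip(row, restored)))
    return errors


# Hypothetical toy weights: one small-magnitude row, one large-magnitude row.
weights = [[0.01, -0.02, 0.015], [5.0, -3.0, 4.2]]
per_tensor_err = row_errors(weights, per_channel=False)
per_channel_err = row_errors(weights, per_channel=True)
print("per-tensor :", per_tensor_err)
print("per-channel:", per_channel_err)
```

In this toy case the shared per-tensor scale is dictated by the large row, so the small row collapses onto only a few integer codes. Giving each row its own scale removes most of that error, at the cost of storing one scale per channel.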
What This Cheat Sheet Covers
This cheat sheet spans 17 focused tables and 91 indexed concepts. The table-by-table outline below runs from foundational concepts through advanced details.
Table 1: Fundamental Quantization Types
The five primary paradigms define when and how quantization is applied — before inference, during training, or natively at training time. Choosing the right paradigm is the first and most consequential decision, as it determines available accuracy-efficiency trade-offs and hardware requirements.
| Method | Example | Description |
|---|---|---|
| Post-Training Quantization (PTQ) | Convert a trained FP32 model to INT8 using a calibration dataset | • Applies quantization after training is complete, using a calibration dataset • Faster and requires no retraining, but may lose more accuracy than QAT |
| Quantization-Aware Training (QAT) | Simulate quantization during training with fake-quantization nodes | • Simulates quantization effects during training by inserting fake-quantization operations in the forward pass • Recovers accuracy better than PTQ, but requires retraining |
| Weight-Only Quantization | Quantize model weights to INT4, keep activations in FP16 | • Quantizes only the weights to lower precision while keeping activations in higher precision • Reduces memory bandwidth and model size with minimal accuracy loss |
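The fake-quantization and weight-only rows can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not a framework implementation: `fake_quantize`, `train_qat`, `quantize_weights_int4`, and `dot_weight_only` are hypothetical names, the model is a single scalar weight, and the backward pass uses the straight-through estimator (treating the rounding step as identity), which is one common choice for QAT.

```python
def fake_quantize(w, scale, num_bits=8):
    """Quantize then immediately dequantize: the result is still a float,
    but it carries the rounding error the integer model would see."""
    qmax = 2 ** (num_bits - 1) - 1
    q = max(-qmax, min(qmax, round(w / scale)))
    return q * scale


def train_qat(data, scale=0.25, lr=0.05, steps=200):
    """Fit y = w * x while the forward pass only ever sees the quantized weight."""
    w = 0.0  # latent full-precision weight that the optimizer updates
    for _ in range(steps):
        for x, y in data:
            w_q = fake_quantize(w, scale)  # forward pass uses the quantized weight
            grad = 2 * (w_q * x - y) * x   # gradient of squared error w.r.t. w_q
            w -= lr * grad                 # straight-through: d(w_q)/d(w) taken as 1
    return w, fake_quantize(w, scale)


def quantize_weights_int4(weights):
    """Weight-only: store signed 4-bit codes plus one float scale."""
    qmax = 7  # symmetric range [-7, 7]
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dot_weight_only(codes, scale, activations):
    """Dequantize weights on the fly; activations are never quantized."""
    return sum(c * scale * a for c, a in zip(codes, activations))


# Hypothetical toy data: the target weight 1.5 lies on the 0.25-spaced grid.
latent_w, quantized_w = train_qat([(1.0, 1.5), (2.0, 3.0)])
codes, w_scale = quantize_weights_int4([0.7, -0.3, 0.1])
output = dot_weight_only(codes, w_scale, [1.0, 2.0, 3.0])
print(latent_w, quantized_w, codes, output)
```

In the QAT half, the latent weight stops moving as soon as its quantized value fits the data, so it need not equal the quantized weight. In the weight-only half, only `codes` and `w_scale` would be stored, while the activations stay in floating point throughout.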