Model quantization reduces the precision of neural network parameters from high-bit representations (typically 32-bit or 16-bit floating point) to lower-bit formats (such as 8-bit or 4-bit integers), enabling faster inference, reduced memory footprint, and lower computational costs while maintaining acceptable accuracy. This technique has become essential for deploying large language models (LLMs) and deep learning models on resource-constrained devices, from edge hardware to consumer GPUs. Understanding the trade-offs between quantization granularity (per-tensor, per-channel, per-group), calibration methods (min-max, entropy-based, percentile), and algorithmic approaches (post-training vs quantization-aware training) is critical for practitioners seeking to optimize model deployment without sacrificing performance beyond acceptable thresholds.
What This Cheat Sheet Covers
This cheat sheet spans 16 focused tables and 70 indexed concepts. Below is a complete table-by-table outline, from foundational concepts through advanced details.
Table 1: Fundamental Quantization Types
| Method | Example | Description |
|---|---|---|
| Post-training quantization (PTQ) | Convert trained FP32 model to INT8 using calibration dataset | • Applies quantization after training is complete, using a calibration dataset • Faster and requires no retraining, but may lose more accuracy than QAT |
| Quantization-aware training (QAT) | Simulate quantization during training with fake-quantization nodes | • Simulates quantization effects during training by inserting fake-quantization operations in the forward pass • Recovers accuracy better than PTQ but requires retraining |
| Dynamic quantization | Weights pre-quantized to INT8, activations quantized on-the-fly | • Quantizes weights statically before inference but quantizes activations dynamically at runtime • Good for models where activation ranges vary significantly across inputs |
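
The three rows of Table 1 differ mainly in when the quantization range is chosen. The sketch below illustrates this with plain Python lists, assuming asymmetric (affine) 8-bit quantization with min-max calibration. The helper names (`minmax_qparams`, `quantize`, `dequantize`, `fake_quantize`) and the sample values are hypothetical and chosen for illustration; they are not any library's API.

```python
def minmax_qparams(values, num_bits=8):
    """Min-max calibration: derive (scale, zero_point) for unsigned integers."""
    qmin, qmax = 0, 2 ** num_bits - 1
    # The representable range must include 0.0 so that zero maps exactly.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # avoid a zero scale for all-zero input
    zero_point = max(qmin, min(qmax, round(qmin - lo / scale)))
    return scale, zero_point


def quantize(values, scale, zero_point, num_bits=8):
    """Map floats to integers in [0, 2**num_bits - 1], clipping out-of-range values."""
    qmax = 2 ** num_bits - 1
    return [max(0, min(qmax, round(v / scale) + zero_point)) for v in values]


def dequantize(q_values, scale, zero_point):
    """Map integers back to approximate floats."""
    return [(q - zero_point) * scale for q in q_values]


def fake_quantize(values, scale, zero_point, num_bits=8):
    """Quantize then dequantize: the rounding op QAT inserts in the forward pass."""
    return dequantize(quantize(values, scale, zero_point, num_bits), scale, zero_point)


# Illustrative data (assumed, not from any real model).
weights = [-0.62, -0.10, 0.0, 0.33, 0.91]
calibration_batches = [[0.1, 0.8, 1.9], [0.0, 2.4, 0.7]]
new_input = [0.2, 3.5, 1.0]  # contains a value larger than anything seen in calibration

# In all three approaches the weights are quantized once, ahead of inference.
w_scale, w_zp = minmax_qparams(weights)
w_q = quantize(weights, w_scale, w_zp)

# Static PTQ: the activation range is fixed from the calibration dataset,
# so an unusually large activation at inference time saturates.
calib_values = [v for batch in calibration_batches for v in batch]
static_scale, static_zp = minmax_qparams(calib_values)
static_out = fake_quantize(new_input, static_scale, static_zp)

# Dynamic quantization: the activation range is recomputed from each input at runtime.
dyn_scale, dyn_zp = minmax_qparams(new_input)
dyn_out = fake_quantize(new_input, dyn_scale, dyn_zp)

# QAT: the forward pass sees fake-quantized weights, so training can adapt
# to the rounding error (the training loop itself is omitted here).
w_fake = fake_quantize(weights, w_scale, w_zp)

print("static:", static_out)
print("dynamic:", dyn_out)
```

In this toy setup the static path clips the out-of-range activation to the top of the calibrated range, while the dynamic path keeps it at the cost of computing the range for every input. This is the trade-off the Dynamic quantization row describes.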