Model Quantization Cheat Sheet

Updated 2026-03-17


Model quantization reduces the precision of a neural network's weights and activations from high-bit representations (typically 32-bit or 16-bit floating point) to lower-bit formats (such as 8-bit or 4-bit integers). The result is faster inference, a smaller memory footprint, and lower compute cost, usually in exchange for a small accuracy loss. The technique has become essential for deploying large language models (LLMs) and other deep learning models on resource-constrained hardware, from edge devices to consumer GPUs. Optimizing a deployment without pushing accuracy below an acceptable threshold requires understanding the trade-offs among quantization granularity (per-tensor, per-channel, per-group), calibration methods (min-max, entropy-based, percentile), and algorithmic approaches (post-training quantization vs. quantization-aware training).
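
As a rough illustration of the ideas above, the following sketch quantizes a few made-up weights with symmetric min-max scaling, then compares per-tensor and per-channel granularity. The function names (quantize_symmetric, dequantize, max_error) and all values are hypothetical and chosen for this example only. Real toolkits add zero points, group-wise scales, and packed storage that are not shown here.

```python
def quantize_symmetric(values, num_bits=8):
    """Map floats to signed integers using one min-max scale for the whole list."""
    qmax = 2 ** (num_bits - 1) - 1              # 127 for 8-bit, 7 for 4-bit
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer step, then clamp to the representable range.
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from integers and their scale."""
    return [x * scale for x in q]


def max_error(original, q, scale):
    """Largest absolute round-trip error across all values."""
    return max(abs(a - b) for a, b in zip(original, dequantize(q, scale)))


# Bit width: fewer bits means coarser steps and a larger round-trip error.
weights = [0.02, -0.5, 0.31, 1.27, -0.9]        # made-up example values
q8, s8 = quantize_symmetric(weights, num_bits=8)
q4, s4 = quantize_symmetric(weights, num_bits=4)
err8 = max_error(weights, q8, s8)
err4 = max_error(weights, q4, s4)

# Granularity: two made-up "channels" with very different magnitudes.
channels = [[0.004, -0.02, 0.013], [4.0, -2.5, 3.2]]
flat = [v for row in channels for v in row]
per_tensor_q, per_tensor_scale = quantize_symmetric(flat)     # one shared scale
per_channel = [quantize_symmetric(row) for row in channels]   # one scale per row

small_row_per_tensor = per_tensor_q[:3]
small_row_per_channel = per_channel[0][0]

print("INT8:", q8, "max error", err8)
print("INT4:", q4, "max error", err4)
print("small channel, per-tensor: ", small_row_per_tensor)
print("small channel, per-channel:", small_row_per_channel)
```

With these values, the shared per-tensor scale is set by the large channel, so the small channel collapses to [0, -1, 0]. Giving each channel its own scale keeps the small channel spread across the INT8 range, at the cost of storing one scale per channel.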

What This Cheat Sheet Covers

This topic spans 16 focused tables and 70 indexed concepts. The table-by-table outline below runs from foundational concepts to advanced details.

Table 1: Fundamental Quantization Types
Table 2: Integer Precision Formats
Table 3: Floating-Point Precision Formats
Table 4: Advanced Quantization Algorithms
Table 5: Quantization Schemes
Table 6: Quantization Format Standards
Table 7: Calibration and Range Determination
Table 8: Rounding Methods
Table 9: Mixed-Precision Strategies
Table 10: Hardware-Specific Optimizations
Table 11: Outlier-Aware Techniques
Table 12: KV Cache Quantization
Table 13: Quantization Frameworks and Tools
Table 14: Quantization Evaluation Metrics
Table 15: Deployment Considerations
Table 16: Quantization and Fine-Tuning

Table 1: Fundamental Quantization Types

Method	Example	Description
Post-Training Quantization (PTQ)	Convert a trained FP32 model to INT8 using a calibration dataset	• Applies quantization after training is complete, using a calibration dataset to set value ranges • Faster than QAT and requires no retraining, but may lose more accuracy.
Quantization-Aware Training (QAT)	Simulate quantization during training with fake-quantization nodes	• Simulates quantization effects during training by inserting fake-quantization operations in the forward pass • Typically preserves accuracy better than PTQ, but requires retraining.
Dynamic Quantization	Weights pre-quantized to INT8, activations quantized on the fly	• Quantizes weights ahead of inference and activations at runtime • Suited to models whose activation ranges vary significantly across inputs.
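
The three methods in Table 1 differ mainly in when the activation scale is chosen and whether rounding happens during training. The sketch below is a simplified illustration under stated assumptions: symmetric INT8, min-max ranges, and hypothetical helper names (scale_from_range, quantize, fake_quantize, dynamic_quantize) that do not belong to any library. All numbers are made up.

```python
QMAX = 127  # largest magnitude in symmetric INT8


def scale_from_range(max_abs, qmax=QMAX):
    """Turn an observed absolute range into a quantization step size."""
    return max_abs / qmax if max_abs > 0 else 1.0


def quantize(values, scale, qmax=QMAX):
    """Round to integer steps and clamp values that fall outside the range."""
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


def fake_quantize(values, scale, qmax=QMAX):
    """QAT-style quantize-then-dequantize: the output stays float but
    carries the rounding error the deployed model would see."""
    return [q * scale for q in quantize(values, scale, qmax)]


# Static PTQ: the activation scale is fixed once from calibration batches.
calibration_batches = [[0.1, -0.8, 0.4], [0.9, -0.2, 0.3]]
static_scale = scale_from_range(
    max(abs(v) for batch in calibration_batches for v in batch)
)


# Dynamic quantization: the activation scale is recomputed for each input.
def dynamic_quantize(activations):
    scale = scale_from_range(max(abs(v) for v in activations))
    return quantize(activations, scale), scale


# An input whose range is wider than anything seen during calibration.
wide_input = [0.2, -3.6, 1.5]
static_q = quantize(wide_input, static_scale)        # large values are clipped
dynamic_q, dynamic_scale = dynamic_quantize(wide_input)

# Fake quantization as it would appear in a QAT forward pass.
fq = fake_quantize(calibration_batches[0], static_scale)

print("static: ", static_q)
print("dynamic:", dynamic_q)
print("fake-quantized:", fq)
```

With the fixed calibration scale, the two large entries of wide_input saturate at -127 and 127, while the per-input scale keeps them distinct. That is the behavior the table describes for activation ranges that vary across inputs. The fake-quantized values stay within half a step of the originals, which is the error a QAT forward pass would expose to the model. Gradient handling during training is outside the scope of this sketch.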
