Model Quantization Cheat Sheet

Updated 2026-05-25

Model quantization reduces the precision of neural network parameters from high-bit representations (typically 32-bit or 16-bit floating point) to lower-bit formats (such as 8-bit or 4-bit integers). The result is faster inference, a smaller memory footprint, and lower compute cost, provided the accuracy loss stays within acceptable bounds. The technique has become essential for deploying large language models (LLMs) and other deep learning models on resource-constrained hardware, from edge devices to consumer GPUs. Practitioners need to understand the trade-offs among quantization granularity (per-tensor, per-channel, per-group), calibration methods (min-max, entropy-based, percentile), and algorithmic approaches (post-training quantization vs. quantization-aware training) in order to deploy models without degrading accuracy beyond acceptable thresholds. Recent advances, including NVIDIA's NVFP4 format for Blackwell GPUs, rotation-based methods like QuaRot and SpinQuant for full W4A4KV4 quantization, and Microsoft's BitNet b1.58 enabling ternary-weight inference on CPUs, have significantly expanded the frontier of practical low-bit deployment.
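To make the granularity trade-off concrete, here is a minimal sketch in plain Python. It assumes symmetric quantization with a single max-abs scale. The helper names (`quantize_symmetric`, `dequantize`, `max_error`) and the toy weight values are illustrative and do not come from any particular library. The sketch compares the reconstruction error of a small-magnitude channel under one shared per-tensor scale against its own per-channel scale.

```python
from typing import List, Tuple


def quantize_symmetric(values: List[float], num_bits: int = 8) -> Tuple[List[int], float]:
    """Symmetric quantization: map floats to signed integer codes with one scale."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer code, then clamp to the representable range.
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes: List[int], scale: float) -> List[float]:
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]


def max_error(original: List[float], restored: List[float]) -> float:
    """Largest absolute reconstruction error."""
    return max(abs(a - b) for a, b in zip(original, restored))


# Toy weight matrix (hypothetical values): two output channels whose
# magnitudes differ by roughly two orders of magnitude.
weights = [
    [0.02, -0.01, 0.03, -0.025],
    [1.5, -2.0, 0.75, 1.0],
]

# Per-tensor: one scale shared by every weight, dominated by the large channel.
flat = [w for row in weights for w in row]
codes, tensor_scale = quantize_symmetric(flat)
restored = dequantize(codes, tensor_scale)
per_tensor_err = max_error(weights[0], restored[:4])

# Per-channel: the small channel gets its own, much finer scale.
row_codes, row_scale = quantize_symmetric(weights[0])
per_channel_err = max_error(weights[0], dequantize(row_codes, row_scale))

print(f"per-tensor error on small channel:  {per_tensor_err:.6f}")
print(f"per-channel error on small channel: {per_channel_err:.6f}")
```

In this toy case the per-channel error on the small channel is far lower, because its scale is no longer dictated by the largest weight in the whole tensor. The cost is one stored scale per channel instead of one per tensor.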

What This Cheat Sheet Covers

This topic spans 17 focused tables and 91 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.

Table 1: Fundamental Quantization Types
Table 2: Integer Precision Formats
Table 3: Floating-Point Precision Formats
Table 4: Advanced Quantization Algorithms
Table 5: Quantization Schemes
Table 6: Weight-Activation Precision Configurations
Table 7: Quantization Format Standards
Table 8: Calibration and Range Determination
Table 9: Rounding Methods
Table 10: Mixed-Precision Strategies
Table 11: Hardware-Specific Optimizations
Table 12: Outlier-Aware Techniques
Table 13: KV Cache Quantization
Table 14: Quantization Frameworks and Tools
Table 15: Quantization Evaluation Metrics
Table 16: Deployment Considerations
Table 17: Quantization and Fine-Tuning

Table 1: Fundamental Quantization Types

The five primary paradigms define when and how quantization is applied — before inference, during training, or natively at training time. Choosing the right paradigm is the first and most consequential decision, as it determines available accuracy-efficiency trade-offs and hardware requirements.

Method | Example | Description
Post-Training Quantization (PTQ) | Convert a trained FP32 model to INT8 using a calibration dataset | Applies quantization after training is complete, using a calibration dataset. Faster and requires no retraining, but may lose more accuracy than QAT.
Quantization-Aware Training (QAT) | Simulate quantization during training with fake-quantization nodes | Simulates quantization effects during training by inserting fake-quantization operations in the forward pass. Recovers accuracy better than PTQ, but requires retraining.
Weight-Only Quantization | Quantize model weights to INT4, keep activations in FP16 | Quantizes only the weights to lower precision while keeping activations in higher precision. Reduces memory bandwidth and model size with minimal accuracy loss.
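
The three rows above can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not any framework's implementation: min-max calibration with an unsigned 8-bit affine scheme stands in for PTQ, a quantize-then-dequantize function stands in for the fake-quantization operation used in the QAT forward pass, and a symmetric INT4 weight row multiplied by float activations stands in for weight-only quantization. All function names and sample values are hypothetical.

```python
from typing import List, Sequence, Tuple


def calibrate_min_max(batches: Sequence[Sequence[float]]) -> Tuple[float, float]:
    """PTQ-style calibration: record the min and max over all calibration batches."""
    return min(min(b) for b in batches), max(max(b) for b in batches)


def affine_params(lo: float, hi: float, num_bits: int = 8) -> Tuple[float, int]:
    """Derive scale and zero point for unsigned affine quantization."""
    qmin, qmax = 0, 2 ** num_bits - 1
    lo, hi = min(lo, 0.0), max(hi, 0.0)  # keep 0.0 exactly representable
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for an empty range
    zero_point = max(qmin, min(qmax, round(qmin - lo / scale)))
    return scale, zero_point


def fake_quantize(x: Sequence[float], scale: float, zero_point: int,
                  num_bits: int = 8) -> List[float]:
    """Quantize then immediately dequantize: outputs stay float but only take
    values the integer grid can represent (the forward pass of a fake-quant op)."""
    qmin, qmax = 0, 2 ** num_bits - 1
    out = []
    for v in x:
        q = max(qmin, min(qmax, round(v / scale) + zero_point))
        out.append((q - zero_point) * scale)
    return out


def quantize_weights_int4(row: Sequence[float]) -> Tuple[List[int], float]:
    """Weight-only: symmetric INT4 codes in [-7, 7] with one scale per row."""
    qmax = 7
    max_abs = max(abs(w) for w in row)
    w_scale = max_abs / qmax if max_abs > 0 else 1.0
    return [max(-qmax, min(qmax, round(w / w_scale))) for w in row], w_scale


def dot_weight_only(codes: Sequence[int], w_scale: float, x: Sequence[float]) -> float:
    """Dequantize weights on the fly; activations are never quantized."""
    return sum(c * w_scale * xi for c, xi in zip(codes, x))


# PTQ: calibrate the activation range on sample batches (hypothetical data).
lo, hi = calibrate_min_max([[-1.0, 0.5, 2.0], [0.1, 3.0, -0.5]])
scale, zero_point = affine_params(lo, hi)

# QAT forward pass: 5.0 lies outside the calibrated range and gets clipped.
simulated = fake_quantize([0.0, 1.0, 5.0], scale, zero_point)

# Weight-only: INT4 weights, full-precision activations.
weights, x = [0.7, -0.33, 0.16], [1.0, 2.0, 3.0]
w_codes, w_scale = quantize_weights_int4(weights)
y_quant = dot_weight_only(w_codes, w_scale, x)
y_exact = sum(w * xi for w, xi in zip(weights, x))

print("fake-quantized activations:", simulated)
print("weight-only output:", y_quant, "vs full precision:", y_exact)
```

In real QAT the rounding step has zero gradient almost everywhere, so frameworks typically pair the fake-quantization forward pass with a straight-through estimator in the backward pass. That part is omitted here because the sketch has no autograd.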
