Knowledge distillation is a model compression technique in machine learning in which knowledge from a large, complex teacher model is transferred to a smaller, more efficient student model. Popularized under this name by Geoffrey Hinton, Oriol Vinyals, and Jeff Dean in 2015, the approach makes it possible to deploy capable models on resource-constrained devices while maintaining competitive performance. The core insight is that the soft probability distributions produced by a teacher model (its "dark knowledge") contain rich inter-class relationships that are lost when training on hard labels alone. In 2025–2026, distillation became a production-critical strategy, with landmark results such as DeepSeek-R1 demonstrating that reasoning capabilities can be distilled into smaller models that approach frontier performance at a fraction of the cost. Key emerging distinctions include white-box vs. black-box teacher access and on-policy vs. off-policy training, which determine what knowledge can be transferred and how effectively.
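To make the soft-versus-hard-label idea concrete, here is a minimal sketch in plain Python of temperature-scaled softmax and a classic soft-target distillation loss in the style of the 2015 formulation. The logits, the temperature `T = 3`, the weighting `alpha`, and the function names are illustrative assumptions, not values or APIs taken from this cheat sheet.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature flattens the distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_logits, teacher_logits, hard_label,
                      temperature=3.0, alpha=0.5):
    """Weighted sum of a soft-target term and a hard-label cross-entropy term."""
    teacher_soft = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    # Multiplying by T^2 keeps the soft-term gradient scale comparable across temperatures.
    soft_term = (temperature ** 2) * kl_divergence(teacher_soft, student_soft)
    # Standard cross-entropy against the one-hot label, computed at T = 1.
    hard_term = -math.log(softmax(student_logits)[hard_label])
    return alpha * soft_term + (1 - alpha) * hard_term

# Hypothetical logits for a 4-class problem (illustrative values only).
teacher_logits = [6.0, 3.5, 2.0, 1.0]
student_logits = [4.0, 2.0, 1.5, 0.5]

sharp_probs = softmax(teacher_logits, temperature=1.0)  # close to a hard label
soft_probs = softmax(teacher_logits, temperature=3.0)   # exposes inter-class structure
loss = distillation_loss(student_logits, teacher_logits, hard_label=0)

print("T=1:", [round(p, 3) for p in sharp_probs])
print("T=3:", [round(p, 3) for p in soft_probs])
print("loss:", round(loss, 4))
```

Raising the temperature moves probability mass from the top class to the others, which is what lets the student see how the teacher ranks the non-target classes. In a real training loop the same computation would typically be done with a tensor library on batched logits.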
What This Cheat Sheet Covers
This cheat sheet comprises 15 focused tables and 104 indexed concepts. Below is a complete table-by-table outline, from foundational concepts through advanced details.
Table 1: Core Concepts
These foundational terms define the vocabulary of knowledge distillation. Understanding them, especially the difference between soft and hard targets, is a prerequisite for everything that follows.
| Concept | Example | Description |
|---|---|---|
| Knowledge distillation | `teacher = ResNet50`<br>`student = MobileNet`<br>`student.train(teacher.outputs)` | • Trains a compact student model to mimic a larger teacher model's behavior<br>• Transfers learned representations rather than training from scratch |
| Dark knowledge | Teacher outputs: `[0.89, 0.08, 0.02, 0.01]`<br>vs. hard label: `[1, 0, 0, 0]` | Information encoded in the full probability distribution across all classes, revealing inter-class similarities that hard labels cannot capture |
| Soft targets | `soft_probs = softmax(logits / T)`<br>`T = 3` → `[0.65, 0.20, 0.10, 0.05]` | Smoothed probability distributions from the teacher that provide richer training signals than one-hot encoded hard labels |
| Teacher model | `teacher = BERT-Large`<br>(340M parameters, 24 layers) | • Large, high-capacity model with strong performance<br>• Serves as the knowledge source for distillation |