Knowledge Distillation Cheat Sheet

Updated 2026-05-25

Knowledge distillation is a model compression technique in machine learning in which knowledge from a large, complex teacher model is transferred to a smaller, more efficient student model. Popularized by Geoffrey Hinton, Oriol Vinyals, and Jeff Dean in 2015, building on earlier model compression work by Bucilă et al. (2006), the approach makes it possible to deploy capable models on resource-constrained devices while maintaining competitive performance. The core insight is that the soft probability distributions produced by teacher models (the so-called dark knowledge) contain rich inter-class relationships that are lost when training on hard labels alone. In 2025–2026, distillation became a production-critical strategy, with landmark results like DeepSeek-R1 demonstrating that reasoning capabilities can be distilled into much smaller models that retain strong performance at a fraction of the cost. Key emerging distinctions include white-box vs. black-box teacher access and on-policy vs. off-policy training, which determine what knowledge can be transferred and how effectively.

What This Cheat Sheet Covers

This cheat sheet spans 15 focused tables and 104 indexed concepts. The table-by-table outline below runs from foundational concepts through advanced details.

Table 1: Core Concepts
Table 2: Distillation Categories
Table 3: Training Paradigms
Table 4: Loss Functions
Table 5: Teacher-Student Architectures
Table 6: Advanced Techniques
Table 7: Domain-Specific Applications
Table 8: Cross-Modal and Specialized Methods
Table 9: Evaluation and Metrics
Table 10: Implementation Considerations
Table 11: Best Practices
Table 12: Common Pitfalls and Challenges
Table 13: Integration with Other Compression Techniques
Table 14: Theoretical Foundations
Table 15: Emerging Research Directions

Table 1: Core Concepts

These foundational terms define the vocabulary of knowledge distillation. Understanding them, especially the difference between soft and hard targets, is a prerequisite for everything that follows.

Concept	Example	Description
Knowledge Distillation	`teacher = ResNet50` `student = MobileNet` `student.train(teacher.outputs)`	• Training a compact student model to mimic a larger teacher model's behavior • transfers learned representations rather than training from scratch
Dark Knowledge	Teacher outputs: `[0.89, 0.08, 0.02, 0.01]` vs. hard label: `[1, 0, 0, 0]`	Information encoded in the full probability distribution across all classes, revealing inter-class similarities that hard labels cannot capture
Soft Targets	`soft_probs = softmax(logits / T)` `T = 3` → `[0.65, 0.20, 0.10, 0.05]`	Smoothed probability distributions from the teacher that provide richer training signals than one-hot encoded hard labels
Teacher Model	`teacher = BERT-Large` `(340M parameters, 24 layers)`	• Large, high-capacity model with strong performance • serves as the knowledge source for distillation
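
The Soft Targets and Dark Knowledge rows can be made concrete with a small sketch. The snippet below is a minimal, standard-library-only illustration of `softmax(logits / T)`. The helper name `softmax_with_temperature` and the logit values are hypothetical and chosen for illustration, so the numbers it prints will not match the example distributions in the table.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Return softmax(logits / temperature) as a list of probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical teacher logits for a 4-class problem (illustrative values only).
teacher_logits = [4.0, 1.5, 0.5, -0.5]

# A hard label keeps only the winning class and discards everything else.
hard_label = [1, 0, 0, 0]

# T = 1 is the ordinary softmax; T = 3 flattens the distribution.
probs_t1 = softmax_with_temperature(teacher_logits, temperature=1.0)
probs_t3 = softmax_with_temperature(teacher_logits, temperature=3.0)

print("hard:", hard_label)
print("T=1: ", [round(p, 3) for p in probs_t1])
print("T=3: ", [round(p, 3) for p in probs_t3])
```

Raising the temperature preserves the ranking of the classes but moves probability mass from the top class to the others, which is what exposes the inter-class similarities to the student.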
