Knowledge distillation is a model compression technique in machine learning in which knowledge from a large, complex teacher model is transferred to a smaller, more efficient student model. Popularized by Geoffrey Hinton, Oriol Vinyals, and Jeff Dean in 2015, the approach makes it possible to deploy capable models on resource-constrained devices while keeping performance competitive. The core insight is that the soft probability distributions produced by a teacher model (its "dark knowledge") encode rich inter-class relationships that are lost when training on hard labels alone. Distillation is now widely used for deploying deep learning models at scale, with applications spanning NLP transformers, computer vision CNNs, and large language models, where compression ratios of 10x or more can be achievable with minimal accuracy loss, depending on the task and architecture.
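The difference between hard labels and softened teacher outputs can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `softmax_with_temperature` and the logit values in `teacher_logits` are assumptions made for the example, and only the standard library is used.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Return softmax(logits / temperature) as a list of probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    # Subtract the max for numerical stability; the result is unchanged.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical teacher logits for a 4-class problem (illustrative values only).
teacher_logits = [5.0, 2.5, 1.0, 0.5]

# A hard label keeps only the winning class and discards everything else.
hard_label = [1, 0, 0, 0]

# T = 1 is the ordinary softmax; a higher T flattens the distribution.
probs_t1 = softmax_with_temperature(teacher_logits, temperature=1.0)
probs_t3 = softmax_with_temperature(teacher_logits, temperature=3.0)

print("hard label:", hard_label)
print("T=1:", [round(p, 3) for p in probs_t1])
print("T=3:", [round(p, 3) for p in probs_t3])
```

Raising the temperature leaves the class ranking intact but shifts probability mass toward the non-winning classes, so the student sees how similar the teacher considers each alternative rather than only which class won.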
What This Cheat Sheet Covers
This cheat sheet spans 15 focused tables and 91 indexed concepts. The table-by-table outline below runs from foundational concepts through advanced details.
Table 1: Core Concepts
| Concept | Example | Description |
|---|---|---|
| Knowledge distillation | `teacher = ResNet50`; `student = MobileNet`; `student.train(teacher.outputs)` | • Training a compact student model to mimic a larger teacher model's behavior • Transfers learned representations rather than training from scratch |
| Dark knowledge | Teacher outputs: `[0.89, 0.08, 0.02, 0.01]` vs. hard label: `[1, 0, 0, 0]` | Information encoded in the full probability distribution across all classes, revealing inter-class similarities that hard labels cannot capture |
| Soft targets | `soft_probs = softmax(logits / T)`; `T = 3` → `[0.65, 0.20, 0.10, 0.05]` | Smoothed probability distributions from the teacher that provide richer training signals than one-hot encoded hard labels |
| Teacher model | `teacher = BERT-Large` (340M parameters, 24 layers) | • Large, high-capacity model with strong performance • Serves as the knowledge source for distillation |