© 2026 CheatGrid™. All rights reserved.

Knowledge Distillation Cheat Sheet


Updated 2026-05-25

Knowledge distillation is a model compression technique in which knowledge from a large, complex teacher model is transferred to a smaller, more efficient student model. Popularized by Geoffrey Hinton, Oriol Vinyals, and Jeff Dean in 2015, the approach makes it possible to deploy capable models on resource-constrained devices while maintaining competitive performance. The core insight is that the soft probability distributions produced by a teacher model (its "dark knowledge") encode rich inter-class relationships that are lost when training on hard labels alone. In 2025–2026, distillation became a production-critical strategy, with landmark results like DeepSeek-R1 demonstrating that reasoning capabilities can be distilled into much smaller models that perform strongly on reasoning benchmarks at a fraction of the cost. Key emerging distinctions include white-box vs. black-box teacher access and on-policy vs. off-policy training, which determine what knowledge can be transferred and how effectively.

What This Cheat Sheet Covers

This topic spans 15 focused tables and 104 indexed concepts. The outline below lists every table, from foundational concepts through advanced details.

Table 1: Core Concepts
Table 2: Distillation Categories
Table 3: Training Paradigms
Table 4: Loss Functions
Table 5: Teacher-Student Architectures
Table 6: Advanced Techniques
Table 7: Domain-Specific Applications
Table 8: Cross-Modal and Specialized Methods
Table 9: Evaluation and Metrics
Table 10: Implementation Considerations
Table 11: Best Practices
Table 12: Common Pitfalls and Challenges
Table 13: Integration with Other Compression Techniques
Table 14: Theoretical Foundations
Table 15: Emerging Research Directions

Table 1: Core Concepts

These foundational terms define the vocabulary of knowledge distillation. Understanding them, especially the difference between soft and hard targets, is a prerequisite to everything that follows.

Concept: Knowledge Distillation
Example:
teacher = ResNet50
student = MobileNet
student.train(teacher.outputs)
Description:
• Trains a compact student model to mimic a larger teacher model's behavior
• Transfers learned representations rather than training from scratch

Concept: Dark Knowledge
Example:
Teacher outputs: [0.89, 0.08, 0.02, 0.01]
vs. hard label: [1, 0, 0, 0]
Description: Information encoded in the full probability distribution across all classes, revealing inter-class similarities that hard labels cannot capture

Concept: Soft Targets
Example:
soft_probs = softmax(logits / T)
T = 3 → [0.65, 0.20, 0.10, 0.05]
Description: Smoothed probability distributions from the teacher that provide richer training signals than one-hot encoded hard labels

Concept: Teacher Model
Example:
teacher = BERT-Large
(340M parameters, 24 layers)
Description:
• Large, high-capacity model with strong performance
• Serves as the knowledge source for distillation
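The soft-target and dark-knowledge entries above can be made concrete with a small sketch. The code below uses only the Python standard library. It computes `softmax(logits / T)` and combines a soft-target term with a hard-label term in the way the loss is commonly written (a KL term scaled by `T**2` plus a cross-entropy term, mixed by a weight `alpha`). The function names, the logit values, and the defaults `temperature=3.0` and `alpha=0.5` are illustrative assumptions, not values taken from this cheat sheet.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Return softmax(logits / temperature) as a list of probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def distillation_loss(student_logits, teacher_logits, hard_label,
                      temperature=3.0, alpha=0.5):
    """Weighted sum of a soft-target term and a hard-label term.

    Assumed formulation: alpha * T^2 * KL(teacher_T || student_T)
    + (1 - alpha) * cross-entropy(hard label, student at T = 1).
    """
    soft_teacher = softmax_with_temperature(teacher_logits, temperature)
    soft_student = softmax_with_temperature(student_logits, temperature)
    soft_term = (temperature ** 2) * kl_divergence(soft_teacher, soft_student)

    hard_probs = softmax_with_temperature(student_logits, 1.0)
    hard_term = -math.log(hard_probs[hard_label] + 1e-12)

    return alpha * soft_term + (1 - alpha) * hard_term


# Hypothetical logits for a 4-class problem (illustrative values only).
teacher_logits = [6.0, 3.5, 2.0, 1.0]
student_logits = [4.0, 3.0, 1.0, 0.5]

hard = softmax_with_temperature(teacher_logits, temperature=1.0)
soft = softmax_with_temperature(teacher_logits, temperature=3.0)
loss = distillation_loss(student_logits, teacher_logits, hard_label=0)

print("T=1:", [round(p, 3) for p in hard])
print("T=3:", [round(p, 3) for p in soft])
print("distillation loss:", round(loss, 4))
```

Raising `T` flattens the teacher's distribution, so the small probabilities on non-target classes carry more weight in the student's training signal. Setting `alpha=1.0` trains on soft targets only, and `alpha=0.0` falls back to ordinary hard-label training.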
