Transfer Learning Cheat Sheet

Updated 2026-04-28

Transfer learning reuses knowledge from models trained on large datasets to improve learning on new tasks with limited data. Originally validated on vision models pretrained on ImageNet, the paradigm now spans NLP (BERT, GPT, T5), multimodal systems (CLIP), audio (Wav2Vec, Whisper), and domain-specific applications. Rather than training from scratch, you use pretrained weights as initialization, freeze or fine-tune layers selectively, and adapt to the target task efficiently. The key insight: lower layers learn general features (edges, syntax) while upper layers capture task-specific patterns. Selective unfreezing and discriminative learning rates exploit this hierarchy directly, and parameter-efficient methods like LoRA keep the pretrained weights fixed while training a small number of added parameters. All of these aim to reduce catastrophic forgetting and to limit negative transfer when source and target domains differ.
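To make the parameter-efficient idea concrete, here is a minimal sketch of a LoRA-style layer in PyTorch. It is an illustration under stated assumptions, not the reference implementation: the class name `LoRALinear`, the `rank` and `alpha` defaults, and the 64-to-32 stand-in layer are all hypothetical choices made for this example.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear and add a trainable low-rank update B @ A."""

    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # pretrained weights stay fixed
        # A starts small and random, B starts at zero, so the wrapped layer
        # initially computes exactly what the base layer computes.
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        update = (x @ self.lora_a.T) @ self.lora_b.T
        return self.base(x) + self.scale * update


torch.manual_seed(0)
pretrained = nn.Linear(64, 32)  # stand-in for one layer of a pretrained model
layer = LoRALinear(pretrained, rank=4)

# Count how many parameters the optimizer would actually update.
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
```

Only the two low-rank matrices receive gradients, so the trainable count stays small relative to the frozen base layer. In a real project you would more likely apply an existing PEFT library to the attention projections of a pretrained transformer than hand-roll the wrapper.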

What This Cheat Sheet Covers

This topic spans 22 focused tables and 114 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.

Table 1: Core Transfer Learning Approaches
Table 2: Pretrained Model Sources
Table 3: Layer Freezing Strategies
Table 4: Learning Rate Schedules for Fine-Tuning
Table 5: Parameter-Efficient Fine-Tuning (PEFT)
Table 6: Common Pretrained Architectures
Table 7: Transfer Learning Types
Table 8: Regularization Techniques for Transfer Learning
Table 9: Domain Adaptation Methods
Table 10: Meta-Learning for Transfer
Table 11: Avoiding Negative Transfer
Table 12: Cross-Lingual Transfer
Table 13: Knowledge Distillation
Table 14: Self-Supervised Pretraining for Transfer
Table 15: Continual Learning to Avoid Catastrophic Forgetting
Table 16: Initialization Strategies
Table 17: Common Transfer Learning Pitfalls
Table 18: Evaluation Metrics for Transfer Learning
Table 19: Transfer Learning in NLP
Table 20: LLM Alignment & Post-Training
Table 21: Transfer Learning in Computer Vision
Table 22: Transfer Learning for Audio & Speech

Table 1: Core Transfer Learning Approaches

The handful of fundamental moves you can make once you have a pretrained model in hand, ordered roughly from least to most invasive. At one end you freeze everything and train only a new head; at the other you adapt across domains or generalize to classes you've never labeled — and most real projects sit somewhere in between, mixing freezing, fine-tuning, and discriminative learning rates.

Approach	Example	Description
Feature extraction	`model = resnet50(weights='DEFAULT')` `for p in model.parameters(): p.requires_grad = False` `model.fc = nn.Linear(2048, 10)`	• Freeze all pretrained layers and train only the new classifier head on target data • Fast, low compute, effective when the target task is similar to the source.
Fine-tuning	`for param in model.parameters():` `param.requires_grad = True` `optimizer = Adam(model.parameters(), lr=1e-5)`	• Unfreeze some or all pretrained layers and retrain with a low learning rate to adapt features to the target task • higher accuracy but risks overfitting on small datasets.
Discriminative fine-tuning	`optimizer = Adam([` `{'params': model.layer1.parameters(), 'lr': 1e-5},` `{'params': model.fc.parameters(), 'lr': 1e-3}]` `)`	Assign different learning rates per layer — lower for early layers (preserve general features), higher for late layers (adapt task-specific patterns).
Gradual unfreezing	Epoch 1: freeze all but head Epoch 5: unfreeze last block Epoch 10: unfreeze all	• Incrementally unfreeze layers during training, starting from the top (output-side) layers and working down to the bottom (input-side) layers • Reduces catastrophic forgetting by letting the model adapt progressively.
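The rows above can be combined in one training loop. The sketch below is a hedged illustration rather than a recipe: it uses a tiny toy network as a stand-in for a pretrained backbone (so nothing is downloaded), and the block names `block1`, `block2`, `head`, the helper functions, and the middle learning rate of 1e-4 are assumptions made for this example. The epoch numbers follow the gradual-unfreezing row.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for a pretrained backbone plus a new task-specific head.
model = nn.Sequential()
model.add_module("block1", nn.Sequential(nn.Linear(16, 32), nn.ReLU()))
model.add_module("block2", nn.Sequential(nn.Linear(32, 32), nn.ReLU()))
model.add_module("head", nn.Linear(32, 10))


def set_trainable(module: nn.Module, flag: bool) -> None:
    """Freeze (False) or unfreeze (True) every parameter in a module."""
    for p in module.parameters():
        p.requires_grad = flag


# Feature extraction: freeze everything, then re-enable only the head.
set_trainable(model, False)
set_trainable(model.head, True)

# Discriminative fine-tuning: lower learning rates for earlier blocks.
optimizer = torch.optim.Adam([
    {"params": model.block1.parameters(), "lr": 1e-5},
    {"params": model.block2.parameters(), "lr": 1e-4},
    {"params": model.head.parameters(), "lr": 1e-3},
])

# Gradual unfreezing: epoch -> blocks that become trainable at that epoch.
UNFREEZE_AT = {5: ["block2"], 10: ["block1"]}


def apply_schedule(epoch: int) -> None:
    for name in UNFREEZE_AT.get(epoch, []):
        set_trainable(getattr(model, name), True)


def trainable_blocks() -> list:
    return [name for name, block in model.named_children()
            if any(p.requires_grad for p in block.parameters())]


# Random data in place of a real target dataset.
x, y = torch.randn(8, 16), torch.randint(0, 10, (8,))

history = {}
for epoch in range(1, 11):
    apply_schedule(epoch)
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()  # parameters without gradients are skipped
    history[epoch] = trainable_blocks()
```

All parameter groups are registered with the optimizer up front, so unfreezing a block later only requires flipping `requires_grad`; frozen parameters have no gradient and are skipped by the update. `history` records which blocks were trainable at each epoch, which is a cheap way to confirm the schedule behaves as intended.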
