LLM Fine-tuning Cheat Sheet

Updated 2026-04-28


LLM fine-tuning is the process of adapting pre-trained large language models to specific tasks, domains, or behaviors by continuing training on custom datasets. Born from the need to customize foundation models without the cost of training from scratch, fine-tuning has evolved into a discipline spanning parameter-efficient methods (PEFT), preference alignment (RLHF, DPO, SimPO), reinforcement learning with verifiable rewards (RLVR), and post-training model merging. In 2026, reinforcement fine-tuning with algorithms like GRPO (the engine behind DeepSeek-R1) has emerged alongside traditional supervised fine-tuning (SFT) as a primary pathway for unlocking reasoning capabilities. The key insight remains that strategic parameter updates unlock specialized performance: a 7B model fine-tuned with LoRA on quality data can outperform a generic 70B model on domain-specific tasks, making fine-tuning both an art of data curation and a science of efficient training.

What This Cheat Sheet Covers

This cheat sheet spans 14 focused tables and 118 indexed concepts. The table-by-table outline below runs from foundational concepts through advanced details.

Table 1: Fine-Tuning Approaches
Table 2: Parameter-Efficient Fine-Tuning (PEFT) Methods
Table 3: Alignment and Preference Optimization
Table 4: Data Preparation and Formats
Table 5: Training Hyperparameters
Table 6: Optimization and Schedulers
Table 7: Memory Optimization Techniques
Table 8: Quantization Methods
Table 9: Distributed Training Strategies
Table 10: Training Frameworks and Libraries
Table 11: Evaluation Metrics
Table 12: Common Pitfalls and Best Practices
Table 13: Advanced Techniques
Table 14: Model Merging

Table 1: Fine-Tuning Approaches

Method	Example	Description
Supervised Fine-Tuning (SFT)	Train on `{input, output}` pairs: `{"prompt": "Translate:", "completion": "..."}`	• Standard approach in which the model learns from labeled examples • updates all or a subset of parameters to minimize the loss between predictions and targets.
Instruction Tuning	`{"instruction": "Summarize this", "input": "text", "output": "summary"}`	• Specialized SFT that teaches the model to follow natural language instructions • improves zero-shot generalization across diverse tasks.
Parameter-Efficient Fine-Tuning (PEFT)	Freeze base model, train 0.1–1% adapters	• Freezes most parameters and trains small modules (adapters, low-rank matrices) • achieves 95–99% of full fine-tuning performance with drastically lower memory.
Reinforcement Learning from Human Feedback (RLHF)	SFT → Reward Model → PPO policy training	• Three-stage pipeline: supervised fine-tuning, training a reward model on human preferences, then optimizing the policy via RL to maximize that reward • aligns models with human values.
Direct Preference Optimization (DPO)	Train directly on `{chosen, rejected}` pairs	• Simplifies RLHF by skipping the explicit reward model • directly optimizes the policy to prefer chosen over rejected responses via an implicit reward reparameterization.
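
Two rows of the table above are easier to see with numbers: the "0.1–1% adapters" figure for PEFT and the "implicit reward" in DPO. The following is a minimal, dependency-free sketch, not a training implementation. It assumes a single square 4096×4096 weight matrix with a rank-8 LoRA adapter, and it assumes the sequence log-probabilities under the policy and the frozen reference model have already been computed. The function names, the example log-probabilities, and the default `beta=0.1` are illustrative choices, not values taken from this cheat sheet.

```python
import math


def lora_trainable_fraction(d_in: int, d_out: int, rank: int) -> float:
    """Share of parameters trained when a d_out x d_in weight is frozen and
    only the low-rank factors B (d_out x rank) and A (rank x d_in) are updated."""
    full_params = d_in * d_out
    adapter_params = rank * (d_in + d_out)
    return adapter_params / full_params


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for one {chosen, rejected} pair.

    The implicit reward of a response is beta * (log pi_policy - log pi_ref);
    the loss is -log(sigmoid(reward_chosen - reward_rejected)).
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(m)) equals softplus(-m)
    return softplus(-margin)


if __name__ == "__main__":
    # Hypothetical layer size and rank, chosen only for illustration.
    frac = lora_trainable_fraction(d_in=4096, d_out=4096, rank=8)
    print(f"LoRA trainable share of this layer: {frac:.4%}")

    # Hypothetical log-probabilities: the policy has shifted toward the
    # chosen response relative to the reference, so the loss drops below log(2).
    print("DPO loss, no preference learned:", dpo_loss(-10.0, -10.0, -10.0, -10.0))
    print("DPO loss, chosen preferred:    ", dpo_loss(-8.0, -12.0, -10.0, -10.0))
```

For that assumed layer the adapter is 1/256 of the weights (about 0.39%), which sits inside the 0.1–1% range quoted in the PEFT row. The DPO function shows why no separate reward model is needed: the reward is read off the log-probability ratio between the policy and the reference model.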
