LLM fine-tuning is the process of adapting pre-trained large language models to specific tasks, domains, or behaviors by continuing training on custom datasets. Born from the need to customize foundation models without the cost of training from scratch, fine-tuning has evolved into a discipline spanning parameter-efficient methods (PEFT), preference alignment (RLHF, DPO, SimPO), reinforcement learning with verifiable rewards (RLVR), and post-training model merging. In 2026, reinforcement fine-tuning with algorithms like GRPO—the engine behind DeepSeek-R1—has emerged alongside traditional SFT as a primary pathway for unlocking reasoning capabilities. The key insight remains that strategic parameter updates unlock specialized performance: a 7B model fine-tuned with LoRA on quality data can outperform a generic 70B model on domain-specific tasks, making fine-tuning both an art of data curation and a science of efficient training.
What This Cheat Sheet Covers
This topic spans 14 focused tables and 118 indexed concepts. Below is a complete table-by-table outline, running from foundational concepts through advanced details.
Table 1: Fine-Tuning Approaches
| Method | Example | Description |
|---|---|---|
| Supervised Fine-Tuning (SFT) | Train on {input, output} pairs: `{"prompt": "Translate:", "completion": "..."}` | Standard approach in which the model learns from labeled examples, updating all or a subset of its parameters to minimize the loss between predictions and targets. |
| Instruction Tuning | `{"instruction": "Summarize this", "input": "text", "output": "summary"}` | • Specialized SFT that teaches the model to follow natural-language instructions • Improves zero-shot generalization across diverse tasks. |
| Parameter-Efficient Fine-Tuning (PEFT) | Freeze base model, train 0.1–1% adapters | • Freezes most parameters and trains small modules (adapters, low-rank matrices) • Often achieves 95–99% of full fine-tuning performance with drastically lower memory use. |
| RLHF | SFT → Reward Model → PPO policy training | • Three-stage pipeline: supervised fine-tuning, training a reward model on human preferences, then optimizing the policy via RL to maximize reward • Aligns models with human values. |
| DPO | Train directly on {chosen, rejected} pairs | • Simplifies RLHF by skipping the separate reward model • Directly optimizes the policy to prefer chosen over rejected responses via an implicit reward reparameterization. |
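
Two rows of the table above reduce to short arithmetic, sketched here in plain Python. The first function estimates how many extra trainable parameters a LoRA adapter adds to a single frozen weight matrix, which is where figures like "0.1–1%" come from. The second evaluates the standard DPO objective for one {chosen, rejected} pair from summed log-probabilities. The function names, the example layer size (4096 × 4096, rank 8), the log-probability values, and `beta=0.1` are illustrative assumptions, not values from this cheat sheet; a real training run would use a library such as PEFT or TRL rather than this sketch.

```python
import math


def lora_trainable_fraction(d_in: int, d_out: int, rank: int) -> float:
    """Extra trainable parameters LoRA adds to one frozen d_out x d_in weight,
    expressed as a fraction of that weight's parameter count."""
    frozen = d_in * d_out
    # LoRA learns W + B @ A, with A of shape (rank, d_in) and B of shape (d_out, rank).
    adapter = rank * (d_in + d_out)
    return adapter / frozen


def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float, beta: float = 0.1) -> float:
    """DPO loss for a single preference pair.

    Each argument is the summed log-probability of a full response under the
    policy or the frozen reference model. The implicit reward of a response is
    beta * (policy log-prob - reference log-prob).
    """
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # -log(sigmoid(margin)), written as softplus(-margin) for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Hypothetical layer: 4096 x 4096 projection adapted with rank 8.
    print(f"LoRA fraction: {lora_trainable_fraction(4096, 4096, 8):.4%}")
    # Hypothetical log-probs where the policy already leans toward the chosen response.
    print(f"DPO loss: {dpo_loss(-12.0, -15.0, -13.0, -14.0):.4f}")
```

For the assumed 4096 × 4096 layer at rank 8, the adapter amounts to about 0.39% of the frozen weight, which falls inside the 0.1–1% range quoted in the table. The whole-model fraction depends on which matrices are adapted. For DPO, a pair on which the policy and reference agree exactly gives a loss of ln 2 (about 0.693), and the loss falls below that as the policy shifts probability toward the chosen response relative to the reference.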