Model Training & Optimization Cheat Sheet

Updated 2026-04-28

Model training and optimization is the process of systematically improving neural network performance through algorithmic techniques that adjust weights, manage learning dynamics, and prevent overfitting. This encompasses gradient descent methods, learning rate strategies, regularization, efficient fine-tuning, and distributed training tactics that determine how effectively models learn from data. Understanding these mechanisms is essential because even the best architecture will fail without proper optimization — choosing the right optimizer, learning rate schedule, and regularization approach often makes the difference between a model that converges to high accuracy and one that struggles or overfits. The field has evolved rapidly: parameter-efficient fine-tuning (LoRA, QLoRA) and new optimizers (Muon, Lion) are challenging Adam's decade-long dominance. A key mental model: optimization is fundamentally about navigating a high-dimensional loss landscape to find parameter values that generalize well, not just minimize training error.

What This Cheat Sheet Covers

This topic spans 13 focused tables and 107 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.

Table 1: Gradient Descent Variants
Table 2: Adaptive Learning Rate Optimizers
Table 3: Learning Rate Scheduling
Table 4: Regularization Techniques
Table 5: Normalization Methods
Table 6: Weight Initialization Methods
Table 7: Loss Functions
Table 8: Training Strategies and Techniques
Table 9: Fine-tuning & PEFT Techniques
Table 10: Distributed Training Strategies
Table 11: Convergence and Monitoring
Table 12: Hyperparameter Tuning
Table 13: Advanced Optimization Concepts

Table 1: Gradient Descent Variants

Every variant here answers the same question differently: how much data do you look at before each weight update? The spectrum runs from full-batch (the whole dataset, stable but slow) through mini-batch (the everyday default) down to pure stochastic (one example at a time, noisy but fast), with momentum and Nesterov layering velocity on top to smooth and speed the descent.
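The "how much data per update" question can be written as a single update rule. The notation here is an assumption for illustration, not taken from the source: $\theta$ for the parameters, $\eta$ for the learning rate, $\ell_i$ for the loss on example $i$, $B_t$ for the set of examples used at step $t$, and $N$ for the dataset size.

$$\theta_{t+1} = \theta_t - \eta \cdot \frac{1}{|B_t|} \sum_{i \in B_t} \nabla_\theta \, \ell_i(\theta_t)$$

Under this sketch, full-batch gradient descent corresponds to $|B_t| = N$, mini-batch to a small $|B_t|$ (typically 32–256), and pure stochastic gradient descent to $|B_t| = 1$. Only the size of $B_t$ changes; the rule itself stays the same.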

Algorithm	Example	Description
Mini-batch Gradient Descent	`for batch in batches:` `loss = compute_loss(batch)` `gradients = compute_gradients(loss)` `params -= lr * gradients`	• Uses small subsets (typically 32–256 examples) to balance speed and stability • most widely used in practice • enables efficient GPU parallelization.
Stochastic Gradient Descent (SGD)	`for x, y in dataset:` `loss = compute_loss(x, y)` `gradients = compute_gradients(loss)` `params -= lr * gradients`	• Updates parameters after each single example • noisy but fast • provides regularization through stochasticity • can escape shallow local minima.
Momentum	`v = beta * v + (1-beta) * grad` `params -= lr * v`	• Keeps a velocity `v`, an exponentially decaying moving average of past gradients, and steps along it instead of the raw gradient • accelerates convergence and dampens oscillations • typical $\beta = 0.9$.
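The three rows above can be exercised with one small training loop. What follows is a minimal sketch, not a reference implementation: the helper names `mse_gradient` and `train`, the noiseless synthetic linear-regression data, and every hyperparameter value are assumptions chosen for illustration. Setting `batch_size` to the dataset size gives full-batch descent, `1` gives pure SGD, and anything in between gives mini-batch. `beta > 0` switches on the moving-average momentum update from the table.

```python
import numpy as np

def mse_gradient(w, X, y):
    """Gradient of 0.5 * mean((X @ w - y)**2) with respect to w."""
    residual = X @ w - y
    return X.T @ residual / len(y)

def train(X, y, batch_size, lr=0.1, beta=0.0, epochs=200, seed=0):
    """Gradient descent on a linear model (hypothetical helper).

    batch_size == len(y) -> full-batch; batch_size == 1 -> pure SGD;
    values in between -> mini-batch. With beta == 0 the velocity v equals
    the current gradient, so the update reduces to params -= lr * grad.
    """
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    v = np.zeros_like(w)
    n = len(y)
    for _ in range(epochs):
        order = rng.permutation(n)  # reshuffle the examples every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            grad = mse_gradient(w, X[idx], y[idx])
            v = beta * v + (1.0 - beta) * grad  # momentum as in the table
            w -= lr * v
    return w

# Synthetic, noiseless data (an assumption made so every variant can
# recover the same weights).
data_rng = np.random.default_rng(42)
X = data_rng.normal(size=(256, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w

w_full = train(X, y, batch_size=256)                   # full-batch
w_mini = train(X, y, batch_size=32)                    # mini-batch
w_sgd = train(X, y, batch_size=1, lr=0.01, epochs=20)  # pure SGD, smaller step
w_mom = train(X, y, batch_size=32, beta=0.9)           # mini-batch + momentum
```

The SGD call uses a smaller learning rate because single-example gradients are much noisier than batch averages. On real, noisy data the variants would not all land on the same weights; the clean agreement here comes from the noiseless setup.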
