LLM Pre-training and Scaling Laws Cheat Sheet

Updated 2026-05-18

Pre-training large language models means balancing compute budget, data quality, model architecture, and training hyperparameters according to empirical scaling laws. The best known is the Chinchilla scaling law, which shows that for compute-optimal training, model parameters and training tokens should grow in equal proportion as compute increases. This cheat sheet covers the full pre-training lifecycle, from tokenization strategies and data curation through distributed training frameworks, optimizer configuration, stability techniques, and monitoring, so that practitioners can train models efficiently at scale.

What This Cheat Sheet Covers

This topic spans 14 focused tables and 71 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.

Table 1: Scaling Laws and Compute Budget
Table 2: Tokenization Strategies
Table 3: Training Data Curation
Table 4: Distributed Training Frameworks
Table 5: Pre-training Objectives
Table 6: Optimizer Configuration
Table 7: Learning Rate Schedules
Table 8: Mixed Precision Training
Table 9: Training Stability
Table 10: Architecture Configuration
Table 11: Data Quality and Deduplication
Table 12: Monitoring and Checkpointing
Table 13: Regularization and Overfitting Prevention
Table 14: Batch Size and Training Dynamics

Table 1: Scaling Laws and Compute Budget

Understanding how model performance scales with compute, parameters, and data is foundational to efficient pre-training. The Chinchilla scaling laws give the rule for compute-optimal training, while the earlier Kaplan scaling laws describe the power-law relationship between loss and resources. Together they guide practitioners in allocating a limited compute budget between model size and training tokens.

Law	Example	Description
Chinchilla Scaling Law	$N_{opt} \propto C^{0.5}$, $D_{opt} \propto C^{0.5}$	• Compute-optimal model size $N$ and training tokens $D$ each scale as the square root of the compute budget $C$ • every doubling of model size calls for a doubling of training tokens • in practice this works out to roughly 20 tokens per parameter
Chinchilla Ratio	`70B params × 20 = 1.4T tokens`	• Roughly 20 tokens per parameter is the compute-optimal ratio • Chinchilla 70B, trained on 1.4 trillion tokens, outperformed Gopher 280B, trained on 300B tokens, at a similar compute budget
Kaplan Scaling Law	$L(N) = (N_c / N)^\alpha$, $\alpha \approx 0.076$	• Loss falls as a power law in model size $N$ when data is not the bottleneck • with $\alpha \approx 0.076$, doubling parameters multiplies loss by $2^{-0.076} \approx 0.95$, about a 5% reduction • earlier work that overemphasized model size relative to data
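
The rules in Table 1 can be turned into a quick budget calculator. The sketch below is illustrative only: it assumes the common approximation $C \approx 6ND$ for training FLOPs (not stated in the table) and treats the 20 tokens-per-parameter ratio as exact. The function names `chinchilla_optimal` and `kaplan_loss_ratio` are hypothetical helpers, not part of any library.

```python
import math

# Assumption: training compute C is approximated as 6 FLOPs per parameter per token.
FLOPS_PER_PARAM_TOKEN = 6
# From Table 1: roughly 20 training tokens per parameter.
TOKENS_PER_PARAM = 20


def chinchilla_optimal(compute_flops, tokens_per_param=TOKENS_PER_PARAM):
    """Split a FLOP budget into (parameters, tokens) under C = 6*N*D and D = r*N."""
    # Substituting D = r*N into C = 6*N*D gives N = sqrt(C / (6*r)).
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


def kaplan_loss_ratio(size_multiplier, alpha=0.076):
    """Factor by which loss changes when model size is multiplied, per L(N) = (N_c/N)^alpha."""
    return size_multiplier ** (-alpha)


# Budget implied by the table's example row: 70B parameters on 1.4T tokens.
budget = FLOPS_PER_PARAM_TOKEN * 70e9 * 1.4e12
n, d = chinchilla_optimal(budget)
print(f"params: {n:.3g}, tokens: {d:.3g}")

# Quadrupling compute doubles both parameters and tokens (both scale as C^0.5).
n4, d4 = chinchilla_optimal(4 * budget)
print(f"4x compute -> params x{n4 / n:.2f}, tokens x{d4 / d:.2f}")

# Loss factor from doubling model size under the Kaplan power law.
print(f"loss factor for 2x params: {kaplan_loss_ratio(2):.3f}")
```

By construction, the example budget recovers the 70B-parameter, 1.4T-token configuration from the Chinchilla Ratio row. Real planning would also account for hardware utilization and data availability, which this sketch ignores.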
