Pre-training large language models involves carefully balancing compute budget, data quality, model architecture, and training hyperparameters according to empirical scaling laws, most notably the Chinchilla scaling law, which reveals that for compute-optimal training, model parameters and training tokens should scale equally. This cheat sheet covers the full pre-training lifecycle: from tokenization strategies and data curation through distributed training frameworks, optimizer configuration, stability techniques, and monitoring, equipping practitioners with the knowledge to train models efficiently at scale.
What This Cheat Sheet Covers
This topic spans 14 focused tables and 71 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.
Table 1: Scaling Laws and Compute Budget
Understanding how model performance scales with compute, parameters, and data is foundational to efficient pre-training: Chinchilla scaling laws provide the formula for compute-optimal training, while Kaplan scaling laws describe the power-law relationship between loss and resources, guiding practitioners in allocating limited compute budgets.
| Law | Example | Description |
|---|---|---|
| Chinchilla scaling law | $N_{opt} \propto C^{0.5}$, $D_{opt} \propto C^{0.5}$ | Model size and training tokens scale equally with compute budget: for every doubling of model size, double the training tokens. Recommends ~20 tokens per parameter for compute-optimal training. |
| Tokens-per-parameter ratio | 70B params × 20 = 1.4T tokens | 20 tokens per parameter is the compute-optimal ratio; Chinchilla 70B trained on 1.4 trillion tokens outperformed Gopher 280B trained on 300B tokens. |
| Kaplan scaling law | $L(N) = (N_c / N)^{\alpha}$, $\alpha \approx 0.076$ | Loss scales as a power law of model size, so doubling parameters reduces loss predictably (by about 5% at this exponent). Earlier work that overemphasized model size relative to data. |
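The rows above can be turned into a quick back-of-the-envelope calculator. The sketch below is illustrative only: the function names are hypothetical, and it assumes the common approximation $C \approx 6ND$ for training FLOPs, which does not appear in this cheat sheet. With $D = 20N$ this gives $C = 120N^2$, so both $N$ and $D$ grow as $C^{0.5}$, matching the first row.

```python
import math

TOKENS_PER_PARAM = 20.0       # rule of thumb from the table above
FLOPS_PER_PARAM_TOKEN = 6.0   # ASSUMPTION: C ~= 6 * N * D, not stated in this sheet
KAPLAN_ALPHA_N = 0.076        # exponent from the Kaplan row above


def compute_optimal_allocation(compute_flops: float) -> tuple[float, float]:
    """Return (params, tokens) that spend compute_flops with D = 20 * N."""
    # C = 6 * N * D and D = 20 * N  =>  C = 120 * N^2  =>  N = sqrt(C / 120)
    params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    tokens = TOKENS_PER_PARAM * params
    return params, tokens


def kaplan_loss_ratio(param_multiplier: float, alpha: float = KAPLAN_ALPHA_N) -> float:
    """Ratio L(k * N) / L(N) under L(N) = (N_c / N)^alpha; N_c cancels out."""
    return param_multiplier ** (-alpha)


if __name__ == "__main__":
    # Budget implied by a 70B-parameter model trained on 1.4T tokens.
    budget = FLOPS_PER_PARAM_TOKEN * 70e9 * 1.4e12
    n_opt, d_opt = compute_optimal_allocation(budget)
    print(f"budget = {budget:.2e} FLOPs")
    print(f"params = {n_opt:.2e}, tokens = {d_opt:.2e}")
    print(f"loss ratio when doubling params: {kaplan_loss_ratio(2.0):.3f}")
```

Under these assumptions, the budget implied by the 70B / 1.4T example maps back to the same 70B parameters and 1.4T tokens, and quadrupling the budget doubles both. The Kaplan helper only compares losses at two model sizes, so it does not need a value for $N_c$.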