LLM Pre-training and Scaling Laws Cheat Sheet

Updated 2026-05-18

Pre-training large language models involves balancing compute budget, data quality, model architecture, and training hyperparameters according to empirical scaling laws. The best known is the Chinchilla scaling law, which shows that for compute-optimal training, model parameters and training tokens should scale in equal proportion. This cheat sheet covers the full pre-training lifecycle, from tokenization strategies and data curation through distributed training frameworks, optimizer configuration, stability techniques, and monitoring, so that practitioners can train models efficiently at scale.

What This Cheat Sheet Covers

This topic spans 14 focused tables and 71 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.

  • Table 1: Scaling Laws and Compute Budget
  • Table 2: Tokenization Strategies
  • Table 3: Training Data Curation
  • Table 4: Distributed Training Frameworks
  • Table 5: Pre-training Objectives
  • Table 6: Optimizer Configuration
  • Table 7: Learning Rate Schedules
  • Table 8: Mixed Precision Training
  • Table 9: Training Stability
  • Table 10: Architecture Configuration
  • Table 11: Data Quality and Deduplication
  • Table 12: Monitoring and Checkpointing
  • Table 13: Regularization and Overfitting Prevention
  • Table 14: Batch Size and Training Dynamics

Table 1: Scaling Laws and Compute Budget

Understanding how model performance scales with compute, parameters, and data is foundational to efficient pre-training. The Chinchilla scaling laws give the formula for compute-optimal training, while the Kaplan scaling laws describe the power-law relationship between loss and resources; together they guide practitioners in allocating a limited compute budget.

| Law | Example | Description |
| --- | --- | --- |
| Chinchilla Scaling Law | $N_{opt} \propto C^{0.5}$, $D_{opt} \propto C^{0.5}$ | Model size and training tokens scale equally with compute budget: for every doubling of model size, double the training tokens. Recommends ~20 tokens per parameter for compute-optimal training. |
| Chinchilla Ratio | 70B params × 20 = 1.4T tokens | 20 tokens per parameter is the compute-optimal ratio; training Chinchilla 70B on 1.4 trillion tokens achieved better performance than Gopher 280B trained on 300B tokens. |
| Kaplan Scaling Law | $L(N) = (N_c / N)^{\alpha}$, $\alpha \approx 0.076$ | Loss scales as a power law of model size, so doubling parameters reduces loss predictably. This is earlier work that overemphasized model size relative to data. |
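
The rules in the table above can be turned into a quick budgeting calculation. The following is a minimal sketch, not a definitive implementation. It relies on two assumptions that do not appear in this cheat sheet: the common approximation that training cost is C ≈ 6·N·D FLOPs, and the value N_c ≈ 8.8e13 for the Kaplan constant. The function names are illustrative only.

```python
import math

TOKENS_PER_PARAM = 20.0        # ratio quoted in the table above
FLOPS_PER_PARAM_TOKEN = 6.0    # assumption: C ≈ 6 * N * D (not stated in this sheet)


def training_flops(n_params, n_tokens):
    """Approximate training compute under the assumed C ≈ 6 * N * D rule."""
    return FLOPS_PER_PARAM_TOKEN * n_params * n_tokens


def chinchilla_optimal(compute_flops, tokens_per_param=TOKENS_PER_PARAM):
    """Split a compute budget into (params, tokens) at a fixed tokens-per-parameter ratio.

    With C = 6 * N * D and D = r * N, solving for N gives N = sqrt(C / (6 * r)),
    so both N and D grow as C ** 0.5.
    """
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


def kaplan_loss(n_params, n_c=8.8e13, alpha=0.076):
    """L(N) = (N_c / N) ** alpha; the default n_c is an assumed constant, not from this sheet."""
    return (n_c / n_params) ** alpha


if __name__ == "__main__":
    budget = training_flops(70e9, 1.4e12)   # compute implied by the 70B / 1.4T row
    n, d = chinchilla_optimal(budget)
    print(f"budget={budget:.3e} FLOPs -> params={n:.3e}, tokens={d:.3e}")

    n4, d4 = chinchilla_optimal(4 * budget)  # 4x the compute
    print(f"4x budget -> params x{n4 / n:.2f}, tokens x{d4 / d:.2f}")

    ratio = kaplan_loss(2e9) / kaplan_loss(1e9)
    print(f"loss ratio after doubling params: {ratio:.4f}")
```

Under these assumptions, quadrupling the compute budget doubles both the parameter count and the token count, which is what the two C^{0.5} exponents express. With α ≈ 0.076, doubling the parameter count multiplies the predicted loss by 2^(-0.076), a reduction of roughly 5%.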
