Foundation models represent a paradigm shift in artificial intelligence: large-scale neural networks pre-trained on massive, diverse datasets that serve as general-purpose starting points for a wide range of downstream tasks. Unlike traditional task-specific models trained from scratch, foundation models such as GPT, BERT, T5, and their successors use transfer learning to adapt their broad knowledge to specialized domains with minimal additional training. The key insight is that scale enables emergence: as models grow in parameters, data, and compute, they develop capabilities such as few-shot learning, reasoning, and cross-domain generalization that were never explicitly programmed. By 2026, the frontier has split along two axes: training-time scaling (more parameters, more data) and inference-time scaling (more compute at generation). Leading models like GPT-5, Gemini 3.1, and Claude Opus 4.7 exploit both simultaneously, while open-weight alternatives like LLaMA 4 and DeepSeek V4 have closed the capability gap at a fraction of the cost.
What This Cheat Sheet Covers
This topic spans 16 focused tables and 117 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.
Table 1: Pre-Training Objectives
Pre-training determines the "prior knowledge" a foundation model carries into every downstream task. Choosing the right objective shapes whether a model excels at generation, understanding, or both—and determines the architecture that fits naturally with it.
| Objective | Example | Description |
|---|---|---|
| Causal language modeling (CLM) | Predict next token: "The cat sat" → "on" | • Autoregressive objective predicting the next token given all previous tokens • uses unidirectional (left-to-right) attention • foundation of the GPT family • enables natural text generation |
| Masked language modeling (MLM) | Mask and predict: "The [MASK] sat on mat" → "cat" | • Bidirectional objective masking ~15% of tokens and predicting them from the full context • used in BERT • better suited to understanding tasks than to generation |
| Span corruption | Mask spans: `The <X> on the <Y>` → `<X> cat sat <Y> mat` | • Sequence-to-sequence objective masking contiguous token spans • model predicts all masked spans in order • used in T5 • encourages learning longer-range dependencies than single-token MLM |
| Replaced token detection (RTD) | Discriminate real vs fake: "The cat sat" vs "The dog sat" → [real, fake, real] | • Discriminative objective predicting which tokens were replaced by a generator • used in ELECTRA • sample-efficient alternative to MLM: trains on all input positions rather than only the ~15% that are masked |
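
To make the four objectives concrete, the sketch below builds (input, target) pairs for each one from a whitespace-tokenized sentence. It is a simplified illustration, not any library's implementation. The function names are hypothetical. The `<X>`/`<Y>` sentinels follow the table above rather than a real tokenizer vocabulary. The masking helper omits details of real BERT training, such as sometimes keeping or randomizing a selected token instead of masking it.

```python
import random

MASK = "[MASK]"
SENTINELS = ["<X>", "<Y>", "<Z>"]  # placeholder sentinels, as in the table above


def clm_pairs(tokens):
    """Causal LM: every prefix is paired with the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def pick_mask_positions(n_tokens, rate=0.15, seed=0):
    """Sample roughly `rate` of the positions (at least one) to mask."""
    k = max(1, round(rate * n_tokens))
    return set(random.Random(seed).sample(range(n_tokens), k))


def mlm_example(tokens, mask_positions):
    """Masked LM: hide the chosen positions; targets are the hidden tokens."""
    inputs = [MASK if i in mask_positions else t for i, t in enumerate(tokens)]
    targets = {i: tokens[i] for i in sorted(mask_positions)}
    return inputs, targets


def span_corruption(tokens, spans):
    """Span corruption: replace each (start, end) span with one sentinel.

    The target lists each sentinel followed by the tokens it replaced.
    Spans are half-open, non-overlapping index ranges.
    """
    inputs, targets, cursor = [], [], 0
    for sentinel, (start, end) in zip(SENTINELS, sorted(spans)):
        inputs += tokens[cursor:start] + [sentinel]
        targets += [sentinel] + tokens[start:end]
        cursor = end
    inputs += tokens[cursor:]
    return inputs, targets


def rtd_labels(original, corrupted):
    """Replaced token detection: label every position as real or fake."""
    return ["real" if o == c else "fake" for o, c in zip(original, corrupted)]


if __name__ == "__main__":
    sentence = "The cat sat on the mat".split()
    print(clm_pairs(sentence[:4])[-1])
    print(mlm_example(sentence, pick_mask_positions(len(sentence))))
    print(span_corruption(sentence, [(1, 3), (5, 6)]))
    print(rtd_labels(["The", "cat", "sat"], ["The", "dog", "sat"]))
```

Note the difference in training signal: `mlm_example` yields targets only at the masked positions, while `rtd_labels` yields a label for every position, which is the source of the sample-efficiency advantage noted for ELECTRA.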