Foundation Models in AI Cheat Sheet

Updated 2026-05-25

Foundation models represent a paradigm shift in artificial intelligence: large-scale neural networks pre-trained on massive, diverse datasets that serve as general-purpose starting points for a wide range of downstream tasks. Unlike traditional task-specific models trained from scratch, foundation models such as GPT, BERT, T5, and their successors rely on transfer learning to adapt their broad knowledge to specialized domains with minimal additional training. The key insight is that scale enables emergence: as models grow in parameters, data, and compute, they develop capabilities such as few-shot learning, reasoning, and cross-domain generalization that were never explicit training targets. By 2026, the frontier has split along two axes, training-time scaling (more parameters, more data) and inference-time scaling (more compute at generation), with leading models like GPT-5, Gemini 3.1, and Claude Opus 4.7 exploiting both simultaneously, while open-weight alternatives like LLaMA 4 and DeepSeek V4 have closed the capability gap at a fraction of the cost.

What This Cheat Sheet Covers

This topic spans 16 focused tables and 117 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.

Table 1: Pre-Training Objectives
Table 2: Model Architecture Families
Table 3: Major Foundation Model Series
Table 4: Scaling Laws and Emergent Abilities
Table 5: Test-Time Compute and Reasoning Methods
Table 6: Fine-Tuning and Adaptation Strategies
Table 7: Prompting Techniques
Table 8: Evaluation Benchmarks
Table 9: Tokenization Methods
Table 10: Context Window and Positional Encoding
Table 11: Model Compression and Optimization
Table 12: Inference Optimization and Serving
Table 13: Evaluation Metrics and Properties
Table 14: Model Deployment Considerations
Table 15: Multimodal and Cross-Modal Capabilities
Table 16: Domain-Specific Foundation Models

Table 1: Pre-Training Objectives

Pre-training determines the "prior knowledge" a foundation model carries into every downstream task. Choosing the right objective shapes whether a model excels at generation, understanding, or both—and determines the architecture that fits naturally with it.

Objective	Example	Description
Causal Language Modeling (CLM)	Predict next token: `"The cat sat" → "on"`	• Autoregressive objective that predicts the next token given all previous tokens • Uses unidirectional (left-to-right) attention • Foundation of the GPT family • Enables natural text generation
Masked Language Modeling (MLM)	Mask and predict: `"The [MASK] sat on mat" → "cat"`	• Bidirectional objective that masks ~15% of tokens and predicts them from the full context • Used in BERT • Better suited to understanding tasks than to generation
Span Corruption	Mask spans: `"The <X> on the <Y>" → "<X> cat sat <Y> mat"`	• Sequence-to-sequence objective that masks contiguous token spans • Model predicts all masked spans in order • Used in T5 • Encourages learning longer-range dependencies than single-token MLM
Replaced Token Detection	Discriminate real vs fake: `"The cat sat" vs "The dog sat" → [real, fake, real]`	• Discriminative objective that predicts which tokens were replaced by a generator • Used in ELECTRA • Sample-efficient alternative to MLM: the loss covers all positions instead of just the ~15% that are masked
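
The four objectives in Table 1 differ mainly in how a raw token sequence is turned into (input, target) pairs. The sketch below reproduces the table's own examples in plain Python. It is illustrative only: the function names (`clm_pairs`, `mlm_example`, `span_corruption`, `rtd_labels`) are hypothetical, tokens are whitespace-split words rather than subword IDs, and the masked positions and spans are passed in by hand, whereas real pipelines sample them randomly. The `<X>` and `<Y>` sentinels follow the table's notation, not any specific tokenizer's vocabulary.

```python
def clm_pairs(tokens):
    """Causal LM: every prefix is an input, the token after it is the target."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def mlm_example(tokens, mask_positions, mask_token="[MASK]"):
    """Masked LM: hide the chosen positions; targets are the original tokens there.

    Assumption: positions are given explicitly; BERT-style training samples
    roughly 15% of positions at random instead.
    """
    masked = [mask_token if i in mask_positions else t
              for i, t in enumerate(tokens)]
    targets = {i: tokens[i] for i in sorted(mask_positions)}
    return masked, targets


def span_corruption(tokens, spans, sentinels=("<X>", "<Y>", "<Z>")):
    """Span corruption: replace each (start, end) span with one sentinel.

    The target lists each sentinel followed by the tokens it replaced, in order.
    """
    if len(spans) > len(sentinels):
        raise ValueError("more spans than available sentinels")
    inputs, targets, cursor = [], [], 0
    for sentinel, (start, end) in zip(sentinels, sorted(spans)):
        inputs.extend(tokens[cursor:start])   # keep text before the span
        inputs.append(sentinel)               # span collapses to one sentinel
        targets.append(sentinel)
        targets.extend(tokens[start:end])     # the hidden tokens become targets
        cursor = end
    inputs.extend(tokens[cursor:])            # keep text after the last span
    return inputs, targets


def rtd_labels(original, corrupted):
    """Replaced token detection: label every position, not only replaced ones."""
    return ["real" if o == c else "fake" for o, c in zip(original, corrupted)]


if __name__ == "__main__":
    print(clm_pairs(["The", "cat", "sat", "on"])[-1])
    # (['The', 'cat', 'sat'], 'on')
    print(mlm_example("The cat sat on mat".split(), {1}))
    # (['The', '[MASK]', 'sat', 'on', 'mat'], {1: 'cat'})
    print(span_corruption("The cat sat on the mat".split(), [(1, 3), (5, 6)]))
    # (['The', '<X>', 'on', 'the', '<Y>'], ['<X>', 'cat', 'sat', '<Y>', 'mat'])
    print(rtd_labels(["The", "cat", "sat"], ["The", "dog", "sat"]))
    # ['real', 'fake', 'real']
```

Note how `rtd_labels` yields a label for every position while `mlm_example` yields targets only at masked positions; that difference is the source of the sample-efficiency claim in the table.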
