Foundation models represent a paradigm shift in artificial intelligence: large-scale neural networks pre-trained on massive, diverse datasets that serve as general-purpose starting points for a wide range of downstream tasks. Unlike traditional task-specific models trained from scratch, foundation models such as GPT, BERT, T5, and their successors rely on transfer learning to adapt their broad knowledge to specialized domains with comparatively little additional training. The key insight is that scale enables emergence: as models grow in parameters, data, and compute, they develop capabilities such as few-shot learning, reasoning, and cross-domain generalization that were never explicit training targets. Understanding foundation models means grasping how pre-training objectives, scaling laws, and adaptation strategies combine to produce systems that can be fine-tuned for tasks ranging from code generation to medical diagnosis far more efficiently than training a separate model for each task.
What This Cheat Sheet Covers
This topic spans 15 focused tables and 94 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.
Table 1: Pre-Training Objectives
| Objective | Example | Description |
|---|---|---|
| Causal language modeling (CLM) | Predict next token: "The cat sat" → "on" | • Autoregressive objective in which the model predicts the next token given all previous tokens • uses unidirectional (left-to-right) attention • foundation of the GPT family • enables natural text generation. |
| Masked language modeling (MLM) | Mask and predict: "The [MASK] sat on mat" → "cat" | • Bidirectional objective in which random tokens are masked and predicted from the full context • typically masks 15% of tokens • used in BERT • better suited to understanding tasks than to generation. |
| Span corruption | Mask spans: "The <X> on the <Y>" → "<X> cat sat <Y> mat" | • Sequence-to-sequence objective that masks contiguous token spans • the model predicts all masked spans in order • used in T5 • encourages learning longer-range dependencies than single-token MLM. |
| Next sentence prediction (NSP) | Binary task: "I love dogs. [SEP] They are loyal." → IsNext | • Binary classification predicting whether sentence B follows sentence A • used alongside MLM in the original BERT • largely dropped in later models (e.g., RoBERTa) because it brought little measurable benefit. |
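
The objectives in Table 1 differ mainly in how a raw token sequence is turned into (input, target) pairs. The following is a minimal sketch of that data preparation in plain Python, not the pipeline of any specific model. All function names (`clm_pairs`, `mlm_mask`, `span_corrupt`, `nsp_example`) are hypothetical, tokens are whitespace-split words rather than subwords, and the MLM step is simplified: it always substitutes `[MASK]`, whereas BERT also sometimes keeps the original token or swaps in a random one.

```python
import random

MASK = "[MASK]"
SENTINELS = ["<X>", "<Y>", "<Z>"]  # illustrative sentinel names, as in Table 1


def clm_pairs(tokens):
    """Causal LM: every prefix is an input, the token that follows is the target."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def mlm_mask(tokens, mask_prob=0.15, rng=None):
    """Masked LM (simplified): hide random tokens; only masked positions get a label."""
    if rng is None:
        rng = random.Random(0)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)   # the model must recover this token
        else:
            inputs.append(tok)
            labels.append(None)  # position excluded from the loss
    return inputs, labels


def span_corrupt(tokens, spans):
    """Span corruption: replace each (start, end) span with a sentinel.

    `spans` must be sorted and non-overlapping. The target lists each
    sentinel followed by the tokens it replaced.
    """
    if len(spans) > len(SENTINELS):
        raise ValueError("more spans than available sentinels")
    inputs, target, cursor = [], [], 0
    for sentinel, (start, end) in zip(SENTINELS, spans):
        inputs.extend(tokens[cursor:start])
        inputs.append(sentinel)
        target.append(sentinel)
        target.extend(tokens[start:end])
        cursor = end
    inputs.extend(tokens[cursor:])
    return inputs, target


def nsp_example(sent_a, sent_b, is_next):
    """Next sentence prediction: a sentence pair plus a binary label."""
    return (f"{sent_a} [SEP] {sent_b}", "IsNext" if is_next else "NotNext")


if __name__ == "__main__":
    words = "The cat sat on the mat".split()
    print(clm_pairs(words[:4])[-1])       # (['The', 'cat', 'sat'], 'on')
    inp, tgt = span_corrupt(words, [(1, 3), (5, 6)])
    print(" ".join(inp))                  # The <X> on the <Y>
    print(" ".join(tgt))                  # <X> cat sat <Y> mat
    print(mlm_mask(words, mask_prob=0.5))
    print(nsp_example("I love dogs.", "They are loyal.", True))
```

The span corruption call reproduces the Table 1 example exactly. The MLM output depends on the random seed, so no fixed result is shown for it.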