Data-centric AI shifts focus from model architecture optimization to systematic improvement of training data quality, diversity, and consistency. Modern ML systems require robust data management practices spanning versioning, validation, augmentation, documentation, and governance to ensure reproducible, ethical, and high-performing models at scale.
What This Cheat Sheet Covers
This cheat sheet spans 13 focused tables and 91 indexed concepts. The table-by-table outline below runs from foundational concepts through advanced details.
Table 1: Data-Centric AI Core Principles
Data-centric AI prioritizes improving dataset quality over increasing model complexity, on the premise that better data tends to yield better models. It treats systematic data engineering practices, including consistency checking, error detection, and iterative dataset refinement, as the primary drivers of model performance improvements.
| Principle | Example | Description |
|---|---|---|
| | Focus on fixing mislabeled samples in training data rather than adding more layers to the model | Andrew Ng's data-centric AI approach: small, high-quality datasets often outperform large, noisy datasets when paired with modern architectures |
| | Standardize annotation guidelines so all annotators label "ambiguous" examples the same way | Reduces label noise that degrades model generalization; critical for multi-annotator datasets |
| | Generate synthetic edge cases to balance underrepresented classes in medical imaging datasets | Augmentation as a data engineering strategy rather than a model training trick; focuses on principled dataset expansion |
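
The label-consistency row above can be made concrete with an agreement check between two annotators. The following is a minimal sketch, not a prescribed method: it assumes exactly two annotators labeled the same items in the same order, uses Cohen's kappa as one common agreement measure, and the names `cohen_kappa`, `disputed_items`, `annotator_a`, and `annotator_b` are hypothetical, as is the toy data.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: computed from each annotator's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical class for every item.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def disputed_items(labels_a, labels_b):
    """Indices where the annotators disagree; candidates for guideline review."""
    return [i for i, (a, b) in enumerate(zip(labels_a, labels_b)) if a != b]


# Hypothetical toy labels, for illustration only.
annotator_a = ["cat", "dog", "cat", "cat", "dog", "dog"]
annotator_b = ["cat", "dog", "dog", "cat", "dog", "cat"]

kappa = cohen_kappa(annotator_a, annotator_b)
review_queue = disputed_items(annotator_a, annotator_b)
print(f"kappa={kappa:.3f}, items to review: {review_queue}")
```

A low kappa on a sample batch suggests the annotation guidelines need tightening before more data is labeled, and the disputed indices show which examples to discuss first.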