Missing data is a pervasive challenge in data science and machine learning, arising in virtually every real-world dataset from sensor failures, survey non-response, data integration issues, or intentional omissions. Understanding the mechanism behind the missingness—whether data is Missing Completely At Random (MCAR), Missing At Random (MAR), or Missing Not At Random (MNAR)—is critical because it fundamentally determines which imputation strategies produce valid, unbiased results and which introduce systematic distortion. Rather than simply deleting incomplete observations, modern practitioners employ a sophisticated toolkit spanning statistical methods, machine learning models, and deep learning architectures to reconstruct missing values while preserving distributional properties and relationships. The key insight is that missing values themselves carry information: the pattern and location of missingness can be engineered as features, and the choice between deletion, simple imputation, or complex multivariate approaches must balance statistical rigor, computational efficiency, and the ultimate use case.
What This Cheat Sheet Covers
This topic spans 17 focused tables and 85 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.
Table 1: Missing Data Mechanisms
| Mechanism | Example | Description |
|---|---|---|
| MCAR (Missing Completely At Random) | `survey_df.loc[survey_df.sample(n=100).index, 'income'] = np.nan` | Missingness is independent of both observed and unobserved data • Every case has the same probability of being missing • Complete case analysis produces unbiased estimates, at the cost of a smaller sample • Most restrictive (strongest) assumption |
| MAR (Missing At Random) | `df.loc[df['age'] > 65, 'income'] = np.nan` | Missingness depends only on observed data, not on the missing values themselves • Missingness can be predicted from other variables in the dataset • Most common assumption in practice • Allows valid imputation using observed data |
| MNAR (Missing Not At Random) | `df.loc[df['depression_score'] > 8, 'depression_score'] = np.nan` | Missingness depends on the unobserved values themselves • For example, individuals with high scores are less likely to report them • Cannot be fully addressed without external information • Requires sensitivity analysis or specialized models |
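
The practical difference between the three mechanisms is easiest to see by simulation. The following is a minimal sketch, not a definitive recipe: the synthetic `age` and `income` columns, the linear relationship between them, the missingness rates, and the thresholds are all assumptions chosen for illustration. It measures how far the complete-case mean of `income` drifts from the full-data mean under each mechanism, then checks whether a simple regression on the observed `age` column can repair the MAR case.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
n = 10_000

# Synthetic data (assumption): income rises linearly with age, plus noise.
age = rng.integers(18, 90, size=n)
income = 20_000 + 500 * age + rng.normal(0, 5_000, size=n)
df = pd.DataFrame({"age": age, "income": income})
true_mean = df["income"].mean()

# Boolean masks marking which income values go missing under each mechanism.
high_income = df["income"] > df["income"].quantile(0.75)
masks = {
    # MCAR: every row has the same 30% chance of being missing.
    "MCAR": rng.random(n) < 0.30,
    # MAR: missingness depends only on the fully observed 'age' column.
    "MAR": ((df["age"] > 65) & (rng.random(n) < 0.80)).to_numpy(),
    # MNAR: missingness depends on the income value itself.
    "MNAR": (high_income & (rng.random(n) < 0.80)).to_numpy(),
}

# Bias of the complete-case mean relative to the full-data mean.
bias = {
    name: df["income"].mask(mask).mean() - true_mean
    for name, mask in masks.items()
}

# Under MAR, a model fitted on observed rows can fill the gaps:
# regress income on age, then predict income where it is missing.
observed = df.loc[~masks["MAR"]]
slope, intercept = np.polyfit(observed["age"], observed["income"], deg=1)
imputed = df["income"].mask(masks["MAR"])
imputed = imputed.fillna(intercept + slope * df["age"])
mar_imputed_bias = imputed.mean() - true_mean

for name, value in bias.items():
    print(f"{name}: complete-case bias = {value:,.0f}")
print(f"MAR after regression imputation: bias = {mar_imputed_bias:,.0f}")
```

In this setup the MCAR bias should stay close to zero, while the MAR and MNAR complete-case means are pulled downward because high earners are dropped disproportionately. The regression step helps in the MAR case only because `age`, the variable that drives the missingness, is fully observed. No comparable repair is available for MNAR from the observed data alone.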