Missing Data Analysis and Imputation Cheat Sheet

Updated 2026-05-15


Missing data is a pervasive challenge in data science and machine learning, arising in virtually every real-world dataset from sensor failures, survey non-response, data integration issues, or intentional omissions. Understanding the mechanism behind the missingness—whether data is Missing Completely At Random (MCAR), Missing At Random (MAR), or Missing Not At Random (MNAR)—is critical because it fundamentally determines which imputation strategies produce valid, unbiased results and which introduce systematic distortion. Rather than simply deleting incomplete observations, modern practitioners employ a sophisticated toolkit spanning statistical methods, machine learning models, and deep learning architectures to reconstruct missing values while preserving distributional properties and relationships. The key insight is that missing values themselves carry information: the pattern and location of missingness can be engineered as features, and the choice between deletion, simple imputation, or complex multivariate approaches must balance statistical rigor, computational efficiency, and the ultimate use case.

What This Cheat Sheet Covers

This topic spans 17 focused tables and 85 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.

Table 1: Missing Data Mechanisms
Table 2: Missing Data Visualization
Table 3: Univariate Imputation Strategies
Table 4: Multivariate Imputation Methods
Table 5: Deletion Strategies
Table 6: Advanced Deep Learning Imputation
Table 7: Imputation in Machine Learning Pipelines
Table 8: Handling Missingness as a Feature
Table 9: Time Series Specific Imputation
Table 10: Evaluation of Imputation Quality
Table 11: Missing Value Generation for Testing (Amputation)
Table 12: Specialized Imputation Libraries
Table 13: Matrix Factorization and Collaborative Filtering
Table 14: Regression-Based Imputation
Table 15: Categorical Data Imputation
Table 16: Model-Based and Probabilistic Imputation
Table 17: Best Practices and Guidelines

Table 1: Missing Data Mechanisms

Before choosing any method, you have to ask why the data is missing. The answer falls into three classic categories: MCAR, MAR, and MNAR. This distinction is not academic hair-splitting; it dictates which imputation strategies stay unbiased and which quietly corrupt your results. Little's MCAR test can indicate whether MCAR is plausible, but no test on the observed data alone can distinguish MAR from MNAR.

Mechanism	Example	Description
MCAR (Missing Completely At Random)	`survey_df.loc[survey_df.sample(n=100).index, 'income'] = np.nan`	• Missingness is independent of both observed and unobserved data • Probability of being missing is the same for all cases • Complete case analysis produces unbiased estimates, at the cost of reduced sample size and statistical power • Strongest (most restrictive) assumption
MAR (Missing At Random)	`df.loc[df['age'] > 65, 'income'] = np.nan`	• Missingness depends only on observed data, not on the missing values themselves • Missingness can be predicted from other variables in the dataset • Most common working assumption in practice • Allows valid imputation using observed data
MNAR (Missing Not At Random)	`df.loc[df['depression_score'] > 8, 'depression_score'] = np.nan`	• Missingness depends on the unobserved values themselves • For example, individuals with high scores are less likely to report them • Cannot be fully addressed without external information • Requires sensitivity analysis or specialized models
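
The three mechanisms can be compared side by side with a small simulation. The sketch below is illustrative only: the synthetic `age` and `income` columns, the missingness probabilities, and the `ampute` helper are all assumptions made for this example, not part of any library. It removes `income` values under each mechanism and reports how far the mean of the remaining values drifts from the true mean.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Synthetic data (hypothetical): income rises with age, plus noise.
age = rng.integers(20, 80, size=n)
income = 20_000 + 800 * age + rng.normal(0, 10_000, size=n)
df = pd.DataFrame({"age": age, "income": income})


def ampute(frame, mask, column="income"):
    """Return a copy of frame with `column` set to NaN where mask is True."""
    out = frame.copy()
    out.loc[mask, column] = np.nan
    return out


# MCAR: every row has the same 30% chance of losing its income value.
mcar_mask = rng.random(n) < 0.30

# MAR: the chance depends only on the fully observed `age` column.
mar_mask = rng.random(n) < np.where(df["age"] > 65, 0.80, 0.10)

# MNAR: the chance depends on the income value that goes missing.
threshold = df["income"].quantile(0.75)
mnar_mask = rng.random(n) < np.where(df["income"] > threshold, 0.80, 0.10)

true_mean = df["income"].mean()
results = {}
for name, mask in {"MCAR": mcar_mask, "MAR": mar_mask, "MNAR": mnar_mask}.items():
    # Series.mean() skips NaN, so this is the complete-case mean.
    observed_mean = ampute(df, mask)["income"].mean()
    results[name] = observed_mean - true_mean
    print(f"{name}: complete-case mean minus true mean = {results[name]:,.0f}")
```

Under these assumptions the MCAR difference should stay close to zero, while the MAR and MNAR differences should be clearly negative because higher incomes are removed more often. The MAR bias could in principle be corrected by conditioning on `age`, which is observed; the MNAR bias cannot be corrected from the observed data alone.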
