Data Wrangling Cheat Sheet

Updated 2026-04-21

Next Topic: Databricks Asset Bundles Cheat Sheet

Data wrangling transforms raw, messy datasets into clean, analysis-ready structures. This cheat sheet covers the full spectrum of wrangling operations across pandas (v3.0+, with Copy-on-Write enabled by default), SQL (PostgreSQL / DuckDB), PySpark, Polars, and specialized tools such as Great Expectations, pandera, Soda Core, RapidFuzz, Splink, lakeFS, DVC, ydata-profiling, and OpenRefine. Techniques are ordered from foundational tasks every analyst performs daily to advanced probabilistic and distributed workflows.

What This Cheat Sheet Covers

This topic spans 23 focused tables and 146 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.

1. Missing Data Handling2. Data Type Conversion3. String Cleaning and Normalization4. String Extraction and Splitting5. Column and Index Management6. Filtering and Sampling7. Deduplication8. Reshaping Data9. Exploding and Flattening Nested Data10. Joining and Merging11. Aggregation and Grouping12. Window and Rolling Operations13. Date and Time Wrangling14. Outlier Detection and Capping15. Binning and Encoding16. Method Chaining and Pipelines17. Data Quality and Validation18. Fuzzy Matching and Deduplication19. Probabilistic Record Linkage20. Lazy and Distributed Wrangling21. Data Versioning and Lineage22. Data Profiling23. GREL Transformations (OpenRefine)

1. Missing Data Handling

Technique	Example	Description
dropna	`df.dropna(subset=["col"])`	• Drop rows (or columns with `axis=1`) where specified fields are null &bull • `how="all"` drops only if every value is missing &bull • `thresh=n` keeps rows with at least n non-null values
fillna	`df["col"].fillna(df["col"].median())`	• Replace `NaN` with a scalar, dict, Series, or method &bull • `method="ffill"` / `"bfill"` propagates last valid value forward / backward
interpolate	`df["col"].interpolate(method="linear")`	• Fill gaps using interpolation &bull • Methods include `"linear"`, `"time"`, `"polynomial"`, `"spline"` &bull • Best for ordered numeric / time-series data
pd.NA	`df = df.convert_dtypes()`	• Pandas 3.0 uses `pd.NA` (not `np.nan`) as the canonical missing sentinel for nullable dtypes &bull • Propagates correctly through boolean and integer operations

1. Missing Data Handling

Technique	Example	Description
dropna	`df.dropna(subset=["col"])`	• Drop rows (or columns with `axis=1`) where specified fields are null &bull • `how="all"` drops only if every value is missing &bull • `thresh=n` keeps rows with at least n non-null values
fillna	`df["col"].fillna(df["col"].median())`	• Replace `NaN` with a scalar, dict, Series, or method &bull • `method="ffill"` / `"bfill"` propagates last valid value forward / backward
interpolate	`df["col"].interpolate(method="linear")`	• Fill gaps using interpolation &bull • Methods include `"linear"`, `"time"`, `"polynomial"`, `"spline"` &bull • Best for ordered numeric / time-series data
pd.NA	`df = df.convert_dtypes()`	• Pandas 3.0 uses `pd.NA` (not `np.nan`) as the canonical missing sentinel for nullable dtypes &bull • Propagates correctly through boolean and integer operations