Data wrangling transforms raw, messy datasets into clean, analysis-ready structures. This cheat sheet covers the full spectrum of wrangling operations across pandas (v3.0+, with Copy-on-Write enabled by default), SQL (PostgreSQL / DuckDB), PySpark, Polars, and specialized tools such as Great Expectations, pandera, Soda Core, RapidFuzz, Splink, lakeFS, DVC, ydata-profiling, and OpenRefine. Techniques are ordered from foundational tasks every analyst performs daily to advanced probabilistic and distributed workflows.
What This Cheat Sheet Covers
This topic spans 23 focused tables and 146 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.
1. Missing Data Handling2. Data Type Conversion3. String Cleaning and Normalization4. String Extraction and Splitting5. Column and Index Management6. Filtering and Sampling7. Deduplication8. Reshaping Data9. Exploding and Flattening Nested Data10. Joining and Merging11. Aggregation and Grouping12. Window and Rolling Operations13. Date and Time Wrangling14. Outlier Detection and Capping15. Binning and Encoding16. Method Chaining and Pipelines17. Data Quality and Validation18. Fuzzy Matching and Deduplication19. Probabilistic Record Linkage20. Lazy and Distributed Wrangling21. Data Versioning and Lineage22. Data Profiling23. GREL Transformations (OpenRefine)
1. Missing Data Handling
| Technique | Example | Description |
|---|---|---|
df.dropna(subset=["col"]) | β’ Drop rows (or columns with axis=1) where specified fields are null &bullβ’ how="all" drops only if every value is missing &bullβ’ thresh=n keeps rows with at least n non-null values | |
df["col"].fillna(df["col"].median()) | β’ Replace NaN with a scalar, dict, Series, or method &bullβ’ method="ffill" / "bfill" propagates last valid value forward / backward | |
df["col"].interpolate(method="linear") | β’ Fill gaps using interpolation &bull β’ Methods include "linear", "time", "polynomial", "spline" &bullβ’ Best for ordered numeric / time-series data | |
df = df.convert_dtypes() | β’ Pandas 3.0 uses pd.NA (not np.nan) as the canonical missing sentinel for nullable dtypes &bullβ’ Propagates correctly through boolean and integer operations |