© 2026 CheatGrid™. All rights reserved.

Data Wrangling Cheat Sheet

Updated 2026-04-21

Data wrangling transforms raw, messy datasets into clean, analysis-ready structures. This cheat sheet covers the full spectrum of wrangling operations across pandas (v3.0+, with Copy-on-Write enabled by default), SQL (PostgreSQL / DuckDB), PySpark, Polars, and specialized tools such as Great Expectations, pandera, Soda Core, RapidFuzz, Splink, lakeFS, DVC, ydata-profiling, and OpenRefine. Techniques are ordered from foundational tasks every analyst performs daily to advanced probabilistic and distributed workflows.


What This Cheat Sheet Covers

This topic spans 23 focused tables and 146 indexed concepts. The outline below lists every table, from foundational concepts through advanced details.

1. Missing Data Handling
2. Data Type Conversion
3. String Cleaning and Normalization
4. String Extraction and Splitting
5. Column and Index Management
6. Filtering and Sampling
7. Deduplication
8. Reshaping Data
9. Exploding and Flattening Nested Data
10. Joining and Merging
11. Aggregation and Grouping
12. Window and Rolling Operations
13. Date and Time Wrangling
14. Outlier Detection and Capping
15. Binning and Encoding
16. Method Chaining and Pipelines
17. Data Quality and Validation
18. Fuzzy Matching and Deduplication
19. Probabilistic Record Linkage
20. Lazy and Distributed Wrangling
21. Data Versioning and Lineage
22. Data Profiling
23. GREL Transformations (OpenRefine)

1. Missing Data Handling

| Technique | Example | Description |
| --- | --- | --- |
| dropna | df.dropna(subset=["col"]) | Drops rows (or columns with axis=1) where the specified fields are null. how="all" drops a row only if every value is missing. thresh=n keeps rows with at least n non-null values. |
| fillna | df["col"].fillna(df["col"].median()) | Replaces missing values with a scalar, dict, Series, or DataFrame. To propagate the last valid value forward or the next valid value backward, use df.ffill() / df.bfill(); fillna(method=...) has been deprecated since pandas 2.1. |
| interpolate | df["col"].interpolate(method="linear") | Fills gaps by interpolation. Methods include "linear", "time", "polynomial", and "spline" (the last two also require an order argument). Best for ordered numeric or time-series data. |
| pd.NA | df = df.convert_dtypes() | convert_dtypes() moves columns to nullable extension dtypes (for example Int64 and boolean), which use pd.NA rather than np.nan as the missing-value sentinel. pd.NA propagates through arithmetic and comparisons, and boolean operations follow three-valued logic. |
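
The four techniques in the table can be compared side by side on one small frame. The following is a minimal sketch, not a recommended pipeline: the DataFrame `raw` and its columns `sensor`, `reading`, and `count` are hypothetical and exist only for illustration. It uses `ffill()` / `bfill()` rather than `fillna(method=...)` so that it runs on recent pandas releases.

```python
import numpy as np
import pandas as pd

# Hypothetical sensor data; names and values are illustrative only.
raw = pd.DataFrame(
    {
        "sensor": ["a", "a", "a", "b", "b", None],
        "reading": [1.5, np.nan, 3.5, np.nan, 10.5, np.nan],
        "count": [1, 2, np.nan, 4, 5, np.nan],
    }
)

# dropna(subset=...): drop rows whose key column is null.
keyed = raw.dropna(subset=["sensor"])

# dropna(thresh=n): keep rows that have at least n non-null values.
dense = raw.dropna(thresh=2)

# fillna with a scalar derived from the column (median skips NaN).
median_filled = keyed["reading"].fillna(keyed["reading"].median())

# ffill carries the last valid value forward;
# bfill carries the next valid value backward.
forward = keyed["reading"].ffill()
backward = keyed["reading"].bfill()

# Linear interpolation treats the rows as equally spaced points.
linear = keyed["reading"].interpolate(method="linear")

# convert_dtypes moves columns to nullable extension dtypes,
# so the float column of whole numbers becomes Int64 with pd.NA.
nullable = keyed.convert_dtypes()

# Comparisons on a nullable column keep the missing entry as pd.NA
# instead of silently turning it into False.
above_two = nullable["count"] > 2

print(linear.tolist())
print(nullable.dtypes)
print(above_two.tolist())
```

Which fill strategy is appropriate depends on the data: a median fill ignores row order, whereas `ffill()`, `bfill()`, and `interpolate()` all assume the rows are meaningfully ordered.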
