Data Analysis Cheat Sheet
Data analysis is the systematic process of inspecting, cleaning, transforming, and modeling data to discover useful information and support decision-making. It encompasses exploratory data analysis (EDA), data cleaning, wrangling, and feature engineering: the critical preparatory steps that transform raw data into analysis-ready datasets. While often associated with statistics and machine learning, data analysis fundamentally serves as the bridge between raw observations and actionable insights. A key insight is that spending adequate time on quality data preparation typically has a larger impact on model performance than algorithm selection itself; well-prepared data enables simpler models to outperform complex ones trained on poor data. Modern workflows additionally use SHAP-based feature importance, automated drift detection, and schema validation tools to ensure pipelines remain robust and production-ready.
What This Cheat Sheet Covers
This cheat sheet spans 15 focused tables and 152 indexed concepts. The table-by-table outline below runs from foundational concepts through advanced details.
Table 1: Exploratory Data Analysis (EDA) Techniques
| Technique | Example | Description |
|---|---|---|
| Summary statistics | `df.describe()`; `df.info()` | `describe()` computes descriptive measures (count, mean, std, min, quartiles, max) for numerical columns; `info()` shows data types and non-null counts. |
| Value counts | `df['category'].value_counts(normalize=True)` | Counts the frequency of unique values in a categorical column; `normalize=True` returns proportions instead of counts. |
| Univariate analysis | `df['age'].hist(bins=30)`; `df['category'].value_counts()` | Examines the distribution of a single variable using histograms, box plots, and frequency counts; reveals central tendency, spread, and outliers one feature at a time. |
| Bivariate analysis | `df.plot.scatter(x='age', y='income')`; `df.groupby('region')['sales'].mean()` | Explores relationships between two variables through scatter plots, grouped aggregations, and cross-tabulations; identifies correlations and patterns. |
| Multivariate analysis | `sns.pairplot(df, hue='target')`; `df.corr()` | Examines interactions among three or more variables using pair plots, correlation matrices, and heatmaps; reveals complex dependencies. |
| Correlation analysis | `df.corr()`; `sns.heatmap(df.corr(), annot=True)` | Calculates pairwise correlation coefficients between numerical features; the heatmap visualization reveals multicollinearity and feature relationships. On pandas 2.x, pass `numeric_only=True` to `corr()` if `df` contains non-numeric columns. |
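
The non-plotting calls from Table 1 can be tried end to end on a small frame. The sketch below is only an illustration: the data values are invented, the column names `age`, `income`, and `region` are borrowed from the table's examples, and `numeric_only=True` assumes pandas 1.5 or newer.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration only.
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29],
    "income": [30000, 42000, 61000, 68000, 50000, 36000],
    "region": ["north", "south", "north", "east", "south", "north"],
})

# Summary statistics: count, mean, std, min, quartiles, max per numeric column.
summary = df.describe()

# Value counts: share of rows per region (proportions sum to 1).
region_share = df["region"].value_counts(normalize=True)

# Bivariate view: mean income within each region.
income_by_region = df.groupby("region")["income"].mean()

# Correlation: Pearson coefficients, restricted to the numeric columns
# so the string column 'region' does not raise an error.
corr = df.corr(numeric_only=True)

print(summary)
print(region_share)
print(income_by_region)
print(corr)
```

Restricting `corr()` to numeric columns keeps the call safe on mixed-type frames; the plotting calls in the table (`hist`, `plot.scatter`, `sns.pairplot`, `sns.heatmap`) would take the same `df` but additionally need matplotlib and seaborn installed.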