Data Analysis Cheat Sheet
Data analysis is the systematic process of inspecting, cleaning, transforming, and modeling data to discover useful information and support decision-making. It encompasses exploratory data analysis (EDA), data cleaning, wrangling, and feature engineering: the preparatory steps that turn raw data into analysis-ready datasets. While often associated with statistics and machine learning, data analysis fundamentally serves as the bridge between raw observations and actionable insights. A key insight: careful data preparation typically has a larger impact on model performance than algorithm selection, because simple models trained on well-prepared data often outperform complex models trained on poor data. Modern workflows additionally use SHAP-based feature importance, automated drift detection, and schema validation tools to keep pipelines robust and production-ready.
What This Cheat Sheet Covers
This cheat sheet spans 15 focused tables and 152 indexed concepts. The table-by-table outline below runs from foundational concepts through advanced details.
Table 1: Exploratory Data Analysis (EDA) Techniques
Before you clean or model anything, you look — and these are the moves that turn a raw DataFrame into an understood one. From quick describe() summaries to Q-Q plots, correlation heatmaps, and one-line automated profiling reports, each technique answers a different question about shape, spread, missingness, and how variables relate. Working from single variables up to multivariate views is the natural order to read them in.
| Technique | Example | Description |
|---|---|---|
| Descriptive statistics | `df.describe()`, `df.info()` | `describe()` computes count, mean, std, min, quartiles, and max for numerical columns; `info()` shows data types and non-null counts. |
| Value counts | `df['category'].value_counts(normalize=True)` | Counts the frequency of each unique value in a categorical column; `normalize=True` returns proportions instead of counts. |
| Univariate analysis | `df['age'].hist(bins=30)`, `df['category'].value_counts()` | Examines single-variable distributions using histograms, box plots, and frequency counts; reveals central tendency, spread, and outliers for one feature at a time. |
| Bivariate analysis | `df.plot.scatter(x='age', y='income')`, `df.groupby('region')['sales'].mean()` | Explores relationships between two variables through scatter plots, grouped aggregations, and cross-tabulations; identifies correlations and patterns. |
| Multivariate analysis | `sns.pairplot(df, hue='target')`, `df.corr(numeric_only=True)` | Examines interactions among three or more variables using pair plots, correlation matrices, and heatmaps; reveals complex dependencies. |
| Correlation analysis | `df.corr(numeric_only=True)`, `sns.heatmap(df.corr(numeric_only=True), annot=True)` | Calculates pairwise correlation coefficients (Pearson by default) between numerical features; the heatmap makes multicollinearity and feature relationships visible. `numeric_only=True` is needed in pandas 2.0+ when the frame also holds non-numeric columns. |
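
The non-plotting techniques in the table can be tried end to end on a small synthetic frame. The sketch below is only an illustration: the data is randomly generated, and the column names (`age`, `region`, `income`) and the linear age-to-income relationship are assumptions chosen to mirror the table's examples, not a real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names mirror those used in the table.
rng = np.random.default_rng(seed=0)
n = 200
df = pd.DataFrame({
    "age": rng.integers(18, 70, size=n),
    "region": rng.choice(["north", "south", "west"], size=n),
})
# Assumed relationship: income rises with age, plus noise,
# so the correlation step has something to find.
df["income"] = 20_000 + 800 * df["age"] + rng.normal(0, 5_000, size=n)

# Descriptive statistics for the numeric columns.
summary = df.describe()

# Value counts as proportions for a categorical column.
region_share = df["region"].value_counts(normalize=True)

# Bivariate view: grouped aggregation of a numeric column by a category.
mean_income_by_region = df.groupby("region")["income"].mean()

# Correlation matrix, restricted to numeric columns before correlating.
corr = df.select_dtypes(include="number").corr()

print(summary)
print(region_share)
print(mean_income_by_region)
print(corr)
```

Selecting numeric columns explicitly before calling `corr()` is one way to avoid errors when the frame mixes numeric and string columns; passing `numeric_only=True` is an alternative in recent pandas versions.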