Data Analysis Cheat Sheet

Data analysis is the systematic process of inspecting, cleaning, transforming, and modeling data to discover useful information and support decision-making. It encompasses exploratory data analysis (EDA), data cleaning, wrangling, and feature engineering—the critical preparatory steps that transform raw data into analysis-ready datasets. While often associated with statistics and machine learning, data analysis fundamentally serves as the bridge between raw observations and actionable insights. A key insight: spending adequate time on quality data preparation typically has a larger impact on model performance than algorithm selection itself—well-prepared data enables simpler models to outperform complex ones trained on poor data. Modern workflows additionally use SHAP-based feature importance, automated drift detection, and schema validation tools to ensure pipelines remain robust and production-ready.

What This Cheat Sheet Covers

This cheat sheet spans 15 focused tables and 152 indexed concepts. The table-by-table outline below runs from foundational concepts through advanced details.

Table 1: Exploratory Data Analysis (EDA) Techniques
Table 2: Data Cleaning Techniques
Table 3: Outlier Detection and Treatment
Table 4: Data Transformation and Scaling
Table 5: Feature Encoding for Categorical Variables
Table 6: Feature Engineering Techniques
Table 7: Dimensionality Reduction and Feature Selection
Table 8: Data Aggregation and Grouping
Table 9: Data Reshaping and Merging
Table 10: Statistical Analysis and Hypothesis Testing
Table 11: Data Wrangling Operations
Table 12: Data Quality and Validation
Table 13: Data Visualization for Analysis
Table 14: Sampling and Balancing Techniques
Table 15: Automated EDA and Monitoring Tools

Table 1: Exploratory Data Analysis (EDA) Techniques

Before you clean or model anything, you look — and these are the moves that turn a raw DataFrame into an understood one. From quick describe() summaries to Q-Q plots, correlation heatmaps, and one-line automated profiling reports, each technique answers a different question about shape, spread, missingness, and how variables relate. Working from single variables up to multivariate views is the natural order to read them in.

Technique	Example	Description
summary statistics	`df.describe()` `df.info()`	• Computes descriptive measures—count, mean, std, min, quartiles, max for numerical columns • `info()` shows data types and null counts.
value counts	`df['category'].value_counts(normalize=True)`	• Counts frequency of unique values in categorical columns • `normalize=True` returns proportions instead of counts.
univariate analysis	`df['age'].hist(bins=30)` `df['category'].value_counts()`	• Examines single variable distributions using histograms, box plots, and frequency counts • reveals central tendency, spread, and outliers for one feature at a time.
bivariate analysis	`df.plot.scatter(x='age', y='income')` `df.groupby('region')['sales'].mean()`	• Explores relationships between two variables through scatter plots, grouped aggregations, and cross-tabulations • identifies correlations and patterns.
multivariate analysis	`sns.pairplot(df, hue='target')` `df.corr(numeric_only=True)`	• Examines interactions among three or more variables using pair plots, correlation matrices, and heatmaps • reveals complex dependencies.
correlation analysis	`df.corr(numeric_only=True)` `sns.heatmap(df.corr(numeric_only=True), annot=True)`	• Calculates pairwise correlation coefficients between numerical features • heatmap visualization reveals multicollinearity and feature relationships • `numeric_only=True` is needed in pandas 2.0+ when the DataFrame contains non-numeric columns, which otherwise raise an error.
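
The non-plotting techniques in Table 1 can be chained into one short first pass. The sketch below is a minimal illustration, not a prescribed workflow: the DataFrame is a synthetic toy dataset, its values are randomly generated, and the column names (`age`, `income`, `region`, `category`) are assumptions borrowed from the table's examples. It also assumes pandas 1.5 or later, where `corr` accepts `numeric_only`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset; column names mirror the examples in Table 1.
rng = np.random.default_rng(seed=0)
n = 200
df = pd.DataFrame({
    "age": rng.integers(18, 70, size=n),
    "region": rng.choice(["north", "south", "west"], size=n),
    "category": rng.choice(["a", "b"], size=n, p=[0.7, 0.3]),
})
# Income is built as a linear function of age plus noise,
# so the correlation step below has a real relationship to find.
df["income"] = 20_000 + 800 * df["age"] + rng.normal(0, 5_000, size=n)

# Summary statistics: describe() covers only numeric columns by default.
summary = df.describe()

# Value counts as proportions rather than raw counts.
category_share = df["category"].value_counts(normalize=True)

# Bivariate views: a grouped aggregation and a cross-tabulation.
mean_income_by_region = df.groupby("region")["income"].mean()
region_by_category = pd.crosstab(df["region"], df["category"])

# Correlation matrix restricted to numeric columns.
corr = df.corr(numeric_only=True)

print(summary)
print(category_share)
print(mean_income_by_region)
print(region_by_category)
print(corr)
```

Reading the outputs in this order follows the univariate-to-multivariate progression: `summary` and `category_share` describe one column at a time, the grouped mean and cross-tabulation relate two columns, and `corr` compares every numeric pair at once.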
