Model Evaluation Cheat Sheet

Updated 2026-04-28



Model evaluation is the systematic process of assessing machine learning model performance using quantitative metrics, validation strategies, and diagnostic techniques. It bridges the gap between training and deployment by answering whether a model generalizes well to unseen data rather than merely memorizing training patterns. The fundamental tension in evaluation is the bias-variance tradeoff: models must be complex enough to capture real patterns but simple enough to avoid fitting noise, and proper evaluation separates good models from dangerously overconfident ones.

What This Cheat Sheet Covers

This topic spans 24 focused tables and 123 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.

Table 1: Data Splitting Strategies
Table 2: Cross-Validation Techniques
Table 3: Classification Metrics — Binary Outcomes
Table 4: Classification Metrics — Probabilistic Outputs
Table 5: Confusion Matrix Components
Table 6: Multiclass Classification Metrics
Table 7: Multilabel Classification Metrics
Table 8: Regression Metrics
Table 9: Model Selection Criteria
Table 10: Overfitting and Underfitting Diagnostics
Table 11: Data Leakage Prevention
Table 12: Calibration and Reliability
Table 13: Hyperparameter Tuning Methods
Table 14: Statistical Significance Testing
Table 15: Bootstrapping for Confidence Intervals
Table 16: Model Evaluation Best Practices
Table 17: Imbalanced Data Evaluation
Table 18: Regression Diagnostics
Table 19: Advanced Evaluation Techniques
Table 20: Clustering Evaluation Metrics
Table 21: Ranking and Retrieval Metrics
Table 22: NLP and Text Generation Metrics
Table 23: Computer Vision Evaluation Metrics
Table 24: Fairness and Bias Metrics


Table 1: Data Splitting Strategies

Honest evaluation starts before you train anything, with how you carve the data into pieces the model is and isn't allowed to see. A simple random split works for plain tabular data where rows are independent, but the moment your data has structure (time order, repeated samples from the same patient, imbalanced classes) you need a split that respects it, or the test score becomes a flattering lie.
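To make the "patient groups" case concrete, here is a minimal sketch of a group-aware holdout split in plain Python. The helper `group_holdout_split` and the `patient_ids` data are hypothetical names made up for illustration. The idea is to shuffle and split the group labels rather than the rows, so every sample from a given patient lands on one side only.

```python
import random

def group_holdout_split(groups, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) so that no group appears in both sets.

    Hypothetical helper for illustration: test_fraction applies to the
    number of groups, not the number of rows.
    """
    unique_groups = sorted(set(groups))
    rng = random.Random(seed)          # seeded for a reproducible split
    rng.shuffle(unique_groups)
    n_test_groups = max(1, round(len(unique_groups) * test_fraction))
    test_groups = set(unique_groups[:n_test_groups])
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx

# Made-up data: 10 patients with 3 samples each.
patient_ids = [f"patient_{i // 3}" for i in range(30)]
train_idx, test_idx = group_holdout_split(patient_ids, test_fraction=0.2)

train_patients = {patient_ids[i] for i in train_idx}
test_patients = {patient_ids[i] for i in test_idx}
# No patient contributes samples to both sides of the split.
print(train_patients.isdisjoint(test_patients))
```

In practice scikit-learn's `GroupShuffleSplit` covers the same need; the hand-rolled version above is only meant to show what "respecting the groups" means.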

Method	Example	Description
Holdout split	`X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)`	• Divides dataset into separate training and test sets (commonly 70-30 or 80-20) • fast, but the performance estimate has high variance on small datasets • test set should be used only once, at the end.
Three-way split	`train 60%, validation 20%, test 20%`	• Adds validation set for hyperparameter tuning • prevents test-set contamination from tuning decisions • validation guides model selection, test estimates final performance.
Stratified split	`train_test_split(X, y, test_size=0.2, stratify=y)`	• Maintains class distribution proportions across splits • critical for imbalanced datasets to ensure minority classes appear in both sets.
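The 60/20/20 three-way split in the table can be sketched with two chained calls to `train_test_split`, which has no built-in three-way mode. The synthetic `X` and `y` below are made-up illustration data (1000 rows, 10% positives), and both calls stratify so the class ratio carries through to all three sets. This is one reasonable way to do it, not the only one.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced data for illustration: 1000 rows, 10% positive class.
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
y = np.array([1] * 100 + [0] * 900)

# Step 1: set aside 20% of all rows as the final test set.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Step 2: split the remaining 80% into train and validation.
# 0.25 of the remaining 80% equals 20% of the original data.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
print(y_train.mean(), y_val.mean(), y_test.mean())
```

Tune hyperparameters against the validation set only, and touch the test set once for the final estimate.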
