ML for Tabular Data Cheat Sheet

Updated 2026-05-18

Machine learning for tabular data sits at the intersection of traditional statistics and modern deep learning. Neural networks reign supreme in image and text domains, but tabular data presents its own challenges (heterogeneous feature types, missing values, varied scales, and complex feature interactions), and here tree-based gradient boosting methods still dominate Kaggle competitions and production systems. This cheat sheet covers the full spectrum: from XGBoost hyperparameter tuning and CatBoost's native categorical handling to tabular neural networks such as FT-Transformer and TabNet, plus critical preprocessing techniques, explainability methods, and the practical engineering decisions that separate toy models from production-ready systems.

What This Cheat Sheet Covers

This cheat sheet spans 24 focused tables and 124 indexed concepts. The table-by-table outline below runs from foundational concepts through advanced details.

Table 1: Gradient Boosting Libraries
Table 2: Core XGBoost Hyperparameters
Table 3: LightGBM-Specific Features
Table 4: CatBoost Advantages
Table 5: Tree-Based vs Deep Learning for Tabular
Table 6: Categorical Encoding Methods
Table 7: Missing Value Strategies
Table 8: Feature Importance Techniques
Table 9: Handling Class Imbalance
Table 10: Feature Selection Methods
Table 11: Cross-Validation Strategies
Table 12: Regularization Techniques
Table 13: Probability Calibration
Table 14: Tabular Neural Networks
Table 15: Advanced Hyperparameter Tuning
Table 16: Monotonic and Interaction Constraints
Table 17: Model Explainability and Interpretability
Table 18: Data Leakage Prevention
Table 19: Outlier Detection and Handling
Table 20: GPU Acceleration
Table 21: Model Deployment Optimizations
Table 22: Quantile Regression and Uncertainty
Table 23: Memory and Speedups
Table 24: Stacking and Ensemble Methods

Table 1: Gradient Boosting Libraries

The three dominant gradient boosting libraries each bring distinct optimizations and design philosophies. XGBoost introduced an explicitly regularized objective and sparsity-aware split finding, LightGBM introduced histogram-based splitting and leaf-wise growth for speed, and CatBoost handles categorical features natively without preprocessing. The right choice depends on dataset size, categorical cardinality, hardware constraints, and whether you need GPU acceleration or automatic handling of categories.

Library	Example	Description
XGBoost	`import xgboost as xgb` `model = xgb.XGBClassifier()` `model.fit(X_train, y_train)`	• Most mature library with extensive hyperparameter control, strong L1/L2 regularization (`alpha`/`lambda`, or `reg_alpha`/`reg_lambda` in the scikit-learn API), sparsity-aware split finding for missing values, and excellent documentation • level-wise (depth-wise) tree growth by default, which keeps trees balanced
LightGBM	`import lightgbm as lgb` `model = lgb.LGBMClassifier()` `model.fit(X_train, y_train)`	• Typically the fastest to train on large datasets, thanks to histogram-based binning and leaf-wise growth • offers gradient-based one-side sampling (GOSS) to reduce the samples scanned and exclusive feature bundling (EFB) to reduce effective dimensionality • usually a lower memory footprint than XGBoost
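
To make "histogram-based binning" and the L2 term concrete, here is a minimal pure-Python sketch of split finding on a single feature. It is an illustration under stated assumptions, not the implementation used by either library: the function name `histogram_best_split` is hypothetical, the bins are equal-width for simplicity (real libraries choose bin edges differently), and the gain is the standard second-order form with an L2 penalty `reg_lambda` and no minimum-gain threshold.

```python
from typing import List, Optional, Tuple


def histogram_best_split(
    values: List[float],
    gradients: List[float],
    hessians: List[float],
    n_bins: int = 8,
    reg_lambda: float = 1.0,
) -> Tuple[Optional[float], float]:
    """Return (threshold, gain) for the best split of one feature.

    Hypothetical helper for illustration only; equal-width bins are an assumption.
    Rows with value < threshold go to the left child.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return None, 0.0  # constant feature: nothing to split on
    width = (hi - lo) / n_bins

    # Pass 1: accumulate gradient and hessian sums per bin (the "histogram").
    grad_hist = [0.0] * n_bins
    hess_hist = [0.0] * n_bins
    for v, g, h in zip(values, gradients, hessians):
        b = min(int((v - lo) / width), n_bins - 1)  # clamp the max value into the last bin
        grad_hist[b] += g
        hess_hist[b] += h

    g_total, h_total = sum(grad_hist), sum(hess_hist)

    def score(g: float, h: float) -> float:
        # L2-regularized leaf score: larger reg_lambda shrinks the reward for a split.
        return g * g / (h + reg_lambda)

    # Pass 2: scan the n_bins - 1 bin boundaries instead of every distinct value.
    best_threshold: Optional[float] = None
    best_gain = 0.0
    g_left = h_left = 0.0
    for b in range(n_bins - 1):
        g_left += grad_hist[b]
        h_left += hess_hist[b]
        g_right, h_right = g_total - g_left, h_total - h_left
        if h_left == 0 or h_right == 0:
            continue  # one side would be empty
        gain = 0.5 * (score(g_left, h_left) + score(g_right, h_right) - score(g_total, h_total))
        if gain > best_gain:
            best_threshold, best_gain = lo + (b + 1) * width, gain
    return best_threshold, best_gain


# Toy data: the target flips from 0 to 1 between x = 4 and x = 5.
# With squared-error loss and an initial prediction of 0.5,
# gradient = prediction - target and hessian = 1.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [0, 0, 0, 0, 1, 1, 1, 1]
grad = [0.5 - t for t in y]
hess = [1.0] * len(x)
print(histogram_best_split(x, grad, hess))  # (4.5, 0.8)
```

The cost of the scan depends on the number of bins rather than the number of distinct feature values, which is where the speed and memory savings on large datasets come from. Raising `reg_lambda` lowers the gain of every candidate split, so fewer splits clear any minimum-gain threshold.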
