XGBoost (eXtreme Gradient Boosting) is a highly optimized, scalable implementation of gradient-boosted decision trees that consistently ranks among the top-performing algorithms in structured-data competitions and production ML systems. It handles regression, classification, ranking, and survival problems by sequentially fitting trees to the gradient of the loss (the residuals, in the squared-error case). A second-order Taylor expansion of the loss makes split finding fast, and explicit penalty terms on tree complexity provide strong regularization. The key mental model: XGBoost is not one algorithm but a framework. Every major behavior, from tree structure to sampling to the objective function, is configurable, and nearly every real-world win comes from understanding which lever to pull first.
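
As a rough sketch of what the second-order expansion refers to, the objective minimized at boosting round $t$ is usually written as below. The symbols are not defined elsewhere in this cheat sheet and are introduced here only for illustration: $l$ is the per-sample loss, $f_t$ the tree added at round $t$, $T$ its number of leaves, $w$ its leaf weights, and $\gamma$, $\lambda$ the complexity penalties.

$$
\mathcal{L}^{(t)} \approx \sum_{i=1}^{n} \left[ g_i\, f_t(x_i) + \tfrac{1}{2}\, h_i\, f_t(x_i)^2 \right] + \gamma T + \tfrac{1}{2} \lambda \sum_{j=1}^{T} w_j^2,
\qquad
g_i = \frac{\partial\, l\big(y_i, \hat{y}_i^{(t-1)}\big)}{\partial\, \hat{y}_i^{(t-1)}},
\quad
h_i = \frac{\partial^2 l\big(y_i, \hat{y}_i^{(t-1)}\big)}{\partial\, \big(\hat{y}_i^{(t-1)}\big)^2}
$$

Under this approximation, the best weight for leaf $j$ has the closed form $w_j^{*} = -G_j / (H_j + \lambda)$, where $G_j$ and $H_j$ are the sums of $g_i$ and $h_i$ over the samples in that leaf. For squared-error loss $l = \tfrac{1}{2}(y_i - \hat{y}_i)^2$, $g_i$ is the negative residual and $h_i = 1$, which is where the "fitting trees to residuals" picture comes from.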
What This Cheat Sheet Covers
This topic spans 19 focused tables and 126 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.
Table 1: DMatrix — Data Loading and Construction
The DMatrix is XGBoost's native data container; all training, evaluation, and prediction flows through it. Feeding data via DMatrix rather than raw arrays enables efficient internal compression and avoids redundant work across boosting rounds.
| Method | Example | Description |
|---|---|---|
| From NumPy array | `dtrain = xgb.DMatrix(X, label=y)` | Wraps a NumPy array or Pandas DataFrame with optional `label`, `weight`, and `base_margin`. |
| From Pandas DataFrame | `dtrain = xgb.DMatrix(df[feats], label=df['y'])` | Accepts a `pd.DataFrame`; column names are preserved as feature names automatically. |
| From SciPy sparse matrix | `dtrain = xgb.DMatrix(csr_matrix)` | Accepts `scipy.sparse.csr_matrix`; implicit zeros are treated as missing, not as the value 0, so convert to dense if zeros are real values. |
| `missing` | `dtrain = xgb.DMatrix(X, label=y, missing=np.nan)` | Explicitly declares which value should be treated as missing; default is `np.nan`. |
| `weight` | `dtrain = xgb.DMatrix(X, label=y, weight=w)` | Per-sample training weights; higher weights increase a sample's influence on gradient updates. |
| `base_margin` | `dtrain = xgb.DMatrix(X, label=y, base_margin=prior_scores)` | Per-sample initial prediction offset (raw margin, before the link function); overrides `base_score` when provided; used to warm-start from another model's output. |