XGBoost (eXtreme Gradient Boosting) is a highly optimized, scalable implementation of gradient-boosted decision trees that consistently ranks among the top-performing algorithms in structured-data competitions and production ML systems. It handles regression, classification, ranking, and survival problems by sequentially fitting each new tree to the gradients of the loss at the current predictions (the residuals, in the squared-error case). A second-order Taylor expansion of the loss yields closed-form leaf weights and split gains, which keeps training fast and lets regularization terms enter the objective directly. The key mental model: XGBoost is not one algorithm but a framework. Every major behavior, from tree structure to sampling to the objective function, is configurable, and most real-world wins come from knowing which lever to pull first.
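As a rough sketch of what the second-order expansion buys, the objective for the tree added at round $t$ can be written as below. The symbols are notation chosen here for illustration, not taken from this cheat sheet: $g_i$ and $h_i$ are the first and second derivatives of the loss with respect to the previous prediction $\hat{y}_i^{(t-1)}$, $f_t$ is the new tree, $T$ is its number of leaves, $w_j$ is the weight of leaf $j$, and $\gamma$, $\lambda$ are the complexity and L2 penalties.

$$
\mathcal{L}^{(t)} \approx \sum_{i=1}^{n} \left[ g_i\, f_t(x_i) + \tfrac{1}{2}\, h_i\, f_t(x_i)^2 \right] + \gamma T + \tfrac{1}{2} \lambda \sum_{j=1}^{T} w_j^2
$$

Grouping samples by the leaf $I_j$ they fall into, with $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} h_i$, the optimal leaf weight has a closed form:

$$
w_j^{*} = -\frac{G_j}{H_j + \lambda}
$$

For squared error $\ell = \tfrac{1}{2}(y_i - \hat{y}_i)^2$, this gives $g_i = \hat{y}_i - y_i$ and $h_i = 1$, so each leaf weight is the sum of residuals in the leaf divided by the leaf size plus $\lambda$, that is, a shrunken mean residual.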
What This Cheat Sheet Covers
This topic spans 19 focused tables and 126 indexed concepts. Below is a complete table-by-table outline of this topic, spanning foundational concepts through advanced details.
Table 1: DMatrix - Data Loading and Construction
The DMatrix is XGBoost's native data container; all training, evaluation, and prediction flows through it. Feeding data via DMatrix rather than raw arrays enables efficient internal compression and avoids redundant work across boosting rounds.
| Method | Example | Description |
|---|---|---|
| NumPy array | `dtrain = xgb.DMatrix(X, label=y)` | Wraps a NumPy array or Pandas DataFrame with optional `label`, `weight`, and `base_margin`. |
| Pandas DataFrame | `dtrain = xgb.DMatrix(df[feats], label=df['y'])` | Accepts a `pd.DataFrame`; column names are preserved as feature names automatically. |
| SciPy sparse matrix | `dtrain = xgb.DMatrix(csr_matrix)` | Accepts `scipy.sparse.csr_matrix`; implicit (unstored) zeros are treated as missing, not as the value 0. Convert to dense if zeros are real values. |
| `missing` | `dtrain = xgb.DMatrix(X, label=y, missing=np.nan)` | Explicitly declares which value should be treated as missing; default is `np.nan`. |
| `weight` | `dtrain = xgb.DMatrix(X, label=y, weight=w)` | Per-sample training weights; higher weights increase a sample's influence on gradient updates. |
| `base_margin` | `dtrain = xgb.DMatrix(X, label=y, base_margin=prior_scores)` | Per-sample initial prediction offset (raw margin, before the link function); overrides `base_score` when provided; used to warm-start from another model's output. |