LightGBM (Light Gradient Boosting Machine) is a gradient boosting framework developed by Microsoft, introduced at NeurIPS 2017, and built for high-speed, memory-efficient training on large tabular datasets. It addresses the core bottleneck of traditional GBDT, the expensive exact-split search, through histogram-based learning, leaf-wise tree growth, Gradient-based One-Side Sampling (GOSS), and Exclusive Feature Bundling (EFB). The original paper reports training up to 20× faster than conventional GBDT implementations with comparable accuracy; the actual speedup over XGBoost depends on the dataset and on which XGBoost tree method is used. The critical mental model: unlike XGBoost's default depth-wise growth, LightGBM splits only the single leaf with the maximum loss reduction at each step. This converges faster, but it requires careful tuning of `num_leaves` and `min_data_in_leaf` to prevent deep, unbalanced trees that overfit small datasets.
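As a rough illustration of the two guards named above, the sketch below builds a parameter dictionary and checks it against the common rule of thumb that `num_leaves` should not exceed `2 ** max_depth`. The parameter names are real LightGBM parameters, but the chosen values and the helper `leaf_budget_ok` are assumptions made for this example, not tuned recommendations.

```python
# Illustrative starting configuration (values are assumptions, not tuned).
params = {
    "objective": "binary",
    "boosting": "gbdt",
    "num_leaves": 31,        # LightGBM default; main complexity control for leaf-wise trees
    "max_depth": 6,          # assumption: cap depth explicitly (the default -1 means no limit)
    "min_data_in_leaf": 20,  # LightGBM default; raising it is a common guard on small datasets
    "max_bin": 255,          # LightGBM default histogram bin count
}


def leaf_budget_ok(p):
    """Hypothetical helper: True when num_leaves fits in a tree of depth max_depth."""
    depth = p.get("max_depth", -1)
    if depth <= 0:  # no depth limit set, so there is nothing to compare against
        return True
    return p["num_leaves"] <= 2 ** depth


print(leaf_budget_ok(params))
```

A dictionary like this would typically be passed to `lightgbm.train(params, train_set)`; the check itself runs without LightGBM installed.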
What This Cheat Sheet Covers
This cheat sheet spans 16 focused tables and 95 indexed concepts. The table-by-table outline below runs from foundational concepts through advanced details.
Table 1: Core Architecture (Histogram-Based Learning and Leaf-Wise Growth)
LightGBM's speed advantage comes from two structural innovations that reshape how trees are built. Understanding these mechanisms makes every subsequent parameter decision more intuitive.
| Technique | Example | Description |
|---|---|---|
| Histogram-based learning | `max_bin=255  # default bin count` | Continuous features are discretized into integer bins (default 255), so the split search scans O(#bins) candidate thresholds per feature instead of O(#data); this is the primary source of LightGBM's speed advantage. |
| Leaf-wise (best-first) tree growth | `# grows the single leaf with max delta loss` | Expands the leaf that reduces loss the most at each step, rather than all leaves at a given depth; converges faster but can overfit without proper `num_leaves` and `min_data_in_leaf` guards. |
| Gradient-based One-Side Sampling (GOSS) | `data_sample_strategy='goss'  # LightGBM 4.0+; boosting='goss' in older versions` | Not enabled by default. Keeps all large-gradient instances (more informative) and randomly samples small-gradient ones, upweighting the sampled instances to compensate; maintains accuracy while training on a fraction of the data. The kept and sampled fractions are set by `top_rate` and `other_rate`. |
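The histogram row above can be made concrete with a small pure-Python sketch. It is not LightGBM's implementation: the function names `build_histogram` and `best_split`, the fixed `bin_edges`, and the simplified second-order gain with L2 term `lam` are all assumptions chosen for illustration. The point is that after one pass to fill the histogram, the threshold scan touches only the bins.

```python
from bisect import bisect_right


def build_histogram(values, grads, hess, bin_edges):
    """One pass over the rows: sum gradients and hessians per bin."""
    n_bins = len(bin_edges) + 1
    g_hist = [0.0] * n_bins
    h_hist = [0.0] * n_bins
    for v, g, h in zip(values, grads, hess):
        b = bisect_right(bin_edges, v)  # bin index in [0, n_bins)
        g_hist[b] += g
        h_hist[b] += h
    return g_hist, h_hist


def best_split(g_hist, h_hist, lam=1.0):
    """Scan bin boundaries only: cost is O(#bins), independent of row count."""
    g_total, h_total = sum(g_hist), sum(h_hist)
    parent_score = g_total ** 2 / (h_total + lam)
    best_bin, best_gain = None, 0.0
    g_left = h_left = 0.0
    for b in range(len(g_hist) - 1):  # candidate split: bins <= b go left
        g_left += g_hist[b]
        h_left += h_hist[b]
        g_right, h_right = g_total - g_left, h_total - h_left
        gain = (g_left ** 2 / (h_left + lam)
                + g_right ** 2 / (h_right + lam)
                - parent_score)
        if gain > best_gain:
            best_bin, best_gain = b, gain
    return best_bin, best_gain


# Toy data: negative gradients for small values, positive for large ones.
values = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
grads = [-1.0, -1.0, -1.0, 1.0, 1.0, 1.0]
hess = [1.0] * 6
bin_edges = [0.25, 0.5, 0.75]  # 4 bins instead of LightGBM's default 255

g_hist, h_hist = build_histogram(values, grads, hess, bin_edges)
split_bin, split_gain = best_split(g_hist, h_hist)
print(split_bin, split_gain)  # bin 1 (threshold 0.5), gain 4.5
```

With real data the bin edges would come from the feature's distribution rather than being fixed by hand, and the histogram would hold up to `max_bin` entries per feature.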