A Case for Library-Level k-Means Binning in Histogram Gradient-Boosted Trees
Asher Labovich

TL;DR
This paper proposes replacing quantile binning in histogram gradient-boosted trees with k-means-based binning, which can improve predictive performance with minimal overhead, especially on skewed data and when the bin budget is small.
Contribution
It introduces a k-means binning method for histogram GBDTs, justifies it with a proof that k-means maximizes worst-case explained variance, and demonstrates its effectiveness across diverse datasets.
Findings
K-means binning performs comparably to quantile binning on most datasets.
Significant MSE improvements are observed on skewed and synthetic datasets.
K-means binning recovers important split points overlooked by quantile binning.
Abstract
Modern Gradient Boosted Decision Trees (GBDTs) accelerate split finding with histogram-based binning, which reduces split-finding complexity by aggregating gradients into fixed-size bins. However, the predominant quantile binning strategy, designed to distribute data points evenly among bins, may overlook critical boundary values that could enhance predictive performance. In this work, we consider a novel approach that replaces quantile binning with a k-means discretizer initialized with quantile bins, and justify the swap with a proof showing that, for any L-Lipschitz function, k-means maximizes the worst-case explained variance of Y obtained when treating all values in a given bin as equivalent. We test this swap against quantile and uniform binning on 33 OpenML datasets plus synthetics that control for modality, skew, and bin budget. Across 18 regression datasets,…
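The binning idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the function names `quantile_centers` and `kmeans_bin_edges`, the iteration cap `n_iter`, and the toy data are all assumptions made for illustration. The sketch runs one-dimensional Lloyd's k-means on a single feature, starting from the means of equal-count (quantile) bins, and returns the midpoints between adjacent centroids as bin edges.

```python
from bisect import bisect_right


def quantile_centers(values, n_bins):
    """Initial centroids: the mean of each equal-count (quantile) bin."""
    xs = sorted(values)
    n = len(xs)
    centers = []
    for b in range(n_bins):
        chunk = xs[b * n // n_bins:(b + 1) * n // n_bins]
        if chunk:
            centers.append(sum(chunk) / len(chunk))
    # Heavy ties can produce duplicate centroids; keep distinct ones only.
    return sorted(set(centers))


def kmeans_bin_edges(values, n_bins, n_iter=50):
    """1-D Lloyd's k-means seeded with quantile bins (illustrative sketch).

    Returns bin edges as midpoints between adjacent final centroids.
    """
    xs = sorted(values)
    centers = quantile_centers(xs, n_bins)
    for _ in range(n_iter):
        # In 1-D, the nearest-centroid regions are separated by midpoints.
        edges = [(a + b) / 2 for a, b in zip(centers, centers[1:])]
        sums = [0.0] * len(centers)
        counts = [0] * len(centers)
        for x in xs:
            j = bisect_right(edges, x)  # index of the nearest centroid
            sums[j] += x
            counts[j] += 1
        # An empty cluster keeps its previous centroid.
        updated = [s / c if c else m
                   for s, c, m in zip(sums, counts, centers)]
        if updated == centers:
            break
        centers = updated
    return [(a + b) / 2 for a, b in zip(centers, centers[1:])]


# Toy skewed feature (assumed data): a dense bulk in [0, 1) plus a sparse tail.
values = [i / 100 for i in range(100)] + [50.0, 51.0, 52.0, 100.0, 101.0]
edges = kmeans_bin_edges(values, n_bins=3)
bin_of_tail_point = bisect_right(edges, 51.0)  # bin index for one tail value
print(edges, bin_of_tail_point)
```

On this toy feature, equal-count bins place every boundary inside the dense bulk, whereas the k-means edges include one boundary between the bulk and the tail. The sketch handles one feature at a time and ignores sample weights, missing values, and categorical features, which a library-level implementation would need to address.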
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Machine Learning and Data Classification · Advanced Neural Network Applications
