Feature Importance in Gradient Boosting Trees with Cross-Validation Feature Selection
Afek Ilay Adler, Amichai Painsky

TL;DR
This paper investigates the bias in feature importance measures of Gradient Boosting Machines caused by decision tree base learners and proposes a cross-validation approach to reduce this bias without sacrificing predictive performance.
Contribution
It introduces a cross-validation based method to correct bias in GBM feature importance measures, improving interpretability while maintaining accuracy.
Findings
Feature importance in GBM implementations is significantly biased towards high-cardinality categorical variables, even when predictive performance is high.
Using cross-validated base learners reduces bias in feature importance measures.
The proposed method improves feature importance reliability across synthetic and real datasets.
Abstract
Gradient Boosting Machines (GBM) are among the go-to algorithms for tabular data, producing state-of-the-art results in many prediction tasks. Despite its popularity, the GBM framework suffers from a fundamental flaw in its base learners. Specifically, most implementations use decision trees that are typically biased towards categorical variables with large cardinalities. The effect of this bias has been studied extensively over the years, mostly in terms of predictive performance. In this work, we extend the scope and study the effect of biased base learners on GBM feature importance (FI) measures. We show that although these implementations demonstrate highly competitive predictive performance, they still, surprisingly, suffer from bias in FI. By utilizing cross-validated (CV) unbiased base learners, we fix this flaw at a relatively low computational cost. We demonstrate the…
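The bias described in the abstract can be illustrated with a small, self-contained sketch. It is not the authors' algorithm: it looks at a single regression split (one stump) rather than a full GBM, uses squared-error reduction as a stand-in for gain-based feature importance, and the helper names (`fit_split`, `heldout_gain`, `cv_gain`) are hypothetical. Both features are pure noise, so an unbiased importance measure should rate them about equally near zero. The in-sample gain favours the 50-level feature, while a gain measured on held-out folds does not reward it.

```python
import random

def avg(values):
    return sum(values) / len(values)

def sse(values):
    """Sum of squared errors around the mean (0 for an empty list)."""
    if not values:
        return 0.0
    m = avg(values)
    return sum((v - m) ** 2 for v in values)

def fit_split(x, y):
    """Best binary partition of categories by in-sample SSE reduction.
    Categories are sorted by mean target and every prefix is tried."""
    groups = {}
    for xi, yi in zip(x, y):
        groups.setdefault(xi, []).append(yi)
    order = sorted(groups, key=lambda c: avg(groups[c]))
    total = sse(y)
    best_left, best_gain = set(), 0.0
    for k in range(1, len(order)):
        left_cats = set(order[:k])
        left = [yi for xi, yi in zip(x, y) if xi in left_cats]
        right = [yi for xi, yi in zip(x, y) if xi not in left_cats]
        gain = total - sse(left) - sse(right)
        if gain > best_gain:
            best_left, best_gain = left_cats, gain
    return best_left, best_gain

def heldout_gain(x_tr, y_tr, x_te, y_te):
    """Fit the split on training data, score its SSE reduction on held-out data."""
    left_cats, _ = fit_split(x_tr, y_tr)
    left = [yi for xi, yi in zip(x_tr, y_tr) if xi in left_cats]
    right = [yi for xi, yi in zip(x_tr, y_tr) if xi not in left_cats]
    base = avg(y_tr)
    mu_left = avg(left) if left else base
    mu_right = avg(right) if right else base
    before = sum((yi - base) ** 2 for yi in y_te)
    after = sum(
        (yi - (mu_left if xi in left_cats else mu_right)) ** 2
        for xi, yi in zip(x_te, y_te)
    )
    return before - after  # can be negative when the split does not generalise

def cv_gain(x, y, k=5):
    """Sum of held-out gains over k interleaved folds."""
    n = len(y)
    total = 0.0
    for f in range(k):
        tr = [i for i in range(n) if i % k != f]
        te = [i for i in range(n) if i % k == f]
        total += heldout_gain(
            [x[i] for i in tr], [y[i] for i in tr],
            [x[i] for i in te], [y[i] for i in te],
        )
    return total

rng = random.Random(0)
n = 500
y = [rng.gauss(0, 1) for _ in range(n)]         # target unrelated to either feature
x_binary = [rng.randrange(2) for _ in range(n)]  # noise feature, 2 levels
x_wide = [rng.randrange(50) for _ in range(n)]   # noise feature, 50 levels

_, gain_binary = fit_split(x_binary, y)
_, gain_wide = fit_split(x_wide, y)
cv_binary = cv_gain(x_binary, y)
cv_wide = cv_gain(x_wide, y)

print("in-sample gain  binary:", round(gain_binary, 2), " wide:", round(gain_wide, 2))
print("held-out gain   binary:", round(cv_binary, 2), " wide:", round(cv_wide, 2))
```

The design choice is that the split is chosen on one part of the data and its usefulness is judged on another, so a feature cannot earn importance merely by offering many candidate partitions to overfit.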
Taxonomy
Topics: Machine Learning and Data Classification · Explainable Artificial Intelligence (XAI) · Imbalanced Data Classification Techniques
