Cross-Validated Variable Selection in Tree-Based Methods Improves Predictive Performance
Amichai Painsky, Saharon Rosset

TL;DR
This paper introduces a cross-validation based variable selection method for tree models that enhances predictive accuracy and effectively utilizes categorical variables with many categories, addressing a key limitation of traditional tree algorithms.
Contribution
The paper proposes a leave-one-out (LOO) cross-validation approach for selecting the splitting variable in trees, which improves predictive performance and lets the model use categorical variables with many categories.
Findings
Significant performance improvements in tree and ensemble models.
Effective utilization of categorical variables with many categories.
Comparable computational complexity to CART for classification tasks.
Abstract
Recursive partitioning approaches producing tree-like models are a long-standing staple of predictive modeling, in the last decade mostly as "sub-learners" within state-of-the-art ensemble methods like Boosting and Random Forest. However, a fundamental flaw in the partitioning (or splitting) rule of commonly used tree building methods precludes them from treating different types of variables equally. This most clearly manifests in these methods' inability to properly utilize categorical variables with a large number of categories, which are ubiquitous in the new age of big data. Such variables can often be very informative, but current tree methods essentially leave us a choice of either not using them, or exposing our models to severe overfitting. We propose a conceptual framework to splitting using leave-one-out (LOO) cross-validation for selecting the splitting variable, then…
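The contrast the abstract draws can be illustrated with a small sketch. It is not the paper's algorithm: it assumes a regression setting, a simplified multiway split (one child per category), and a fallback to the global leave-one-out mean for categories with a single observation. The function names `in_sample_sse`, `loo_sse`, and `select_variable` are hypothetical, as is the toy data.

```python
import numpy as np

def in_sample_sse(x, y):
    """Squared error when each category predicts with its own in-sample mean."""
    sse = 0.0
    for c in np.unique(x):
        yc = y[x == c]
        sse += np.sum((yc - yc.mean()) ** 2)
    return sse

def loo_sse(x, y):
    """Leave-one-out squared error: each point is predicted from the other
    points in its category. Singleton categories fall back to the mean of
    all other observations (an assumption made for this sketch)."""
    n = len(y)
    total = y.sum()
    sse = 0.0
    for c in np.unique(x):
        yc = y[x == c]
        m = len(yc)
        if m > 1:
            pred = (yc.sum() - yc) / (m - 1)
        else:
            pred = (total - yc) / (n - 1)
        sse += np.sum((yc - pred) ** 2)
    return sse

def select_variable(X, y, score):
    """Return the index of the column with the lowest score, plus all scores."""
    scores = [score(X[:, j], y) for j in range(X.shape[1])]
    return int(np.argmin(scores)), scores

# Toy data (hypothetical): column 0 is a binary variable that drives y,
# column 1 is an ID-like categorical with one category per row (pure noise).
rng = np.random.default_rng(0)
n = 200
informative = np.arange(n) % 2
row_id = rng.permutation(n)
y = 2.0 * informative + rng.normal(0.0, 1.0, size=n)
X = np.column_stack([informative, row_id])

naive_choice, naive_scores = select_variable(X, y, in_sample_sse)
loo_choice, loo_scores = select_variable(X, y, loo_sse)
print("in-sample criterion picks column", naive_choice)
print("LOO criterion picks column", loo_choice)
```

In this setup the in-sample criterion selects the ID-like column, because giving every row its own category drives the training error to exactly zero. The LOO score charges that column for its inability to predict held-out points, so the informative binary variable should be selected instead.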
