Revisiting Randomization in Greedy Model Search
Xin Chen, Jason M. Klusowski, Yan Shuo Tan, Chang Yu

TL;DR
This paper analyzes how feature subsampling affects the bias and variance of greedy forward selection. Under an orthogonal design, ensembling over subsampled features can reduce bias as well as variance, so its benefit goes beyond variance reduction alone.
Contribution
It provides a theoretical analysis of feature subsampling in greedy model search, showing that ensembling can reduce both bias and variance and characterizing the asymptotic reweighting of coefficients.
Findings
Under an orthogonal design, ensembling with feature subsampling can reduce both bias and variance.
Training error and degrees of freedom can be non-monotonic in the subsampling rate.
The estimator adaptively reweights coefficients based on their rank, approximated by a logistic function.
Abstract
Feature subsampling is a core component of random forests and other ensemble methods. While recent theory suggests that this randomization acts solely as a variance reduction mechanism analogous to ridge regularization, these results largely rely on base learners optimized via ordinary least squares. We investigate the effects of feature subsampling on greedy forward selection, a model that better captures the adaptive nature of decision trees. Assuming an orthogonal design, we prove that ensembling with feature subsampling can reduce both bias and variance, contrasting with the pure variance reduction of convex base learners. More precisely, we show that both the training error and degrees of freedom can be non-monotonic in the subsampling rate, breaking the analogy with standard shrinkage methods like the lasso or ridge regression. Furthermore, we characterize the exact asymptotic…
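The setting described in the abstract can be illustrated with a minimal sketch. It assumes an orthonormal design (columns of `X` satisfy `X.T @ X = I`) and a simplified subsampling scheme in which each base learner draws one random feature subset of size `m` and then runs `k` steps of forward selection inside it. The paper's exact randomization scheme may differ; the function names `forward_select` and `subsampled_ensemble`, and all parameter values, are hypothetical choices made for illustration.

```python
import numpy as np


def forward_select(X, y, k, candidates):
    """k-step greedy forward selection restricted to `candidates`.

    Assumes orthonormal columns. In that case the residual correlation of
    every unselected feature stays equal to X_j^T y, so the greedy path
    simply picks the k candidates with the largest |X_j^T y|.
    """
    scores = X.T @ y  # per-feature OLS coefficients under orthonormality
    candidates = np.asarray(candidates)
    order = candidates[np.argsort(-np.abs(scores[candidates]))]
    chosen = order[:k]
    beta = np.zeros(X.shape[1])
    beta[chosen] = scores[chosen]
    return beta


def subsampled_ensemble(X, y, k, m, n_learners, rng):
    """Average forward-selection fits, each on a random subset of m features.

    Simplifying assumption: one subset is drawn per base learner.
    """
    p = X.shape[1]
    betas = [
        forward_select(X, y, k, rng.choice(p, size=m, replace=False))
        for _ in range(n_learners)
    ]
    return np.mean(betas, axis=0)


# Synthetic example (all values are illustrative assumptions).
rng = np.random.default_rng(0)
n, p, k = 200, 20, 5
X, _ = np.linalg.qr(rng.standard_normal((n, p)))  # orthonormal columns
beta_true = np.linspace(3.0, 0.0, p)
y = X @ beta_true + 0.5 * rng.standard_normal(n)

beta_single = forward_select(X, y, k, np.arange(p))  # no subsampling
beta_ens = subsampled_ensemble(X, y, k, m=10, n_learners=500, rng=rng)
```

In this simplified variant, each ensemble coefficient equals the feature's OLS coefficient multiplied by the fraction of base learners that selected it. That fraction depends only on the feature's rank by `|X_j^T y|`, which gives a rough intuition for the rank-based reweighting listed under Findings.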
Taxonomy
Topics: Data Mining Algorithms and Applications · Data Management and Algorithms · Machine Learning and Data Classification
