When Is the First Spurious Variable Selected by Sequential Regression Procedures?
Weijie J. Su

TL;DR
This paper investigates when three sequential regression procedures (forward stepwise, the lasso, and least angle regression) select their first spurious variable. It shows that this happens earlier than practitioners expect, and earlier still as the regression coefficients become denser.
Contribution
The paper derives a rigorous, sharp prediction of the rank of the first spurious variable and introduces a visualization tool, the double-ranking diagram, intended to improve on sequential methods.
Findings
The first spurious variable enters the ranking earlier as the regression coefficients become denser.
The lasso and least angle regression are provably equivalent in the early stages of their solution paths.
The counterintuitive early selection of spurious variables persists even for statistically independent Gaussian random designs and arbitrarily large true effect sizes.
Abstract
Applied statisticians use sequential regression procedures to produce a ranking of explanatory variables and, in settings of low correlations between variables and strong true effect sizes, expect that variables at the very top of this ranking are truly relevant to the response. In a regime of certain sparsity levels, however, three examples of sequential procedures--forward stepwise, the lasso, and least angle regression--are shown to include the first spurious variable unexpectedly early. We derive a rigorous, sharp prediction of the rank of the first spurious variable for these three procedures, demonstrating that the first spurious variable occurs earlier and earlier as the regression coefficients become denser. This counterintuitive phenomenon persists for statistically independent Gaussian random designs and an arbitrarily large magnitude of the true effects. We gain a better…
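The quantity studied in the abstract, the rank of the first spurious variable, can be measured in a small simulation. The sketch below is an illustration only and is not taken from the paper: the function names, the problem sizes, the column scaling, and the signal amplitude are all assumptions, and forward stepwise is implemented in a plain textbook form (at each step, select the variable whose residualized column is most correlated with the current residual).

```python
import numpy as np


def first_spurious_rank(X, y, support):
    """1-based step at which forward stepwise first selects a variable
    outside `support`; None if every variable is in `support`."""
    n, p = X.shape
    support = {int(j) for j in support}
    R = np.array(X, dtype=float)   # columns residualized on selected variables
    r = np.array(y, dtype=float)   # current residual of the response
    active = np.zeros(p, dtype=bool)
    for step in range(1, min(n, p) + 1):
        norms = np.linalg.norm(R, axis=0)
        norms[active] = np.inf               # avoid dividing by ~0 for selected columns
        score = np.abs(R.T @ r) / norms      # partial correlation with the residual
        score[active] = -np.inf              # never reselect a variable
        j = int(np.argmax(score))
        if j not in support:
            return step
        q = R[:, j] / np.linalg.norm(R[:, j])
        r -= q * (q @ r)                     # project the selected direction out of y
        R -= np.outer(q, q @ R)              # ... and out of every column
        active[j] = True
    return None


def simulate(n=300, p=500, k=20, amplitude=100.0, seed=0):
    """Independent Gaussian design with k equal, strong true effects (assumed setup)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p)) / np.sqrt(n)   # columns have norm close to 1
    beta = np.zeros(p)
    beta[:k] = amplitude
    y = X @ beta + rng.standard_normal(n)
    return first_spurious_rank(X, y, range(k))


if __name__ == "__main__":
    # Compare how early the first spurious variable enters as k grows.
    for k in (5, 20, 50, 100):
        ranks = [simulate(k=k, seed=s) for s in range(20)]
        print(k, np.mean(ranks))
```

A rank of k + 1 means all k true variables were selected before any spurious one. Averaging the returned rank over seeds for several values of k gives a rough empirical view of how the first false selection moves as the coefficient vector becomes denser. The numbers depend on the assumed n, p, and amplitude.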
Taxonomy
Topics: Advanced Statistical Methods and Models · Optimal Experimental Design Methods · Advanced Multi-Objective Optimization Algorithms
