On the Necessity of Irrelevant Variables
David P. Helmbold, Philip M. Long

TL;DR
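This paper studies how relevant and irrelevant boolean variables affect classifier accuracy. Under the assumption that variables are conditionally independent given the class, it shows that algorithms relying predominantly on irrelevant variables can have error probabilities that quickly go to 0 in settings where algorithms that limit the use of irrelevant variables have error bounded below by a positive constant.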
Contribution
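For a natural family of learning algorithms, and for sources whose relevant variables have only a small advantage over random guessing, it demonstrates that limiting the use of irrelevant variables can hurt accuracy: the algorithms that achieve vanishing error are the ones that rely predominantly on irrelevant variables.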
Findings
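Algorithms that rely predominantly on irrelevant variables have error probabilities that quickly go to 0.
In the same situations, algorithms that limit the use of irrelevant variables have error bounded below by a positive constant.
Accurate learning is possible even with so few examples that one cannot determine with high confidence whether any individual variable is relevant.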
Abstract
This work explores the effects of relevant and irrelevant boolean variables on the accuracy of classifiers. The analysis uses the assumption that the variables are conditionally independent given the class, and focuses on a natural family of learning algorithms for such sources when the relevant variables have a small advantage over random guessing. The main result is that algorithms relying predominately on irrelevant variables have error probabilities that quickly go to 0 in situations where algorithms that limit the use of irrelevant variables have errors bounded below by a positive constant. We also show that accurate learning is possible even when there are so few examples that one cannot determine with high confidence whether or not any individual variable is relevant.
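The setting described in the abstract can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the paper's exact algorithm or parameter regime: the function names, the ±1 encoding, the threshold parameter `beta`, and all default values are hypothetical choices made for illustration. Each relevant variable is assumed to agree with the class label with probability 1/2 + `gamma`, each irrelevant variable with probability 1/2, and the learner lets a variable vote only if its empirical agreement with the label is at least `beta` away from 1/2.

```python
import random

def draw_example(probs, rng):
    """Return (label, features); feature i equals the label with probability probs[i]."""
    y = rng.choice((-1, 1))
    x = [y if rng.random() < p else -y for p in probs]
    return y, x

def learn_votes(train, beta):
    """Assign each variable a vote of +1, -1, or 0 (excluded).

    A variable votes +1 if its empirical agreement rate with the label is at
    least 1/2 + beta, -1 if it is at most 1/2 - beta, and is left out otherwise.
    With beta = 0 every variable votes, including the irrelevant ones.
    """
    m = len(train)
    n = len(train[0][1])
    votes = []
    for i in range(n):
        rate = sum(1 for y, x in train if x[i] == y) / m
        if rate >= 0.5 + beta:
            votes.append(1)
        elif rate <= 0.5 - beta:
            votes.append(-1)
        else:
            votes.append(0)
    return votes

def predict(votes, x):
    """Signed majority vote; ties are broken toward +1."""
    total = sum(v * xi for v, xi in zip(votes, x))
    return 1 if total >= 0 else -1

def error_rate(votes, probs, n_test, rng):
    """Fraction of fresh examples that the voting classifier gets wrong."""
    wrong = 0
    for _ in range(n_test):
        y, x = draw_example(probs, rng)
        if predict(votes, x) != y:
            wrong += 1
    return wrong / n_test

def run(n_relevant=200, n_irrelevant=200, gamma=0.2, m=25,
        beta=0.0, n_test=200, seed=0):
    """Train on m examples and estimate test error (all defaults are illustrative)."""
    rng = random.Random(seed)
    probs = [0.5 + gamma] * n_relevant + [0.5] * n_irrelevant
    train = [draw_example(probs, rng) for _ in range(m)]
    votes = learn_votes(train, beta)
    return votes, error_rate(votes, probs, n_test, rng)

if __name__ == "__main__":
    for beta in (0.0, 0.1, 0.2, 0.3):
        votes, err = run(beta=beta)
        used = sum(1 for v in votes if v != 0)
        print(f"beta={beta:.1f}  variables voting={used}  test error={err:.3f}")
```

Raising `beta` makes the learner more selective, so fewer irrelevant variables vote, but relevant variables with unlucky training counts are dropped as well. Varying `gamma`, `m`, and the number of irrelevant variables gives a rough feel for the trade-off the abstract describes; the script makes no claim to reproduce the paper's bounds.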
Taxonomy
Topics: Machine Learning and Data Classification · Machine Learning and Algorithms · Neural Networks and Applications
