Identifying Mislabeled Training Data
C. E. Brodley, M. A. Friedl

TL;DR
This paper uses classifiers built by a set of learning algorithms as noise filters that identify and remove mislabeled training instances, significantly improving classification accuracy for noise levels up to 30 percent.
Contribution
It proposes filtering the training data before learning, compares single-algorithm, majority vote, and consensus filters on five datasets prone to labeling errors, and evaluates the precision of each filter both analytically and empirically.
Findings
Filtering significantly improves classification accuracy for noise levels up to 30 percent.
Consensus filters are conservative: they discard little good data, at the expense of retaining more bad data.
Majority filters are more effective at detecting and removing bad data.
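The difference between the two voting rules can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `flag_mislabeled` and the toy data are hypothetical, and the sketch assumes that each classifier's prediction for an instance came from a model that was not trained on that instance (for example, via cross-validation).

```python
from typing import Hashable, List, Sequence


def flag_mislabeled(
    labels: Sequence[Hashable],
    predictions: Sequence[Sequence[Hashable]],
    scheme: str = "majority",
) -> List[int]:
    """Return the indices of instances flagged as mislabeled.

    labels[i] is the given (possibly noisy) label of instance i.
    predictions[i] holds one predicted label per filter classifier
    for instance i (assumed to be held-out predictions).
    """
    flagged = []
    for i, (given, preds) in enumerate(zip(labels, predictions)):
        # Count how many filter classifiers disagree with the given label.
        errors = sum(1 for p in preds if p != given)
        if scheme == "majority":
            # Flag when more than half of the classifiers disagree.
            is_noise = errors > len(preds) / 2
        elif scheme == "consensus":
            # Flag only when every classifier disagrees.
            is_noise = len(preds) > 0 and errors == len(preds)
        else:
            raise ValueError(f"unknown scheme: {scheme!r}")
        if is_noise:
            flagged.append(i)
    return flagged


# Toy data (hypothetical): four instances, three filter classifiers each.
labels = ["a", "a", "b", "b"]
predictions = [
    ["a", "a", "a"],  # all three agree with the given label
    ["b", "b", "a"],  # two of three disagree
    ["a", "a", "a"],  # all three disagree
    ["b", "a", "b"],  # one of three disagrees
]

majority_flags = flag_mislabeled(labels, predictions, "majority")
consensus_flags = flag_mislabeled(labels, predictions, "consensus")
print(majority_flags)   # [1, 2]
print(consensus_flags)  # [2]
```

On this toy input the consensus rule flags a subset of what the majority rule flags, which is the sense in which it is the more conservative filter. With a single classifier the two rules coincide, which corresponds to the single-algorithm filter.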
Abstract
This paper presents a new approach to identifying and eliminating mislabeled training instances for supervised learning. The goal of this approach is to improve classification accuracies produced by learning algorithms by improving the quality of the training data. Our approach uses a set of learning algorithms to create classifiers that serve as noise filters for the training data. We evaluate single algorithm, majority vote and consensus filters on five datasets that are prone to labeling errors. Our experiments illustrate that filtering significantly improves classification accuracy for noise levels up to 30 percent. An analytical and empirical evaluation of the precision of our approach shows that consensus filters are conservative at throwing away good data at the expense of retaining bad data and that majority filters are better at detecting bad data at the expense of throwing…
