Quantity vs Quality: Investigating the Trade-Off between Sample Size and Label Reliability
Timo Bertram, Johannes Fürnkranz, Martin Müller

TL;DR
This paper investigates the trade-off between acquiring more training examples and improving label reliability through re-sampling, especially under noisy conditions, using both a poker hand application and controlled MNIST experiments.
Contribution
It introduces a systematic analysis of the sample size versus label quality trade-off and proposes validation strategies to enhance learning in noisy label scenarios.
Findings
Resampling becomes more beneficial as label noise increases.
Classifier performance declines with high levels of incorrect labels.
Proposed validation strategies improve label confidence estimation.
Abstract
In this paper, we study learning in probabilistic domains where the learner may receive incorrect labels but can improve the reliability of labels by repeatedly sampling them. In such a setting, one faces the problem of whether the fixed budget for obtaining training examples should rather be used for obtaining all different examples or for improving the label quality of a smaller number of examples by re-sampling their labels. We motivate this problem in an application to compare the strength of poker hands where the training signal depends on the hidden community cards, and then study it in depth in an artificial setting where we insert controlled noise levels into the MNIST database. Our results show that with increasing levels of noise, resampling previous examples becomes increasingly more important than obtaining new examples, as classifier performance deteriorates when the number…
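The budget trade-off described in the abstract can be illustrated with a small calculation. The sketch below is not the authors' method. It assumes binary labels, independent label draws that are each correct with a fixed probability, and aggregation by majority vote; the function names `majority_vote_accuracy` and `budget_options` are hypothetical. It shows how spending a fixed budget on k draws per example reduces the number of distinct examples while raising the reliability of each aggregated label.

```python
from math import comb


def majority_vote_accuracy(p_correct: float, k: int) -> float:
    """Probability that a majority vote over k independent binary
    label draws is correct (assumption: each draw is correct with
    probability p_correct; ties for even k are split by a fair coin)."""
    total = 0.0
    for correct in range(k + 1):
        prob = comb(k, correct) * p_correct**correct * (1 - p_correct) ** (k - correct)
        if 2 * correct > k:
            total += prob          # strict majority of correct draws
        elif 2 * correct == k:
            total += 0.5 * prob    # tie, resolved at random
    return total


def budget_options(budget: int, p_correct: float, ks=(1, 3, 5, 9)):
    """For each number of draws k per example, return a tuple
    (k, number of distinct examples affordable, aggregated label accuracy)."""
    return [(k, budget // k, majority_vote_accuracy(p_correct, k)) for k in ks]


if __name__ == "__main__":
    # Illustrative numbers only: 900 label draws, each correct 70% of the time.
    for k, n_examples, acc in budget_options(900, 0.7):
        print(f"k={k}: {n_examples} examples, label accuracy {acc:.3f}")
```

Under these assumptions, the calculation only quantifies label reliability; which split yields the better classifier depends on the learner and the data, which is the question the paper studies empirically.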
Taxonomy
Topics: Machine Learning and Data Classification · Imbalanced Data Classification Techniques · Data Stream Mining Techniques
