Quantity vs Quality: Investigating the Trade-Off between Sample Size and Label Reliability

Timo Bertram; Johannes Fürnkranz; Martin Müller

arXiv:2204.09462·cs.LG·April 21, 2022·1 cites

Open Access

TL;DR

This paper investigates the trade-off between acquiring more training examples and improving label reliability through re-sampling, especially under noisy conditions, using both a poker hand application and controlled MNIST experiments.

Contribution

It introduces a systematic analysis of the sample size versus label quality trade-off and proposes validation strategies to enhance learning in noisy label scenarios.

Findings

1. Resampling becomes more beneficial as label noise increases.

2. Classifier performance declines with high levels of incorrect labels.

3. Proposed validation strategies improve label confidence estimation.

Abstract

In this paper, we study learning in probabilistic domains where the learner may receive incorrect labels but can improve the reliability of labels by repeatedly sampling them. In such a setting, one faces the question of whether a fixed budget for obtaining training examples is better spent on acquiring as many distinct examples as possible or on improving the label quality of a smaller number of examples by re-sampling their labels. We motivate this problem with an application that compares the strength of poker hands, where the training signal depends on the hidden community cards, and then study it in depth in an artificial setting where we insert controlled noise levels into the MNIST database. Our results show that with increasing levels of noise, resampling previous examples becomes increasingly important compared with obtaining new examples, as classifier performance deteriorates when the number…
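
The budget trade-off described in the abstract can be illustrated with a small calculation. The sketch below is not the paper's method: it assumes binary labels, an independent symmetric flip probability `noise` for every sampled label, and majority voting as the way to aggregate repeated samples. The function names `majority_vote_accuracy` and `budget_options` are hypothetical and chosen only for this illustration.

```python
from math import comb


def majority_vote_accuracy(noise: float, k: int) -> float:
    """Probability that a majority vote over k noisy binary labels is correct.

    Assumption (not from the paper): each sampled label is flipped
    independently with probability `noise`, and k is odd so ties cannot occur.
    """
    if k < 1 or k % 2 == 0:
        raise ValueError("k must be a positive odd integer")
    p_correct = 1.0 - noise
    # Sum the binomial probabilities of getting more than k/2 correct labels.
    return sum(
        comb(k, i) * p_correct**i * noise ** (k - i)
        for i in range(k // 2 + 1, k + 1)
    )


def budget_options(budget: int, noise: float, ks=(1, 3, 5, 7)):
    """For each k, return (k, distinct examples affordable, label reliability).

    One unit of budget buys one sampled label, so sampling every example
    k times leaves budget // k distinct examples.
    """
    return [(k, budget // k, majority_vote_accuracy(noise, k)) for k in ks]


if __name__ == "__main__":
    for k, n_examples, reliability in budget_options(budget=6000, noise=0.3):
        print(f"k={k}: {n_examples} distinct examples, "
              f"label reliability {reliability:.3f}")
```

Under these assumptions, sampling each label three times at 30% noise raises the per-example reliability from 0.7 to 0.784 while cutting the number of distinct examples to a third. Whether that exchange pays off for a given classifier is the empirical question the paper investigates.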

Taxonomy

Topics: Machine Learning and Data Classification · Imbalanced Data Classification Techniques · Data Stream Mining Techniques