Clean or Annotate: How to Spend a Limited Data Collection Budget
Derek Chen, Zhou Yu, and Samuel R. Bowman

TL;DR
This paper proposes a budget-efficient method for dataset annotation that balances initial broad labeling with targeted relabeling of likely errors, improving data quality for machine learning.
Contribution
The paper introduces an approach that reserves part of the annotation budget for targeted relabeling of probable errors, and reports that it outperforms label aggregation and denoising methods given the same budget.
Findings
Outperforms label aggregation and denoising methods at the same budget
Effective across multiple NLP tasks and model variations
Balances broad initial labeling with targeted error correction
Abstract
Crowdsourcing platforms are often used to collect datasets for training machine learning models, despite higher levels of inaccurate labeling compared to expert labeling. There are two common strategies to manage the impact of such noise. The first involves aggregating redundant annotations, but comes at the expense of labeling substantially fewer examples. Secondly, prior works have also considered using the entire annotation budget to label as many examples as possible and subsequently apply denoising algorithms to implicitly clean the dataset. We find a middle ground and propose an approach which reserves a fraction of annotations to explicitly clean up highly probable error samples to optimize the annotation process. In particular, we allocate a large portion of the labeling budget to form an initial dataset used to train a model. This model is then used to identify specific…
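The budget split described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the nearest-centroid "model", the `suspicion` score, the one-dimensional synthetic data, and the function names `split_budget`, `fit_centroids`, and `annotate_then_clean` are all assumptions chosen to keep the example self-contained. The `relabel` callable is likewise a hypothetical stand-in for whatever more reliable labeling source is used for the reserved portion of the budget.

```python
import random

def split_budget(total_budget, relabel_fraction):
    """Split a labeling budget into (initial annotations, reserved relabels)."""
    reserved = int(total_budget * relabel_fraction)
    return total_budget - reserved, reserved

def fit_centroids(xs, ys):
    """Toy stand-in for 'train a model': the mean feature value per class."""
    groups = {}
    for x, y in zip(xs, ys):
        groups.setdefault(y, []).append(x)
    return {label: sum(vals) / len(vals) for label, vals in groups.items()}

def suspicion(centroids, x, y):
    """Score how likely (x, y) is mislabeled: positive when x sits closer
    to another class's centroid than to the centroid of its given label."""
    others = [abs(x - m) for label, m in centroids.items() if label != y]
    if not others:  # only one class observed, nothing to compare against
        return 0.0
    return abs(x - centroids[y]) - min(others)

def annotate_then_clean(pool, annotate, relabel, total_budget, relabel_fraction):
    """Label broadly, rank likely errors with the model, relabel the top ones."""
    n_initial, n_relabel = split_budget(total_budget, relabel_fraction)
    xs = list(pool[:n_initial])
    ys = [annotate(x) for x in xs]            # noisy initial labels
    centroids = fit_centroids(xs, ys)         # model trained on the noisy set
    ranked = sorted(range(len(xs)),
                    key=lambda i: suspicion(centroids, xs[i], ys[i]),
                    reverse=True)
    for i in ranked[:n_relabel]:              # spend the reserved budget
        ys[i] = relabel(xs[i])
    return xs, ys

# Synthetic usage: two classes separated at zero, 20% simulated label noise.
rng = random.Random(0)
pool = [rng.gauss(rng.choice([-1.0, 1.0]), 0.5) for _ in range(500)]
true_label = lambda x: int(x > 0)
noisy = lambda x: true_label(x) if rng.random() > 0.2 else 1 - true_label(x)
xs, ys = annotate_then_clean(pool, noisy, true_label,
                             total_budget=400, relabel_fraction=0.25)
errors = sum(y != true_label(x) for x, y in zip(xs, ys))
```

The sketch treats `relabel_fraction` as a free parameter; how large a share to reserve is exactly the trade-off the paper studies, so the value 0.25 here is arbitrary.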
Taxonomy
Topics: Machine Learning and Data Classification · Data Stream Mining Techniques · Anomaly Detection Techniques and Applications
