Clean or Annotate: How to Spend a Limited Data Collection Budget

Derek Chen, Zhou Yu, and Samuel R. Bowman

arXiv:2110.08355·cs.CL·June 14, 2022

TL;DR

This paper proposes a budget-efficient method for dataset annotation that balances initial broad labeling with targeted relabeling of likely errors, improving data quality for machine learning.

Contribution

The paper introduces an approach that reserves part of the annotation budget for targeted relabeling of probable errors. At the same budget, this approach outperforms label aggregation and denoising methods.

Findings

01 Outperforms label aggregation and denoising methods at the same budget

02 Effective across multiple NLP tasks and model variations

03 Balances broad initial labeling with targeted error correction

Abstract

Crowdsourcing platforms are often used to collect datasets for training machine learning models, despite higher levels of inaccurate labeling compared to expert labeling. There are two common strategies to manage the impact of such noise. The first involves aggregating redundant annotations, but comes at the expense of labeling substantially fewer examples. Secondly, prior works have also considered using the entire annotation budget to label as many examples as possible and subsequently apply denoising algorithms to implicitly clean the dataset. We find a middle ground and propose an approach which reserves a fraction of annotations to explicitly clean up highly probable error samples to optimize the annotation process. In particular, we allocate a large portion of the labeling budget to form an initial dataset used to train a model. This model is then used to identify specific…
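
The procedure the abstract outlines (label broadly, train a model, then relabel the examples most likely to be wrong) can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names, the `relabel_fraction` parameter, and the use of a model's confidence in each current label as the error signal are all assumptions made for the example.

```python
from typing import Callable, List, Sequence, Tuple


def split_budget(total_budget: int, relabel_fraction: float) -> Tuple[int, int]:
    """Split an annotation budget into (initial labels, reserved relabels).

    relabel_fraction is a hypothetical knob; the source does not state a value.
    """
    if not 0.0 <= relabel_fraction <= 1.0:
        raise ValueError("relabel_fraction must be in [0, 1]")
    reserved = int(total_budget * relabel_fraction)
    return total_budget - reserved, reserved


def select_for_relabeling(label_confidence: Sequence[float], n_relabel: int) -> List[int]:
    """Return indices of the n_relabel examples whose current label is trusted least.

    label_confidence[i] is assumed to be the trained model's probability
    for the label currently assigned to example i.
    """
    ranked = sorted(range(len(label_confidence)), key=lambda i: label_confidence[i])
    return sorted(ranked[:n_relabel])


def clean_with_reserved_budget(
    labels: Sequence[str],
    label_confidence: Sequence[float],
    n_relabel: int,
    relabel: Callable[[int], str],
) -> List[str]:
    """Spend the reserved budget relabeling the most suspicious examples.

    relabel(i) stands in for sending example i back to an annotator.
    """
    cleaned = list(labels)
    for i in select_for_relabeling(label_confidence, n_relabel):
        cleaned[i] = relabel(i)
    return cleaned
```

For instance, `split_budget(1000, 0.25)` sets aside 250 annotations for relabeling and leaves 750 for the initial dataset. How large the reserved fraction should be, and how the model scores likely errors, are questions the paper itself addresses; the sketch only fixes the overall shape of the pipeline.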

Taxonomy

Topics: Machine Learning and Data Classification · Data Stream Mining Techniques · Anomaly Detection Techniques and Applications