On Efficient and Statistical Quality Estimation for Data Annotation

Jan-Christoph Klie; Juan Haladjian; Marc Kirchner; Rahul Nair

arXiv:2405.11919 · cs.LG · May 30, 2024

TL;DR

This paper explores statistical methods for efficiently estimating annotation quality in datasets, proposing confidence intervals and acceptance sampling to reduce inspection costs while maintaining accuracy.

Contribution

The paper introduces a novel application of acceptance sampling to annotation quality estimation, significantly reducing the sample sizes needed.
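To make the idea of acceptance sampling concrete, here is a minimal sketch of a textbook single-sampling plan: inspect n annotations and accept the batch if at most c of them are wrong. It is not necessarily the procedure used in the paper. The function names, the error rates `p_good` and `p_bad`, and the risk levels `alpha` and `beta` are assumptions chosen for illustration.

```python
from math import comb


def binom_cdf(c, n, p):
    """P(X <= c) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(c + 1))


def single_sampling_plan(p_good, p_bad, alpha=0.05, beta=0.10, max_n=1000):
    """Find the smallest sample size n (with acceptance number c) such that

    - a batch with error rate p_good is accepted with prob. >= 1 - alpha, and
    - a batch with error rate p_bad  is accepted with prob. <= beta.

    All parameter names and defaults are illustrative assumptions.
    Returns (n, c), or None if no plan exists with n <= max_n.
    """
    for n in range(1, max_n + 1):
        for c in range(n + 1):
            if binom_cdf(c, n, p_bad) > beta:
                break  # a larger c only raises the acceptance probability
            if binom_cdf(c, n, p_good) >= 1 - alpha:
                return n, c
    return None


if __name__ == "__main__":
    plan = single_sampling_plan(p_good=0.01, p_bad=0.10)
    print(plan)  # (sample size, maximum number of tolerated errors)
```

The search scans sample sizes in increasing order, so the first plan that satisfies both risk constraints is also the cheapest one to inspect.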

Findings

01. Acceptance sampling can cut sample sizes by up to 50%.

02. Confidence intervals help determine minimal sample sizes for error estimation.

03. The methods ensure reliable quality estimates with fewer annotations.
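As an illustration of how a confidence interval can fix a minimal sample size, the sketch below uses the standard normal-approximation (Wald) formula n = z² · p(1 − p) / e², where e is the desired half-width of the interval and p is a prior guess of the error rate. This is a common textbook choice and an assumption here; the paper may rely on other interval methods. The function name and parameters are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def wald_sample_size(margin, confidence=0.95, expected_error_rate=0.5):
    """Smallest n so that a Wald interval for a proportion has half-width
    at most `margin`, under the normal approximation.

    expected_error_rate=0.5 is the worst case (largest variance) and is
    the usual default when nothing is known about the true error rate.
    """
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p = expected_error_rate
    return ceil(z**2 * p * (1 - p) / margin**2)


if __name__ == "__main__":
    print(wald_sample_size(0.05))  # 385
    print(wald_sample_size(0.05, expected_error_rate=0.1))  # 139
```

With no prior knowledge, estimating the error rate to within ±5 percentage points at 95% confidence takes 385 inspected instances; if the error rate is expected to be near 10%, 139 suffice under the same approximation.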

Abstract

Annotated datasets are an essential ingredient to train, evaluate, compare and productionalize supervised machine learning models. It is therefore imperative that annotations are of high quality. For their creation, good quality management and thereby reliable quality estimates are needed. Then, if quality is insufficient during the annotation process, rectifying measures can be taken to improve it. Quality estimation is often performed by having experts manually label instances as correct or incorrect. But checking all annotated instances tends to be expensive. Therefore, in practice, usually only subsets are inspected; sizes are chosen mostly without justification or regard to statistical power and more often than not, are relatively small. Basing estimates on small sample sizes, however, can lead to imprecise values for the error rate. Using unnecessarily large sample sizes costs…

Taxonomy

Topics: Rough Sets and Fuzzy Logic · Data Mining Algorithms and Applications · Data Management and Algorithms