Analyzing Dataset Annotation Quality Management in the Wild

Jan-Christoph Klie; Richard Eckart de Castilho; Iryna Gurevych

arXiv:2307.08153 · cs.CL · March 12, 2024 · 1 citation

TL;DR

This paper investigates how natural language dataset creators manage quality, analyzing 591 publications to assess adherence to recommended practices and identify common errors in annotation quality management.

Contribution

It provides a large-scale analysis of quality management practices in natural language dataset creation, highlighting adherence levels and common issues in current research.

Findings

01. A majority of the analyzed works follow good quality management practices.

02. 30% of the works have subpar quality management.

03. Common errors include issues with how inter-annotator agreement is used.
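For readers unfamiliar with inter-annotator agreement, the following is a minimal sketch of one widely used measure, Cohen's kappa for two annotators. It is an illustration only: the summary above does not say which agreement measures the surveyed publications used, and the function name `cohen_kappa` and the example labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each annotator's
    label distribution.
    """
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("Both annotators must label the same non-empty item set.")

    n = len(labels_a)

    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of the two annotators' marginal label rates,
    # summed over all labels.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one single label throughout; kappa is undefined.
        return float("nan")

    return (observed - expected) / (1.0 - expected)


# Hypothetical example: two annotators, four items, one disagreement.
annotator_a = ["pos", "pos", "neg", "neg"]
annotator_b = ["pos", "neg", "neg", "neg"]
print(cohen_kappa(annotator_a, annotator_b))
```

Raw percent agreement in this example is 0.75, but half of that is expected by chance given the label distributions, which is why chance-corrected measures are usually reported alongside or instead of raw agreement.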

Abstract

Data quality is crucial for training accurate, unbiased, and trustworthy machine learning models as well as for their correct evaluation. Recent works, however, have shown that even popular datasets used to train and evaluate state-of-the-art models contain a non-negligible amount of erroneous annotations, biases, or artifacts. While practices and guidelines regarding dataset creation projects exist, to our knowledge, large-scale analysis has yet to be performed on how quality management is conducted when creating natural language datasets and whether these recommendations are followed. Therefore, we first survey and summarize recommended quality management practices for dataset creation as described in the literature and provide suggestions for applying them. Then, we compile a corpus of 591 scientific publications introducing text datasets and annotate it for quality-related aspects,…

Taxonomy

Topics: Research Data Management Practices · Semantic Web and Ontologies · Scientific Computing and Data Management