Donkii: Can Annotation Error Detection Methods Find Errors in Instruction-Tuning Datasets?
Leon Weber-Genzel, Robert Litschko, Ekaterina Artemova, and Barbara Plank

TL;DR
This paper introduces DONKII, a novel benchmark for annotation error detection in instruction-tuning datasets for language models, demonstrating the importance of AED methods in improving data quality and model performance.
Contribution
It presents the first benchmark dataset for AED in instruction tuning, along with a taxonomy of error types and evaluation of AED methods for language generation tasks.
Findings
All datasets contain clear errors that can propagate into models
Model size and AED method choice significantly affect error detection performance
Practical recommendations for using AED to improve instruction-tuning data
Abstract
Instruction tuning has become an integral part of training pipelines for Large Language Models (LLMs) and has been shown to yield strong performance gains. In an orthogonal line of research, Annotation Error Detection (AED) has emerged as a tool for detecting quality problems in gold standard labels. So far, however, the application of AED methods has been limited to classification tasks. It is an open question how well AED methods generalize to language generation settings, which are becoming more widespread via LLMs. In this paper, we present a first and novel benchmark for AED on instruction tuning data: DONKII. It comprises three instruction-tuning datasets enriched with error annotations by experts and semi-automatic methods. We also provide a novel taxonomy of error types for instruction-tuning data. We find that all three datasets contain clear errors, which sometimes propagate…
Taxonomy
Topics: Natural Language Processing Techniques · Text Readability and Simplification · Topic Modeling
