VariErr NLI: Separating Annotation Error from Human Label Variation
Leon Weber-Genzel, Siyao Peng, Marie-Catherine de Marneffe, Barbara Plank

TL;DR
This paper introduces a methodology and dataset to distinguish annotation errors from human label variation in NLP, specifically in NLI, revealing that GPT-4 outperforms other automatic methods but still lags behind humans.
Contribution
It presents a systematic approach and a new dataset for separating annotation errors from human variation, and evaluates error detection methods including GPTs in this context.
Findings
GPT-4 outperforms the other automatic error detection (AED) methods evaluated, but still lags behind humans.
State-of-the-art AED methods significantly underperform GPTs and humans.
The methodology is applicable beyond NLI, aiding trustworthy NLP system development.
Abstract
Human label variation arises when annotators assign different labels to the same item for valid reasons, while annotation errors occur when labels are assigned for invalid reasons. These two issues are prevalent in NLP benchmarks, yet existing research has studied them in isolation. To the best of our knowledge, there exists no prior work that focuses on teasing apart error from signal, especially in cases where signal is beyond black-and-white. To fill this gap, we introduce a systematic methodology and a new dataset, VariErr (variation versus error), focusing on the NLI task in English. We propose a 2-round annotation procedure with annotators explaining each label and subsequently judging the validity of label-explanation pairs. VariErr contains 7,732 validity judgments on 1,933 explanations for 500 re-annotated MNLI items. We assess the effectiveness of various automatic error…
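The two-round procedure described in the abstract can be pictured with a small data model. The following is only a minimal sketch, not the authors' implementation: the class name `Explanation`, the function names, and in particular the aggregation rule (a label counts as valid variation if at least `min_valid` validity votes support one of its explanations, and as an error otherwise) are assumptions made for illustration. The reported counts (7,732 judgments on 1,933 explanations) are consistent with four judgments per explanation, which the example mirrors.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class Explanation:
    """One label-explanation pair from round 1, plus its round-2 validity votes."""
    label: str   # NLI label, e.g. "entailment", "neutral", "contradiction"
    text: str    # free-text reason the annotator gave for the label
    judgments: List[bool] = field(default_factory=list)  # True = judged valid


def is_validated(exp: Explanation, min_valid: int = 1) -> bool:
    # Hypothetical rule: an explanation is validated if at least
    # `min_valid` judges marked it as valid.
    return sum(exp.judgments) >= min_valid


def split_labels(
    explanations: List[Explanation], min_valid: int = 1
) -> Tuple[Set[str], Set[str]]:
    """Split the labels of one item into (plausible variation, likely errors).

    Assumption: a label is kept as variation if any of its explanations
    is validated; otherwise it is flagged as an annotation error.
    """
    variation: Set[str] = set()
    errors: Set[str] = set()
    for label in {e.label for e in explanations}:
        supporting = [e for e in explanations if e.label == label]
        if any(is_validated(e, min_valid) for e in supporting):
            variation.add(label)
        else:
            errors.add(label)
    return variation, errors


# Illustrative item with made-up explanations and votes (not from VariErr).
item = [
    Explanation("entailment", "The premise states this directly.",
                [True, True, False, True]),
    Explanation("neutral", "Misread the hypothesis.",
                [False, False, False, False]),
]
variation, errors = split_labels(item)
```

Raising `min_valid` makes the rule stricter, so more labels are flagged as errors; which threshold is appropriate depends on how the validity judgments are meant to be aggregated, which this sketch does not settle.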
Taxonomy
Topics: Machine Learning and Data Classification
Methods: Attention Is All You Need · Linear Layer · Byte Pair Encoding · Multi-Head Attention · Layer Normalization · Dropout · Softmax · Dense Connections · Label Smoothing · Adam
