Don't Blame the Annotator: Bias Already Starts in the Annotation Instructions
Mihir Parmar, Swaroop Mishra, Mor Geva, Chitta Baral

TL;DR
This paper investigates how biases in crowdsourcing instructions influence the data collected for NLU benchmarks, leading to overfitting and poor generalization, and offers recommendations to mitigate such biases in future dataset creation.
Contribution
The paper identifies and analyzes instruction bias in NLU benchmarks, demonstrates its impact on model performance and generalization, and provides guidelines for better benchmark design.
Findings
Instruction bias propagates patterns from instructions to data.
Instruction bias can lead to overestimation of model performance.
The effect of instruction bias depends on pattern frequency and model size.
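To make the first finding concrete, here is a minimal sketch of how one might estimate how often a pattern from the instruction examples reappears in the collected data. It is not the authors' method: treating the opening n-gram of an example as its "pattern", the helper names extract_pattern and pattern_coverage, and the toy sentences are all assumptions made for illustration.

```python
def extract_pattern(text, n=2):
    """Return the first n lowercased tokens of a text as a crude 'pattern'.

    Assumption: an opening n-gram stands in for whatever pattern
    definition a real analysis would use.
    """
    return " ".join(text.lower().split()[:n])


def pattern_coverage(instruction_examples, collected_examples, n=2):
    """Fraction of collected examples whose opening n-gram also opens
    at least one instruction example."""
    if not collected_examples:
        return 0.0
    instruction_patterns = {extract_pattern(t, n) for t in instruction_examples}
    hits = sum(
        extract_pattern(t, n) in instruction_patterns for t in collected_examples
    )
    return hits / len(collected_examples)


# Toy, made-up data (not taken from any benchmark).
instruction_examples = ["How many field goals were scored in the first half?"]
collected_examples = [
    "How many touchdowns did the team score?",
    "How many yards was the longest pass?",
    "Who threw the first touchdown?",
    "What year was the stadium built?",
]

coverage = pattern_coverage(instruction_examples, collected_examples)
print(f"Share of collected examples matching an instruction pattern: {coverage:.2f}")
```

A high share under a measure like this would suggest that annotators are reusing the instruction's phrasing, which is the kind of over-representation the findings describe.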
Abstract
In recent years, progress in NLU has been driven by benchmarks. These benchmarks are typically collected by crowdsourcing, where annotators write examples based on annotation instructions crafted by dataset creators. In this work, we hypothesize that annotators pick up on patterns in the crowdsourcing instructions, which bias them to write many similar examples that are then over-represented in the collected data. We study this form of bias, termed instruction bias, in 14 recent NLU benchmarks, showing that instruction examples often exhibit concrete patterns, which are propagated by crowdworkers to the collected data. This extends previous work (Geva et al., 2019) and raises a new concern of whether we are modeling the dataset creator's instructions, rather than the task. Through a series of experiments, we show that, indeed, instruction bias can lead to overestimation of model…
Taxonomy
Topics: Topic Modeling · Adversarial Robustness in Machine Learning · Software Engineering Research
