A Survey of Parameters Associated with the Quality of Benchmarks in NLP
Swaroop Mishra, Anjana Arunkumar, Chris Bryan, Chitta Baral

TL;DR
This paper surveys parameters associated with bias in NLP benchmarks, examining how bias properties vary across datasets and tasks, as a first step towards a metric that quantifies benchmark quality.
Contribution
The paper surveys bias-related parameters in NLP benchmarks and proposes an initial set of parameters for quantifying benchmark quality, which current bias-mitigation workarounds do not provide.
Findings
Bias parameters vary across datasets and tasks
Existing bias metrics are limited and task-specific
Proposed parameters can help develop a benchmark quality metric
Abstract
Several benchmarks have been built with heavy investment in resources to track our progress in NLP. Thousands of papers published in response to those benchmarks have competed to top leaderboards, with models often surpassing human performance. However, recent studies have shown that models triumph over several popular benchmarks just by overfitting on spurious biases, without truly learning the desired task. Despite this finding, benchmarking, while trying to tackle bias, still relies on workarounds, which do not fully utilize the resources invested in benchmark creation, due to the discarding of low quality data, and cover limited sets of bias. A potential solution to these issues -- a metric quantifying quality -- remains underexplored. Inspired by successful quality indices in several domains such as power, food, and water, we take the first step towards a metric by identifying…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Software Engineering Research
