TL;DR
This position paper argues that current NLU benchmarks let unreliable and biased systems score so highly that genuine improvements are hard to demonstrate. It proposes four criteria for better benchmarks and calls for progress in dataset design, annotation reliability, dataset size, and the handling of bias.
Contribution
It lays out four criteria that NLU benchmarks should meet and argues that adversarial data collection does not meaningfully address the reasons current benchmarks fall short of them.
Findings
Most benchmarks fail to meet the proposed criteria.
Adversarial data collection does not solve fundamental evaluation problems.
Improving benchmarks requires better dataset design, annotation, size, and bias handling.
Abstract
Evaluation for many natural language understanding (NLU) tasks is broken: Unreliable and biased systems score so highly on standard benchmarks that there is little room for researchers who develop better systems to demonstrate their improvements. The recent trend to abandon IID benchmarks in favor of adversarially-constructed, out-of-distribution test sets ensures that current models will perform poorly, but ultimately only obscures the abilities that we want our benchmarks to measure. In this position paper, we lay out four criteria that we argue NLU benchmarks should meet. We argue most current benchmarks fail at these criteria, and that adversarial data collection does not meaningfully address the causes of these failures. Instead, restoring a healthy evaluation ecosystem will require significant progress in the design of benchmark datasets, the reliability with which they are…
Videos
Data BAD | What Will it Take to Fix Benchmarking for NLU? · YouTube
