Targeting the Benchmark: On Methodology in Current Natural Language Processing Research
David Schlangen

TL;DR
This paper critically examines the methodology behind creating and using benchmarks in NLP research, highlighting the need for clearer progress criteria and better evaluation practices.
Contribution
It analyzes current benchmarking practices in NLP, proposing a framework to better understand and evaluate progress in the field.
Findings
Benchmark work often leaves implicit why improving on a benchmark constitutes progress, and progress towards what
Baseline models are frequently used without critical evaluation
The paper suggests improved methodologies for benchmarking
Abstract
It has become a common pattern in our field: One group introduces a language task, exemplified by a dataset, which they argue is challenging enough to serve as a benchmark. They also provide a baseline model for it, which then soon is improved upon by other groups. Often, research efforts then move on, and the pattern repeats itself. What is typically left implicit is the argumentation for why this constitutes progress, and progress towards what. In this paper, we try to step back for a moment from this pattern and work out possible argumentations and their parts.
