Towards More Robust NLP System Evaluation: Handling Missing Scores in Benchmarks
Anas Himmi, Ekhine Irurozki, Nathan Noiry, Stephan Clemencon, and Pierre Colombo

TL;DR
This paper introduces a novel method for imputing missing NLP system scores in benchmarks using partial rankings and Borda count, enabling more realistic and comprehensive evaluations.
Contribution
It formalizes the problem of missing scores in NLP benchmarks and proposes a new compatible partial ranking approach, with two refinements for task-level and instance-level scores, along with an extended benchmark dataset.
Findings
The proposed method effectively imputes missing scores in NLP benchmarks.
Validation shows improved evaluation completeness and robustness.
The extended benchmark contains over 131 million scores, vastly larger than previous datasets.
Abstract
The evaluation of natural language processing (NLP) systems is crucial for advancing the field, but current benchmarking approaches often assume that all systems have scores available for all tasks, which is not always practical. In reality, several factors, such as the cost of running baselines, private systems, computational limitations, or incomplete data, may prevent some systems from being evaluated on entire tasks. This paper formalizes an existing problem in NLP research: benchmarking when some systems' scores are missing on some tasks, and proposes a novel approach to address it. Our method utilizes a compatible partial ranking approach to impute missing data, which is then aggregated using the Borda count method. It includes two refinements designed specifically for scenarios where either task-level or instance-level scores are available. We also introduce an extended benchmark, which…
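To make the idea of "compatible partial rankings aggregated with Borda count" concrete, here is a minimal sketch, not the authors' implementation. It assumes that, for each task, the systems with scores form a partial ranking, and that a missing system could sit at any position among them. It brute-forces every full ranking compatible with the observed order, averages each system's Borda points over those rankings, and sums the averages across tasks. The function names `expected_borda` and `aggregate`, the toy scores, and the assumption of no ties are all illustrative choices; the enumeration is factorial in the number of systems, so it only suits tiny examples.

```python
from itertools import permutations


def expected_borda(systems, observed):
    """Average Borda points per system over all full rankings that are
    compatible with the observed scores of one task.

    systems  -- list of all system names
    observed -- dict {system: score} for systems evaluated on this task
                (higher is better; ties are assumed not to occur)
    """
    n = len(systems)
    # Observed systems, best first: this is the partial ranking.
    ranked = sorted(observed, key=observed.get, reverse=True)
    totals = {s: 0.0 for s in systems}
    count = 0
    for perm in permutations(systems):
        pos = {s: i for i, s in enumerate(perm)}
        # Keep only rankings that preserve the observed relative order.
        if all(pos[a] < pos[b] for a, b in zip(ranked, ranked[1:])):
            count += 1
            for s in systems:
                totals[s] += n - 1 - pos[s]  # Borda points: top gets n-1
    return {s: totals[s] / count for s in systems}


def aggregate(systems, tasks):
    """Sum expected Borda points over tasks and rank systems by the total."""
    scores = {s: 0.0 for s in systems}
    for observed in tasks:
        for s, pts in expected_borda(systems, observed).items():
            scores[s] += pts
    ranking = sorted(systems, key=lambda s: scores[s], reverse=True)
    return ranking, scores


# Toy data (made up): system B has no score on the second task.
systems = ["A", "B", "C"]
tasks = [
    {"A": 0.9, "B": 0.8, "C": 0.7},
    {"A": 0.6, "C": 0.5},
]
ranking, scores = aggregate(systems, tasks)
print(ranking, scores)
```

On the second task, three full rankings are compatible with "A above C" (B first, B in the middle, B last), so B receives the average of 2, 1, and 0 points instead of being dropped or counted as a zero.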
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multi-Criteria Decision Making
