Towards More Robust NLP System Evaluation: Handling Missing Scores in Benchmarks

Anas Himmi, Ekhine Irurozki, Nathan Noiry, Stephan Clemencon, and Pierre Colombo

arXiv:2305.10284 · cs.CL · May 18, 2023 · 2 cites

Open Access

TL;DR

This paper introduces a novel method for imputing missing NLP system scores in benchmarks using partial rankings and Borda count, enabling more realistic and comprehensive evaluations.

Contribution

It formalizes the problem of missing scores in NLP benchmarks and proposes a new compatible partial ranking approach with refinements, along with an extensive new benchmark dataset.

Findings

01. The proposed method effectively imputes missing scores in NLP benchmarks.

02. Validation shows improved evaluation completeness and robustness.

03. Extended benchmark contains over 131 million scores, vastly larger than previous datasets.

Abstract

The evaluation of natural language processing (NLP) systems is crucial for advancing the field, but current benchmarking approaches often assume that all systems have scores available for all tasks, which is not always practical. In reality, several factors, such as the cost of running baselines, private systems, computational limitations, or incomplete data, may prevent some systems from being evaluated on entire tasks. This paper formalizes an existing problem in NLP research: benchmarking when some systems' scores are missing on some tasks, and proposes a novel approach to address it. Our method uses a compatible partial ranking approach to impute missing data, which is then aggregated using the Borda count method. It includes two refinements designed specifically for scenarios where either task-level or instance-level scores are available. We also introduce an extended benchmark, which…
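To make the idea in the abstract concrete, here is a minimal brute-force sketch of one possible reading: for each task, every full ranking compatible with the observed partial ranking is treated as equally likely, each system receives its average Borda points over those rankings, and the averages are summed across tasks. This is an illustration under that assumption, not the authors' implementation; the function names `expected_borda` and `aggregate` are hypothetical, and the enumeration only scales to a handful of systems.

```python
from itertools import permutations


def expected_borda(systems, observed_order):
    """Average Borda points per system over all full rankings that
    preserve the relative order in observed_order (best to worst).
    Systems absent from observed_order are treated as missing on this task.
    Assumption: all compatible full rankings are weighted equally."""
    n = len(systems)
    observed = set(observed_order)
    totals = {s: 0.0 for s in systems}
    n_compatible = 0
    for perm in permutations(systems):
        # A full ranking is compatible if, restricted to the observed
        # systems, it reproduces the observed partial ranking exactly.
        if [s for s in perm if s in observed] != list(observed_order):
            continue
        n_compatible += 1
        for position, s in enumerate(perm):
            totals[s] += n - 1 - position  # best gets n-1 points, worst gets 0
    return {s: t / n_compatible for s, t in totals.items()}


def aggregate(systems, task_orders):
    """Sum expected Borda points across tasks and rank systems by the total."""
    scores = {s: 0.0 for s in systems}
    for order in task_orders:
        for s, points in expected_borda(systems, order).items():
            scores[s] += points
    ranking = sorted(systems, key=lambda s: -scores[s])
    return ranking, scores


if __name__ == "__main__":
    systems = ["A", "B", "C"]
    # Task 1 ranks all three systems; task 2 has no score for C.
    ranking, scores = aggregate(systems, [["A", "B", "C"], ["A", "B"]])
    print(ranking, scores)
```

In the toy example, system C has no score on the second task, so it is placed in every slot relative to A and B with equal weight rather than being dropped or penalized. A practical version would replace the enumeration with a closed-form count of compatible rankings.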

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Multi-Criteria Decision Making