What are the best systems? New perspectives on NLP Benchmarking

Pierre Colombo; Nathan Noiry; Ekhine Irurozki; Stephan Clemencon

arXiv:2202.03799 · cs.CL · October 10, 2022 · 22 cites

TL;DR

This paper introduces a new, theoretically grounded method for aggregating NLP benchmark results across tasks, addressing issues with traditional averaging and providing more reliable system rankings.

Contribution

It proposes a novel ranking aggregation procedure based on social choice theory, improving the reliability of NLP system evaluations across multiple benchmarks.

Findings

01. The new method yields rankings that differ from those produced by mean aggregation.

02. It is more reliable and robust than mean aggregation when evaluating NLP systems.

03. Extensive experiments on real and synthetic data validate the approach.

Abstract

In Machine Learning, a benchmark refers to an ensemble of datasets associated with one or multiple metrics, together with a way to aggregate the performances of different systems. Benchmarks are instrumental in (i) assessing the progress of new methods along different axes and (ii) selecting the best systems for practical use. This is particularly the case for NLP, with the development of large pre-trained models (e.g. GPT, BERT) that are expected to generalize well on a variety of tasks. While the community has mainly focused on developing new datasets and metrics, there has been little interest in the aggregation procedure, which is often reduced to a simple average over various performance measures. However, this procedure can be problematic when the metrics are on different scales, which may lead to spurious conclusions. This paper proposes a new procedure to rank systems based on their performance…
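
The scale problem described in the abstract can be illustrated with a small sketch. The example below is not the paper's procedure, which this summary does not spell out. It swaps in the Borda count, a classical social-choice rule, as a stand-in for rank-based aggregation and compares it with plain averaging. The system names, the scores, and the function names `mean_ranking` and `borda_ranking` are all hypothetical.

```python
from typing import Dict, List

# Hypothetical scores for three systems on three tasks (higher is better).
# Tasks 0 and 1 use a 0-1 scale; task 2 uses a 0-100 scale.
example_scores: Dict[str, List[float]] = {
    "A": [0.80, 0.70, 60.0],
    "B": [0.75, 0.65, 90.0],
    "C": [0.90, 0.85, 50.0],
}


def mean_ranking(scores: Dict[str, List[float]]) -> List[str]:
    """Rank systems by the arithmetic mean of their raw task scores."""
    means = {name: sum(vals) / len(vals) for name, vals in scores.items()}
    return sorted(means, key=lambda name: means[name], reverse=True)


def borda_ranking(scores: Dict[str, List[float]]) -> List[str]:
    """Rank systems by Borda count: on each task a system earns one point
    per system it beats, and the points are summed over tasks.
    Ties within a task are not handled specially in this sketch."""
    systems = list(scores)
    n_tasks = len(next(iter(scores.values())))
    points = {name: 0 for name in systems}
    for task in range(n_tasks):
        # Sort from worst to best so the list index equals the points earned.
        ordered = sorted(systems, key=lambda name: scores[name][task])
        for position, name in enumerate(ordered):
            points[name] += position
    return sorted(points, key=lambda name: points[name], reverse=True)


print("mean :", mean_ranking(example_scores))
print("borda:", borda_ranking(example_scores))
```

With these made-up numbers, averaging is dominated by the 0-100 task and places B first, while the rank-based rule places C first because C wins two of the three tasks. Only the per-task orderings matter to the rank-based rule, so rescaling any single metric leaves its output unchanged.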

Code & Models

Repositories

pierrecolombo/rankingnlpsystems (official)

Videos

What are the best Systems? New Perspectives on NLP Benchmarking · slideslive

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Multi-Criteria Decision Making · Reinforcement Learning in Robotics

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Cosine Annealing · Weight Decay · Dropout · Dense Connections · Adam · Attention Dropout · Linear Warmup With Cosine Annealing