What are the best systems? New perspectives on NLP Benchmarking
Pierre Colombo, Nathan Noiry, Ekhine Irurozki, Stephan Clemencon

TL;DR
This paper introduces a new, theoretically grounded method for aggregating NLP benchmark results across tasks, addressing issues with traditional averaging and providing more reliable system rankings.
Contribution
It proposes a novel ranking aggregation procedure based on social choice theory, improving the reliability of NLP system evaluations across multiple benchmarks.
Findings
The new method yields different rankings than mean-aggregation.
The paper reports that it is more reliable and robust than mean aggregation for evaluating NLP systems.
Extensive experiments validate the approach on real and synthetic data.
Abstract
In Machine Learning, a benchmark refers to an ensemble of datasets associated with one or multiple metrics, together with a way to aggregate different systems' performances. Benchmarks are instrumental in (i) assessing the progress of new methods along different axes and (ii) selecting the best systems for practical use. This is particularly the case for NLP with the development of large pre-trained models (e.g. GPT, BERT) that are expected to generalize well on a variety of tasks. While the community has mainly focused on developing new datasets and metrics, there has been little interest in the aggregation procedure, which is often reduced to a simple average over various performance measures. However, this procedure can be problematic when the metrics are on different scales, which may lead to spurious conclusions. This paper proposes a new procedure to rank systems based on their performance…
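To make the scale problem concrete, here is a minimal sketch contrasting mean aggregation with a rank-based aggregation. It uses the Borda count, a classic social choice rule, purely as an illustrative stand-in; the summary above does not spell out the paper's exact procedure. The system names, task scores, and function names are hypothetical.

```python
# Hypothetical scores: three systems on three tasks, higher is better.
# Tasks 1 and 2 are on a 0-1 scale; task 3 is on a 0-100 scale.
scores = {
    "A": [0.90, 0.85, 42.0],
    "B": [0.80, 0.80, 45.0],
    "C": [0.70, 0.75, 40.0],
}


def mean_ranking(scores):
    """Rank systems by their raw average score (best first)."""
    means = {system: sum(vals) / len(vals) for system, vals in scores.items()}
    return sorted(means, key=means.get, reverse=True)


def borda_ranking(scores):
    """Rank systems by Borda count (best first).

    On each task, the worst system gets 0 points, the next gets 1, and so on.
    Only the per-task order matters, so the scale of each metric is irrelevant.
    Ties are not handled specially in this sketch.
    """
    systems = list(scores)
    n_tasks = len(next(iter(scores.values())))
    points = {system: 0 for system in systems}
    for task in range(n_tasks):
        worst_to_best = sorted(systems, key=lambda s: scores[s][task])
        for pts, system in enumerate(worst_to_best):
            points[system] += pts
    return sorted(points, key=points.get, reverse=True)


print("mean :", mean_ranking(scores))
print("borda:", borda_ranking(scores))
```

With these made-up numbers, the mean is dominated by the large-scale third task and puts B first, while the rank-based aggregation puts A first because A wins two of the three tasks.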
Code & Models
- 🤗 utter-project/EuroLLM-9B · model · 4.4k dl · ♡ 162
- 🤗 utter-project/EuroLLM-9B-Instruct · model · 11k dl · ♡ 202
- 🤗 QuantFactory/EuroLLM-9B-GGUF · model · 63 dl · ♡ 3
- 🤗 QuantFactory/EuroLLM-9B-Instruct-GGUF · model · 130 dl · ♡ 7
- 🤗 stelterlab/EuroLLM-9B-Instruct-AWQ · model · 499 dl
- 🤗 stelterlab/EuroLLM-9B-Instruct-MLX-4bit · model · 59 dl · ♡ 1
- 🤗 jncraton/EuroLLM-9B-Instruct-ct2-int8 · model · 2 dl
- 🤗 RichardErkhov/utter-project_-_EuroLLM-9B-Instruct-4bits · model
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Multi-Criteria Decision Making · Reinforcement Learning in Robotics
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Cosine Annealing · Weight Decay · Dropout · Dense Connections · Adam · Attention Dropout · Linear Warmup With Cosine Annealing
