Better than Average: Paired Evaluation of NLP Systems

Maxime Peyrard; Wei Zhao; Steffen Eger; Robert West

arXiv:2110.10746·cs.CL·October 22, 2021

TL;DR

This paper argues that NLP systems should be compared with pairwise aggregation methods, such as the Bradley-Terry model, rather than with averaged scores. It shows that the choice of aggregation changes system rankings and provides tools for adopting pairwise methods.

Contribution

The paper makes the theoretical and empirical case for aggregating instance-level evaluation scores through pairwise comparisons when ranking NLP systems, and it offers practical tools for adoption.

Findings

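01 Different aggregation mechanisms lead to different conclusions about which system is best in about 30% of the evaluated setups.

02 Pairwise methods better capture the relative performance of NLP systems because they use the instance-level pairing of scores, which the mean and the median ignore.

03 The Bradley-Terry model improves evaluation accuracy over simple averages.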

Abstract

Evaluation in NLP is usually done by comparing the scores of competing systems independently averaged over a common set of test instances. In this work, we question the use of averages for aggregating evaluation scores into a final number used to decide which system is best, since the average, as well as alternatives such as the median, ignores the pairing arising from the fact that systems are evaluated on the same test instances. We illustrate the importance of taking the instance-level pairing of evaluation scores into account and demonstrate, both theoretically and empirically, the advantages of aggregation methods based on pairwise comparisons, such as the Bradley-Terry (BT) model, a mechanism based on the estimated probability that a given system scores better than another on the test set. By re-evaluating 296 real NLP evaluation setups across four tasks and 18 evaluation metrics,…
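To make the contrast between averaging and pairwise aggregation concrete, here is a minimal sketch in Python. It is not the API of the paper's repository: the function names `win_matrix` and `bradley_terry`, the toy scores, and the choice to count a tie as half a win for each side are assumptions made for illustration. The fit uses the standard minorization-maximization update for Bradley-Terry strengths.

```python
import numpy as np


def win_matrix(scores):
    """Count pairwise wins from paired scores.

    scores: array of shape (n_systems, n_instances), higher is better.
    Returns W where W[i, j] is the number of instances on which system i
    scores higher than system j. Ties count as half a win for each side
    (an assumption of this sketch).
    """
    scores = np.asarray(scores, dtype=float)
    diff = scores[:, None, :] - scores[None, :, :]
    wins = (diff > 0).sum(axis=2) + 0.5 * (diff == 0).sum(axis=2)
    np.fill_diagonal(wins, 0.0)
    return wins


def bradley_terry(wins, n_iter=1000, tol=1e-10):
    """Estimate Bradley-Terry strengths with the classic MM update.

    Under the model, P(i beats j) = p[i] / (p[i] + p[j]).
    Returns strengths normalized to sum to 1.
    """
    n = wins.shape[0]
    games = wins + wins.T              # comparisons between each pair
    total_wins = wins.sum(axis=1)      # total wins of each system
    p = np.full(n, 1.0 / n)
    for _ in range(n_iter):
        denom = (games / (p[:, None] + p[None, :])).sum(axis=1)
        new_p = total_wins / denom
        new_p /= new_p.sum()
        converged = np.abs(new_p - p).max() < tol
        p = new_p
        if converged:
            break
    return p


# Hypothetical toy data: system A is slightly better on four instances,
# system B is much better on a single outlier instance.
scores = np.array([
    [0.50, 0.50, 0.50, 0.50, 0.10],   # system A
    [0.45, 0.45, 0.45, 0.45, 0.95],   # system B
])

means = scores.mean(axis=1)
strengths = bradley_terry(win_matrix(scores))
print("mean:", means)
print("BT strengths:", strengths)
```

On this toy data the mean prefers system B (0.55 against 0.42) because of the single outlier, while the Bradley-Terry strengths prefer system A (0.8 against 0.2) because A wins four of the five paired comparisons.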

Code & Models

Repositories

epfl-dlab/bt-eval (official)
