TL;DR
This paper argues that NLP systems should be compared with pairwise aggregation methods, such as the Bradley-Terry model, rather than with averaged scores. It shows that the choice of aggregation changes system rankings and provides tools for implementation.
Contribution
It introduces pairwise evaluation techniques for NLP system comparison, highlighting their advantages and offering practical tools for adoption.
Findings
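Across the 296 re-evaluated setups, different aggregation methods lead to different system rankings in 30% of cases.
Pairwise methods use the instance-level pairing of scores, which the mean and the median ignore, so they better capture the relative performance of NLP systems.
The Bradley-Terry model, which aggregates by the estimated probability that one system scores better than another, is shown theoretically and empirically to have advantages over simple averages.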
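A small illustration of why the aggregation method can change the ranking. This is a minimal sketch on made-up scores, not data from the paper: system names and values are hypothetical, and ties are simply not counted as wins.

```python
import numpy as np

# Hypothetical per-instance scores for two systems on the SAME five test instances.
scores_a = np.array([0.50, 0.50, 0.50, 0.50, 0.50])
scores_b = np.array([0.55, 0.55, 0.55, 0.55, 0.10])

# Averaging treats each system independently and ignores the pairing.
mean_a = scores_a.mean()
mean_b = scores_b.mean()

# A pairwise view asks: on what fraction of instances does B beat A?
win_rate_b = float(np.mean(scores_b > scores_a))

best_by_mean = "A" if mean_a > mean_b else "B"
best_by_pairs = "B" if win_rate_b > 0.5 else "A"

print(f"mean A={mean_a:.2f}, mean B={mean_b:.2f}, best by mean: {best_by_mean}")
print(f"B beats A on {win_rate_b:.0%} of instances, best by pairs: {best_by_pairs}")
```

Here one low-scoring instance drags down B's average, so the mean prefers A even though B scores higher on four of the five shared instances.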
Abstract
Evaluation in NLP is usually done by comparing the scores of competing systems independently averaged over a common set of test instances. In this work, we question the use of averages for aggregating evaluation scores into a final number used to decide which system is best, since the average, as well as alternatives such as the median, ignores the pairing arising from the fact that systems are evaluated on the same test instances. We illustrate the importance of taking the instance-level pairing of evaluation scores into account and demonstrate, both theoretically and empirically, the advantages of aggregation methods based on pairwise comparisons, such as the Bradley-Terry (BT) model, a mechanism based on the estimated probability that a given system scores better than another on the test set. By re-evaluating 296 real NLP evaluation setups across four tasks and 18 evaluation metrics,…
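The Bradley-Terry aggregation described in the abstract can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' released tooling: the function names `pairwise_wins` and `bradley_terry` are hypothetical, ties are dropped, every system is assumed to win at least one comparison, and the strengths are fitted with a standard minorization-maximization iteration for the model P(i beats j) = p_i / (p_i + p_j).

```python
import numpy as np

def pairwise_wins(scores):
    """wins[i, j] = number of instances where system i scores strictly higher than j.

    scores has shape (n_systems, n_instances); all systems share the same instances.
    """
    scores = np.asarray(scores, dtype=float)
    return (scores[:, None, :] > scores[None, :, :]).sum(axis=2).astype(float)

def bradley_terry(wins, n_iter=1000, tol=1e-10):
    """Fit Bradley-Terry strengths from a win-count matrix (assumed sketch).

    Assumes every system has at least one win and at least one decisive comparison.
    """
    n = wins.shape[0]
    comparisons = wins + wins.T          # decisive comparisons per pair
    total_wins = wins.sum(axis=1)        # total wins per system
    strengths = np.full(n, 1.0 / n)      # uniform starting point
    for _ in range(n_iter):
        denom = comparisons / (strengths[:, None] + strengths[None, :])
        np.fill_diagonal(denom, 0.0)     # a system is never compared with itself
        updated = total_wins / denom.sum(axis=1)
        updated /= updated.sum()         # fix the scale: strengths sum to 1
        converged = np.max(np.abs(updated - strengths)) < tol
        strengths = updated
        if converged:
            break
    return strengths

# Hypothetical scores for three systems (rows) on five shared instances (columns).
scores = np.array([
    [0.50, 0.50, 0.50, 0.50, 0.50],   # system 0
    [0.55, 0.55, 0.55, 0.55, 0.10],   # system 1
    [0.40, 0.60, 0.45, 0.52, 0.30],   # system 2
])

strengths = bradley_terry(pairwise_wins(scores))
best_by_bt = int(np.argmax(strengths))
best_by_mean = int(np.argmax(scores.mean(axis=1)))
print("BT strengths:", np.round(strengths, 3))
print("best by BT:", best_by_bt, "| best by mean:", best_by_mean)
```

On these made-up scores the mean favors system 0, while the fitted strengths favor system 1, which wins most of its instance-level comparisons. Handling ties, systems with no wins, and uncertainty estimates would need extra care beyond this sketch.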
