TL;DR
This paper compares the reliability of global pointwise scores and pairwise comparisons in NLP model evaluation, highlighting their respective strengths and weaknesses for different benchmarking scenarios.
Contribution
It provides an empirical analysis of global and pairwise scoring methods, offering insights into their effectiveness for NLP model benchmarking.
Findings
Global scores give more reliable overall rankings, but they can underestimate strong models that make rare, significant errors or have low confidence.
Pairwise comparisons excel at identifying strong contenders among models that rank lower overall.
Pairwise methods need more comparisons when ties are common.
Abstract
With the advent of highly capable instruction-tuned neural language models, benchmarking in natural language processing (NLP) is increasingly shifting towards pairwise comparison leaderboards, such as LMSYS Arena, from traditional global pointwise scores (e.g., GLUE, BIG-bench, SWE-bench). This paper empirically investigates the strengths and weaknesses of both global scores and pairwise comparisons to aid decision-making in selecting appropriate model evaluation strategies. Through computational experiments on synthetic and real-world datasets using standard global metrics and the popular Bradley-Terry model for pairwise comparisons, we found that while global scores provide more reliable overall rankings, they can underestimate strong models with rare, significant errors or low confidence. Conversely, pairwise comparisons are particularly effective for identifying strong contenders…
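The abstract names the Bradley-Terry model as the pairwise method but, as excerpted here, does not show how it turns comparisons into a ranking. The sketch below is one common way to fit it (a minorization-maximization update), not necessarily the authors' implementation. The function names, the 3-model win matrix, and the convention of counting a tie as half a win for each side are all illustrative assumptions, and the fit assumes every model has at least one win.

```python
import numpy as np


def fit_bradley_terry(wins, n_iter=500, tol=1e-10):
    """Estimate Bradley-Terry strengths from a matrix of pairwise outcomes.

    wins[i, j] is the number of times model i beat model j. Ties, if any,
    are assumed to have been added as 0.5 to both wins[i, j] and wins[j, i]
    (an illustrative convention, not taken from the paper).
    Assumes every model has at least one win.
    """
    wins = np.asarray(wins, dtype=float)
    n = wins.shape[0]
    games = wins + wins.T            # comparisons played between each pair
    total_wins = wins.sum(axis=1)    # wins per model across all opponents
    p = np.full(n, 1.0 / n)          # start from equal strengths
    for _ in range(n_iter):
        # MM update: p_i <- W_i / sum_j n_ij / (p_i + p_j)
        denom = (games / (p[:, None] + p[None, :])).sum(axis=1)
        new_p = total_wins / denom
        new_p /= new_p.sum()         # strengths are only defined up to scale
        converged = np.max(np.abs(new_p - p)) < tol
        p = new_p
        if converged:
            break
    return p


def win_probability(p, i, j):
    """Bradley-Terry probability that model i beats model j."""
    return p[i] / (p[i] + p[j])


# Hypothetical outcomes for three models (rows beat columns).
example_wins = np.array([
    [0, 7, 8],
    [3, 0, 6],
    [2, 4, 0],
])
strengths = fit_bradley_terry(example_wins)
ranking = np.argsort(-strengths)     # model indices, strongest first
```

A global pointwise ranking would instead sort models by an average per-example metric; the pairwise fit uses only who beat whom, which is why it needs enough comparisons between each pair to separate closely matched models.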