Confidence and Stability of Global and Pairwise Scores in NLP Evaluation

Georgii Levtsov; Dmitry Ustalov

arXiv:2507.01633 · cs.CL · September 24, 2025

TL;DR

This paper compares the reliability of global pointwise scores and pairwise comparisons in NLP model evaluation, highlighting their respective strengths and weaknesses for different benchmarking scenarios.

Contribution

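The paper provides an empirical analysis of global scores and pairwise comparisons, run on synthetic and real-world datasets with standard global metrics and the Bradley-Terry model, to help practitioners choose an evaluation strategy for NLP model benchmarking.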

Findings

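01. Global scores give more reliable overall rankings, but they can underestimate strong models that make rare, significant errors or have low confidence.

02. Pairwise comparisons are particularly effective at identifying strong contenders among models that rank lower on global scores.

03. Pairwise methods need more comparisons when ties are common.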

Abstract

With the advent of highly capable instruction-tuned neural language models, benchmarking in natural language processing (NLP) is increasingly shifting towards pairwise comparison leaderboards, such as LMSYS Arena, from traditional global pointwise scores (e.g., GLUE, BIG-bench, SWE-bench). This paper empirically investigates the strengths and weaknesses of both global scores and pairwise comparisons to aid decision-making in selecting appropriate model evaluation strategies. Through computational experiments on synthetic and real-world datasets using standard global metrics and the popular Bradley-Terry model for pairwise comparisons, we found that while global scores provide more reliable overall rankings, they can underestimate strong models with rare, significant errors or low confidence. Conversely, pairwise comparisons are particularly effective for identifying strong contenders…
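The abstract names the Bradley-Terry model as the pairwise method. As a rough illustration of what such a model computes, here is a minimal sketch that fits Bradley-Terry strengths from a win-count matrix using the standard minorization-maximization update. It is not taken from the paper's code: the function name `fit_bradley_terry`, the toy `wins` matrix, and the model labels are all hypothetical, and the sketch assumes ties have already been dropped or split between the two sides (one common convention is to add 0.5 to both cells).

```python
def fit_bradley_terry(wins, n_iter=1000, tol=1e-10):
    """Estimate Bradley-Terry strengths with the MM update.

    wins[i][j] is the number of times item i beat item j.
    Returns a list of positive strengths normalized to sum to 1.
    Assumes every item has at least one win and one loss.
    """
    n = len(wins)
    p = [1.0 / n] * n  # start from equal strengths
    for _ in range(n_iter):
        new_p = []
        for i in range(n):
            # Total wins of item i against everyone else.
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            # Comparisons with each opponent, weighted by current strengths.
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            new_p.append(total_wins / denom if denom > 0 else p[i])
        norm = sum(new_p)
        new_p = [x / norm for x in new_p]
        delta = max(abs(a - b) for a, b in zip(new_p, p))
        p = new_p
        if delta < tol:  # stop once the strengths no longer move
            break
    return p


# Hypothetical toy data: three models, ten comparisons per pair.
models = ["model_a", "model_b", "model_c"]
wins = [
    [0, 7, 9],  # model_a beat model_b 7 times and model_c 9 times
    [3, 0, 6],  # model_b beat model_a 3 times and model_c 6 times
    [1, 4, 0],  # model_c beat model_a once and model_b 4 times
]
scores = fit_bradley_terry(wins)
ranking = sorted(zip(models, scores), key=lambda pair: pair[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Under this model, the estimated probability that item i beats item j is `p[i] / (p[i] + p[j])`, so the strengths induce a ranking from pairwise outcomes alone, with no per-item global score. How ties are handled changes the effective number of informative comparisons, which is relevant to the finding that pairwise methods need more comparisons when ties are common.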
