When LLM Judge Scores Look Good but Best-of-N Decisions Fail

Eddie Landesberg

arXiv:2603.12520·cs.LG·March 16, 2026

TL;DR

An LLM judge with moderate global correlation to reference labels (r = 0.47) can still select poorly in best-of-n settings. Selection depends on within-prompt ranking rather than global agreement, and explicit pairwise judging recovers much of the signal that coarse pointwise scoring loses.

Contribution

The paper shows that a single global correlation metric is insufficient for validating a judge used for best-of-n selection, and finds that explicit pairwise judging selects more effectively than pointwise scoring.

Findings

01

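A judge with moderate global correlation (r = 0.47) captures only 21.0% of the improvement that perfect selection would achieve over random choice (5,000-prompt best-of-4 benchmark from Chatbot Arena).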

02

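Within-prompt correlation is only r_within = 0.27, compared with the global r = 0.47, and coarse pointwise scoring creates ties in 67% of pairwise comparisons.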

03

In a matched-pair best-of-2 audit, explicit pairwise judging raises recovery from 21.1% to 61.2%.
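
To illustrate how a global correlation can exceed a within-prompt one, here is a minimal sketch on synthetic data. It is not the paper's code: the function name `global_and_within_corr`, the choice to define within-prompt correlation as the pooled Pearson correlation after subtracting each prompt's mean, and the simulated scores are all assumptions made for illustration.

```python
import numpy as np


def global_and_within_corr(judge, ref):
    """Return (r_global, r_within) for score arrays of shape (n_prompts, n_candidates).

    Assumption: r_within is the pooled Pearson correlation after removing each
    prompt's mean from both score arrays. The paper may define it differently.
    """
    judge = np.asarray(judge, dtype=float)
    ref = np.asarray(ref, dtype=float)

    # Global correlation: pool every (prompt, candidate) pair.
    r_global = np.corrcoef(judge.ravel(), ref.ravel())[0, 1]

    # Within-prompt correlation: subtract the per-prompt mean first, so
    # prompt-level baseline effects no longer contribute to agreement.
    judge_c = judge - judge.mean(axis=1, keepdims=True)
    ref_c = ref - ref.mean(axis=1, keepdims=True)
    r_within = np.corrcoef(judge_c.ravel(), ref_c.ravel())[0, 1]
    return r_global, r_within


# Synthetic data (hypothetical): a strong prompt-level baseline shared by all
# candidates of a prompt, plus a weaker candidate-specific quality signal.
rng = np.random.default_rng(0)
n_prompts, n_candidates = 2000, 4
baseline = rng.normal(size=(n_prompts, 1))
quality = rng.normal(scale=0.5, size=(n_prompts, n_candidates))
ref = baseline + quality
judge = baseline + 0.3 * quality + rng.normal(scale=0.5, size=(n_prompts, n_candidates))

r_global, r_within = global_and_within_corr(judge, ref)
print(f"r_global={r_global:.2f}  r_within={r_within:.2f}")
```

In this toy setup the judge tracks the prompt-level baseline well but the candidate-level differences poorly, so the pooled correlation is high while the within-prompt correlation, the quantity that matters for picking among candidates of the same prompt, is much lower.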

Abstract

Large language models are often used as judges to score candidate responses, then validated with a single global metric such as correlation with reference labels. This can be misleading when the real deployment task is best-of-n selection within a prompt. In a 5,000-prompt best-of-4 benchmark from Chatbot Arena, a judge with moderate global correlation (r = 0.47) captures only 21.0% of the improvement that perfect selection would achieve over random choice. The gap arises because global agreement is driven largely by prompt-level baseline effects, while selection depends on within-prompt ranking: within-prompt correlation is only r_within = 0.27, and coarse pointwise scoring creates ties in 67% of pairwise comparisons. In a matched-pair best-of-2 audit, explicit pairwise judging recovers much of this lost signal, raising recovery from 21.1% to 61.2%. For judge-based selection, the…
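
The recovery and tie figures in the abstract can be made concrete with a small sketch. The definitions below are assumptions inferred from the abstract's wording, not the paper's implementation: recovery is taken as the judge's gain over random choice divided by the oracle's gain over random choice, ties in judge scores are resolved by averaging over the tied top candidates (the expected value under uniform random tie-breaking), and the function names and toy scores are hypothetical.

```python
import numpy as np


def recovery_fraction(judge, ref):
    """Share of the oracle-over-random improvement captured by judge selection.

    judge, ref: arrays of shape (n_prompts, n_candidates).
    Assumed definition: (judge - random) / (oracle - random), on mean reference value.
    """
    judge = np.asarray(judge, dtype=float)
    ref = np.asarray(ref, dtype=float)

    # Candidates tied for the top judge score within each prompt.
    is_top = judge == judge.max(axis=1, keepdims=True)
    # Expected reference value of the pick under uniform random tie-breaking.
    judge_value = (ref * is_top).sum(axis=1) / is_top.sum(axis=1)

    random_value = ref.mean(axis=1)  # expected value of a random pick
    oracle_value = ref.max(axis=1)   # value of perfect selection

    gain = judge_value.mean() - random_value.mean()
    headroom = oracle_value.mean() - random_value.mean()
    return float(gain / headroom)


def pairwise_tie_rate(judge):
    """Fraction of within-prompt candidate pairs that receive equal judge scores."""
    judge = np.asarray(judge)
    i, j = np.triu_indices(judge.shape[1], k=1)  # all unordered candidate pairs
    return float(np.mean(judge[:, i] == judge[:, j]))


# Toy best-of-4 data (hypothetical): coarse integer judge scores, real-valued labels.
judge_scores = [[3, 3, 4, 2],
                [5, 5, 5, 5]]
ref_scores = [[0.2, 0.9, 0.5, 0.1],
              [0.4, 0.8, 0.6, 0.2]]

print("recovery:", recovery_fraction(judge_scores, ref_scores))
print("tie rate:", pairwise_tie_rate(judge_scores))
```

The second prompt shows why coarse scoring hurts: when every candidate gets the same score, the judge's pick is no better than random for that prompt, so it contributes nothing to recovery regardless of how well the scores correlate with labels across prompts.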

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Ethics and Social Impacts of AI · Imbalanced Data Classification Techniques