Mediocrity is the key for LLM as a Judge Anchor Selection

Shachar Don-Yehiya; Asaf Yehudai; Leshem Choshen; Omri Abend

arXiv:2603.16848·cs.CL·March 18, 2026

TL;DR

This paper investigates how the choice of anchor model affects the reliability of LLM-as-a-judge evaluations. It shows that a poor anchor can substantially reduce agreement with human rankings, and it proposes guidelines for choosing better anchors.

Contribution

It systematically analyzes the impact of anchor selection in LLM evaluation, providing empirical evidence and actionable recommendations for more reliable benchmarking.

Findings

01 The choice of anchor significantly affects how well the evaluation correlates with human rankings.

02 Extreme anchors (the best-performing or worst-performing models) are poor choices for reliable evaluation.

03 Standard benchmark sizes are often insufficient for distinguishing competitive models.
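Findings 02 and 03 can be illustrated with a toy calculation. The following is a minimal sketch, not the paper's method: it assumes a Bradley-Terry style model in which each model has a scalar skill and the win probability against the anchor is a logistic function of the skill difference. The skill values, the prompt count of 500, and the independence of the two win rates are all hypothetical simplifications chosen for illustration.

```python
import math

def win_prob(model_skill: float, anchor_skill: float) -> float:
    """Assumed Bradley-Terry style probability that the model beats the anchor."""
    return 1.0 / (1.0 + math.exp(-(model_skill - anchor_skill)))

def win_rate_gap(skill_a: float, skill_b: float, anchor_skill: float) -> float:
    """Difference in expected win rate against the anchor between models A and B."""
    return win_prob(skill_a, anchor_skill) - win_prob(skill_b, anchor_skill)

def gap_standard_error(skill_a: float, skill_b: float,
                       anchor_skill: float, n_prompts: int) -> float:
    """Normal-approximation standard error of the gap.

    Simplification: treats the two win rates as independent binomial
    proportions, each estimated from n_prompts judgements.
    """
    pa = win_prob(skill_a, anchor_skill)
    pb = win_prob(skill_b, anchor_skill)
    return math.sqrt(pa * (1 - pa) / n_prompts + pb * (1 - pb) / n_prompts)

def signal_to_noise(skill_a: float, skill_b: float,
                    anchor_skill: float, n_prompts: int) -> float:
    """Expected gap divided by its standard error (larger is easier to separate)."""
    gap = win_rate_gap(skill_a, skill_b, anchor_skill)
    return gap / gap_standard_error(skill_a, skill_b, anchor_skill, n_prompts)

# Hypothetical values: two close models, a mid-range anchor, and a very strong anchor.
SKILL_A, SKILL_B = 0.2, 0.0
MEDIOCRE_ANCHOR, EXTREME_ANCHOR = 0.1, 4.0
N_PROMPTS = 500  # illustrative benchmark size, not taken from the paper

snr_mediocre = signal_to_noise(SKILL_A, SKILL_B, MEDIOCRE_ANCHOR, N_PROMPTS)
snr_extreme = signal_to_noise(SKILL_A, SKILL_B, EXTREME_ANCHOR, N_PROMPTS)
print(f"mediocre anchor: {snr_mediocre:.2f}, extreme anchor: {snr_extreme:.2f}")
```

Under these assumptions, both models lose to the extreme anchor almost every time, so the gap between their win rates shrinks faster than the sampling noise does, and the two models become harder to tell apart at the same benchmark size.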

Abstract

The "LLM-as-a-judge" paradigm has become a standard method for evaluating open-ended generation. To address the quadratic scalability costs of pairwise comparisons, popular benchmarks like Arena-Hard and AlpacaEval compare all models against a single anchor. However, despite its widespread use, the impact of anchor selection on the reliability of the results remains largely unexplored. In this work, we systematically investigate the effect of anchor selection by evaluating 22 different anchors on the Arena-Hard-v2.0 dataset. We find that the choice of anchor is critical: a poor anchor can dramatically reduce correlation with human rankings. We identify that common anchor choices (best-performing and worst-performing models) make poor anchors. Because these extreme anchors are consistently better or worse than all other models, they are seldom indicative of the relative ranking of the…
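The cost argument in the abstract can be made concrete with a short sketch. It assumes nothing about the paper's implementation: the function names, the model count, and the win rates below are hypothetical and only illustrate how a single anchor replaces a full round-robin and how models are then ranked by their win rate against that anchor.

```python
def pairwise_comparisons(n_models: int) -> int:
    """Model pairs in a full round-robin: n * (n - 1) / 2, which grows quadratically."""
    return n_models * (n_models - 1) // 2

def anchor_comparisons(n_models: int) -> int:
    """Model pairs when every other model is compared against one anchor."""
    return n_models - 1

def rank_by_anchor_win_rate(win_rates: dict[str, float]) -> list[str]:
    """Order models from highest to lowest win rate against the anchor."""
    return sorted(win_rates, key=win_rates.get, reverse=True)

# Illustrative model count only.
print(pairwise_comparisons(22))  # 231 pairs for a full round-robin
print(anchor_comparisons(22))    # 21 pairs with a single anchor

# Hypothetical judge win rates against some anchor (made-up numbers).
example_win_rates = {"model_a": 0.40, "model_b": 0.70, "model_c": 0.55}
print(rank_by_anchor_win_rate(example_win_rates))
```

The resulting ranking depends entirely on which anchor produced the win rates, which is why the choice of anchor matters for agreement with human rankings.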

Code & Models

Datasets

ibm-research/900K-Judgements
dataset · 91 dl

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Ethics and Social Impacts of AI · Topic Modeling