Limits to scalable evaluation at the frontier: LLM as Judge won't beat twice the data

Florian E. Dorner; Vivian Y. Nastl; Moritz Hardt

arXiv:2410.13341 · cs.LG · January 7, 2026

TL;DR

This paper studies the limits of using large language models as judges for model evaluation. It shows that when the judge is no more accurate than the model being evaluated, no debiasing method can reduce the number of ground truth labels required by more than half, which limits the approach at the evaluation frontier.

Contribution

The paper establishes a theoretical limit on debiasing in LLM-based evaluation: when the judge is no more accurate than the evaluated model, debiasing can save at most half of the ground truth labels.

Findings

01. When the judge is no more accurate than the evaluated model, no debiasing method can reduce the required number of ground truth labels by more than half.

02. Empirical results show that the sample size savings achievable in practice are even smaller than the theoretical bound allows.

03. The results expose limitations of the LLM-as-a-judge paradigm at the frontier of model evaluation.
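
To make "at most half the labels" concrete, the sketch below computes how many labels a debiased estimate can save in a toy setting. It is not the paper's proof or setup. The noise model (the judge reports the true binary score, flipped independently with a fixed probability), the function name `label_savings_factor`, and its parameters are assumptions made for illustration. The formula 1 / (1 - rho^2) is the standard control-variate variance reduction in the idealized case of unlimited judge labels, where rho is the correlation between true scores and judge scores.

```python
def label_savings_factor(p_correct: float, judge_agreement: float) -> float:
    """Factor by which judge labels stretch each ground truth label.

    Toy model (an assumption, not the paper's setup):
      - the true score is 1 with probability p_correct, else 0;
      - the judge reports the true score with probability judge_agreement
        and the flipped score otherwise, independently of the true score.
    Both arguments must lie strictly between 0 and 1.
    """
    p, a = p_correct, judge_agreement
    p_judge = p * a + (1 - p) * (1 - a)   # P(judge says "correct")
    cov = p * a - p * p_judge             # Cov(true score, judge score)
    var_true = p * (1 - p)
    var_judge = p_judge * (1 - p_judge)
    rho_sq = cov ** 2 / (var_true * var_judge)
    # With an optimally weighted judge correction and unlimited judge
    # labels, the estimator's variance shrinks by (1 - rho^2), so n labels
    # behave like n / (1 - rho^2) labels without a judge.
    return 1.0 / (1.0 - rho_sq)


if __name__ == "__main__":
    # Judge agreement equal to the evaluated model's accuracy.
    print(round(label_savings_factor(0.9, 0.9), 2))   # 1.64
    # A judge that is a coin flip carries no information.
    print(round(label_savings_factor(0.7, 0.5), 2))   # 1.0
```

In this toy model, a factor of 2 would mean that half the ground truth labels suffice. With both values set to 0.9 the factor is 1.64, below 2, which is in line with the limit stated above.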

Abstract

High quality annotations are increasingly a bottleneck in the explosively growing machine learning ecosystem. Scalable evaluation methods that avoid costly annotation have therefore become an important research ambition. Many hope to use strong existing models in lieu of costly labels to provide cheap model evaluations. Unfortunately, this method of using models as judges introduces biases, such as self-preferencing, that can distort model comparisons. An emerging family of debiasing tools promises to fix these issues by using a few high quality labels to debias a large number of model judgments. In this paper, we study how far such debiasing methods, in principle, can go. Our main result shows that when the judge is no more accurate than the evaluated model, no debiasing method can decrease the required amount of ground truth labels by more than half. Our result speaks to the severe…

Videos

Limits to scalable evaluation at the frontier: LLM as judge won’t beat twice the data · SlidesLive
