Limits to scalable evaluation at the frontier: LLM as Judge won't beat twice the data
Florian E. Dorner, Vivian Y. Nastl, Moritz Hardt

TL;DR
This paper investigates the limits of using large language models as evaluators. It shows that when the judge is no more accurate than the model being evaluated, debiasing cannot cut the required number of ground truth labels by more than half, which matters most at the evaluation frontier.
Contribution
The paper establishes a theoretical limit on how much debiasing can help in LLM-based evaluation: when the judge is no more accurate than the evaluated model, no debiasing method can reduce the required amount of ground truth labels by more than half.
Findings
No debiasing method can reduce the required number of ground truth labels by more than half when the judge is no more accurate than the evaluated model.
Empirical results show practical sample size savings are even more limited than theoretical bounds.
Together, these results identify limitations of the LLM-as-a-judge paradigm at the frontier of model evaluation.
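To make the "at most half" finding concrete, here is a minimal sketch in a toy setting. It is an illustration under stated assumptions, not the paper's construction or proof. It assumes each item has a binary ground truth correctness label, that the judge flips this label independently with a fixed probability, and that the debiasing method is a control-variate style estimator with unlimited judge labels and an optimally tuned coefficient. The function name `savings_factor` and the noise model are hypothetical choices made for this example.

```python
def savings_factor(model_acc: float, judge_acc: float) -> float:
    """Best-case label savings of a control-variate estimator in a toy model.

    Assumed toy model (not taken from the paper):
      y ~ Bernoulli(model_acc) is the ground truth correctness on one item;
      the judge reports j = y with probability judge_acc, else 1 - y.
    With unlimited judge labels and an optimally tuned coefficient, the
    estimator's variance is Var(y) * (1 - rho^2), where rho is the
    correlation between y and j. So n ground truth labels are worth
    n / (1 - rho^2) plain labels. Requires both inputs strictly in (0, 1).
    """
    p, q = model_acc, judge_acc
    mean_j = p * q + (1 - p) * (1 - q)   # P(j = 1)
    cov = p * q - p * mean_j             # Cov(y, j) = E[yj] - E[y]E[j]
    var_y = p * (1 - p)
    var_j = mean_j * (1 - mean_j)
    rho_sq = cov * cov / (var_y * var_j)
    return 1.0 / (1.0 - rho_sq)


if __name__ == "__main__":
    # Judge exactly as accurate as the evaluated model.
    for acc in (0.6, 0.8, 0.9, 0.99):
        print(f"accuracy={acc}: savings factor={savings_factor(acc, acc):.2f}")
```

In this toy model, with equal judge and model accuracy `p`, the factor simplifies to `1 + (2p - 1)^2`, which stays below 2 for any `p < 1`. That is consistent with the stated bound, but it only covers one simple noise model and one estimator family.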
Abstract
High quality annotations are increasingly a bottleneck in the explosively growing machine learning ecosystem. Scalable evaluation methods that avoid costly annotation have therefore become an important research ambition. Many hope to use strong existing models in lieu of costly labels to provide cheap model evaluations. Unfortunately, this method of using models as judges introduces biases, such as self-preferencing, that can distort model comparisons. An emerging family of debiasing tools promises to fix these issues by using a few high quality labels to debias a large number of model judgments. In this paper, we study how far such debiasing methods, in principle, can go. Our main result shows that when the judge is no more accurate than the evaluated model, no debiasing method can decrease the required amount of ground truth labels by more than half. Our result speaks to the severe…
Taxonomy
Topics: Medical Malpractice and Liability Issues · Legal Education and Practice Innovations
