Calibrate, Don't Curate: Label-Efficient Estimation from Noisy LLM Judges

Yanran Li

arXiv:2605.09702 · stat.ME · May 12, 2026

TL;DR

This paper shows that when the goal of multi-judge LLM evaluation is calibrated probabilistic assessment rather than point accuracy, keeping every judge and calibrating the panel on a labeled set outperforms keeping only the most accurate judges.

Contribution

It challenges the common heuristic of discarding weaker judges: with the aggregation and calibration procedures held fixed, the calibrated full panel outperforms accuracy-ranked top-k selection across four labeled pairwise-evaluation benchmarks.

Findings

01

On RewardBench2, the calibrated full panel reaches a negative log-likelihood (NLL) of 0.006 versus 0.013 under top-5 selection, roughly halving the error.

02

Including all judges lowers (improves) NLL in reward-model evaluation.

03

Even below-chance judges can be valuable if their biases are learnable.
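
To make the third finding concrete, here is a minimal sketch on synthetic data, not the paper's procedure. It assumes a single hypothetical judge whose binary vote matches the true label only 30% of the time, and a simple lookup-table calibrator (the names `simulate_judge` and `fit_vote_table` are illustrative). Because the judge's bias is consistent, a small labeled calibration set is enough to learn that its votes should be read in reverse.

```python
import random

def simulate_judge(labels, accuracy, rng):
    """Return binary votes that match the true label with probability `accuracy`."""
    return [y if rng.random() < accuracy else 1 - y for y in labels]

def fit_vote_table(votes, labels, smoothing=1.0):
    """Estimate P(y = 1 | vote) for each vote value, with Laplace smoothing."""
    table = {}
    for v in (0, 1):
        matched = [y for vote, y in zip(votes, labels) if vote == v]
        table[v] = (sum(matched) + smoothing) / (len(matched) + 2 * smoothing)
    return table

rng = random.Random(0)
cal_labels = [rng.randint(0, 1) for _ in range(2000)]
test_labels = [rng.randint(0, 1) for _ in range(2000)]

# Hypothetical below-chance judge: agrees with the truth only 30% of the time.
cal_votes = simulate_judge(cal_labels, 0.3, rng)
test_votes = simulate_judge(test_labels, 0.3, rng)

# Learn the judge's bias from the labeled calibration set.
table = fit_vote_table(cal_votes, cal_labels)

# Accuracy of the raw votes versus votes read through the calibration table.
raw_accuracy = sum(v == y for v, y in zip(test_votes, test_labels)) / len(test_labels)
calibrated_accuracy = sum(
    (table[v] > 0.5) == (y == 1) for v, y in zip(test_votes, test_labels)
) / len(test_labels)

print(f"raw accuracy:        {raw_accuracy:.3f}")
print(f"calibrated accuracy: {calibrated_accuracy:.3f}")
```

In this toy setting the calibrated judge carries as much signal as a judge that is right 70% of the time, which is why an accuracy-based filter would discard useful information.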

Abstract

Multi-judge evaluation is increasingly used to assess LLMs and reward models, and the prevailing heuristic is to curate: keep the most accurate judges and discard weaker ones. We show that this heuristic can reverse when the target is not point accuracy, but calibrated probabilistic evaluation from a labeled calibration set. Holding the aggregation and calibration procedures fixed, we compare accuracy-ranked top-$k$ judge selection with using the full judge panel. Across four labeled pairwise-evaluation benchmarks spanning LLM-as-judge and reward-model settings, the calibrated full panel consistently outperforms accuracy-based selection. On RewardBench2, retaining all judges achieves negative log-likelihood (NLL) of $0.006$ versus $0.013$ under top-5 selection, halving the calibration error. This advantage persists after judge-family deduplication and against stronger same-pipeline…
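
The comparison described in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the paper's pipeline: the data are synthetic, the seven judge accuracies are hypothetical, judges vote independently given the label, and a naive Bayes aggregator fit on the calibration set stands in for whatever aggregation and calibration procedure the paper actually uses. The same fitted aggregator is applied to both the accuracy-ranked top-k subset and the full panel, and the two are compared by held-out NLL.

```python
import math
import random

def simulate_panel(labels, accuracies, rng):
    """Judge j votes independently and matches the label with prob. accuracies[j]."""
    return [[y if rng.random() < a else 1 - y for a in accuracies] for y in labels]

def fit_judge_rates(votes, labels, smoothing=1.0):
    """Per judge, estimate [P(vote=1 | y=0), P(vote=1 | y=1)] with Laplace smoothing."""
    rates = []
    for j in range(len(votes[0])):
        per_class = []
        for y in (0, 1):
            col = [row[j] for row, lab in zip(votes, labels) if lab == y]
            per_class.append((sum(col) + smoothing) / (len(col) + 2 * smoothing))
        rates.append(per_class)
    return rates

def predict_proba(row, rates, judge_ids, prior=0.5):
    """Naive Bayes posterior P(y=1 | votes), using only the judges in judge_ids."""
    log_odds = math.log(prior / (1 - prior))
    for j in judge_ids:
        p0, p1 = rates[j]
        if row[j] == 1:
            log_odds += math.log(p1 / p0)
        else:
            log_odds += math.log((1 - p1) / (1 - p0))
    return 1 / (1 + math.exp(-log_odds))

def mean_nll(votes, labels, rates, judge_ids):
    """Average negative log-likelihood of the true labels."""
    total = 0.0
    for row, y in zip(votes, labels):
        p = predict_proba(row, rates, judge_ids)
        total -= math.log(p if y == 1 else 1 - p)
    return total / len(labels)

rng = random.Random(0)
accuracies = [0.90, 0.85, 0.80, 0.60, 0.55, 0.30, 0.25]  # hypothetical panel
cal_y = [rng.randint(0, 1) for _ in range(2000)]
test_y = [rng.randint(0, 1) for _ in range(2000)]
cal_votes = simulate_panel(cal_y, accuracies, rng)
test_votes = simulate_panel(test_y, accuracies, rng)

# Fit once on the calibration set; reuse the same fit for both judge subsets.
rates = fit_judge_rates(cal_votes, cal_y)

# Rank judges by calibration-set accuracy and keep the top k.
n_judges = len(accuracies)
cal_acc = [sum(r[j] == y for r, y in zip(cal_votes, cal_y)) / len(cal_y)
           for j in range(n_judges)]
k = 3
top_k = sorted(range(n_judges), key=lambda j: cal_acc[j], reverse=True)[:k]
full = list(range(n_judges))

nll_top_k = mean_nll(test_votes, test_y, rates, top_k)
nll_full = mean_nll(test_votes, test_y, rates, full)
print(f"top-{k} NLL: {nll_top_k:.4f}   full-panel NLL: {nll_full:.4f}")
```

In this simulation the discarded judges, including the two below-chance ones, still contribute independent evidence, so the full panel should score a lower NLL. Real judge panels are correlated, which is presumably why the abstract also reports a judge-family deduplication check.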
