Calibrate, Don't Curate: Label-Efficient Estimation from Noisy LLM Judges
Yanran Li

TL;DR
This paper demonstrates that in multi-judge evaluation for LLMs, retaining all judges and calibrating their outputs yields better probabilistic assessments than selecting only the most accurate judges.
Contribution
It challenges the common heuristic of discarding weaker judges, showing that calibrated full-panel evaluation outperforms accuracy-based selection across multiple benchmarks.
Findings
On RewardBench2, calibrating the full panel halves the calibration error relative to top-5 selection.
Including all judges improves negative log-likelihood in reward model evaluation.
Even below-chance judges can be valuable if their biases are learnable.
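The last finding can be made concrete with a toy sketch. It is not the paper's aggregation or calibration procedure: it swaps in a simple naive-Bayes log-odds weighting, assumes symmetric and conditionally independent judges with a uniform prior, and uses made-up votes for two hypothetical judges (`judge_a`, `judge_b`). For brevity the weights are fit and scored on the same ten items.

```python
import math

def judge_weight(votes, labels, smoothing=1.0):
    """Log-odds weight for one judge, estimated from a labeled calibration set."""
    agree = sum(1 for v, y in zip(votes, labels) if v == y)
    acc = (agree + smoothing) / (len(labels) + 2 * smoothing)  # smoothed accuracy
    return math.log(acc / (1 - acc))  # negative when the judge is below chance

def panel_probability(votes, weights):
    """P(label = 1) given binary votes, under the naive-Bayes assumptions above."""
    logit = sum(w * (1 if v == 1 else -1) for v, w in zip(votes, weights))
    return 1 / (1 + math.exp(-logit))

def nll(probs, labels):
    """Mean negative log-likelihood of the true labels."""
    return -sum(math.log(p if y == 1 else 1 - p)
                for p, y in zip(probs, labels)) / len(labels)

# Toy calibration set (illustrative data, not from the paper).
labels  = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
judge_a = [0, 1, 1, 0, 1, 0, 1, 0, 1, 0]  # agrees with the label on 8 of 10 items
judge_b = [0, 1, 1, 0, 0, 1, 0, 1, 0, 1]  # agrees on only 2 of 10 (below chance)

w_a = judge_weight(judge_a, labels)
w_b = judge_weight(judge_b, labels)

# "Curate": keep only the most accurate judge (top-1 by accuracy).
probs_top1 = [panel_probability([a], [w_a]) for a in judge_a]
# "Calibrate": keep both judges and let the learned weights handle the bias.
probs_full = [panel_probability([a, b], [w_a, w_b])
              for a, b in zip(judge_a, judge_b)]

nll_top1 = nll(probs_top1, labels)
nll_full = nll(probs_full, labels)
print(f"w_a={w_a:.3f}  w_b={w_b:.3f}")
print(f"NLL top-1={nll_top1:.3f}  NLL full panel={nll_full:.3f}")
```

In this toy setup the below-chance judge receives a negative weight, so its votes are effectively inverted and still carry information. The full panel reaches a lower NLL than the single best judge because the two judges err on different items; if their errors coincided, this independence-based weighting would double-count and could do worse.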
Abstract
Multi-judge evaluation is increasingly used to assess LLMs and reward models, and the prevailing heuristic is to curate: keep the most accurate judges and discard weaker ones. We show that this heuristic can reverse when the target is not point accuracy, but calibrated probabilistic evaluation from a labeled calibration set. Holding the aggregation and calibration procedures fixed, we compare accuracy-ranked top-k judge selection with using the full judge panel. Across four labeled pairwise-evaluation benchmarks spanning LLM-as-judge and reward-model settings, the calibrated full panel consistently outperforms accuracy-based selection. On RewardBench2, retaining all judges achieves a lower negative log-likelihood (NLL) than top-5 selection and halves the calibration error. This advantage persists after judge-family deduplication and against stronger same-pipeline…
