On Cost-Effective LLM-as-a-Judge Improvement Techniques
Ryan Lail, Luke Markham

TL;DR
This paper empirically investigates four techniques (ensemble scoring, task-specific criteria injection, calibration context, and adaptive model escalation) for improving the accuracy and cost-effectiveness of LLM-based judges in RLHF and benchmarking.
Contribution
It introduces and evaluates simple, cost-effective methods for enhancing LLM judge reliability, demonstrating significant accuracy gains and generalization across models.
Findings
Ensembling and criteria injection together reach up to 85.8% accuracy on RewardBench 2, a 13.5-percentage-point improvement.
Small models benefit disproportionately from ensembling, reducing costs for high-accuracy judging.
The techniques generalize across models from different providers, such as OpenAI GPT and Anthropic Claude.
Abstract
Using a language model to score or rank candidate responses has become a scalable alternative to human evaluation in reinforcement learning from human feedback (RLHF) pipelines, benchmarking, and application-layer evaluations. However, output reliability depends heavily on the prompting and aggregation strategy. We present an empirical investigation of four drop-in techniques -- ensemble scoring, task-specific criteria injection, calibration context, and adaptive model escalation -- for improving LLM judge accuracy on RewardBench 2, with a unifying lens of noise control on the stochastic judge: ensembling as Monte Carlo averaging over per-call noise, criteria injection as between-response discrimination sharpening, and per-response score variance as an uncertainty signal. Ensemble scoring and task-specific criteria injection (the latter virtually cost-free) together reach up to 85.8%…
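To make the noise-control framing concrete, the following is a minimal Python sketch of ensemble scoring with variance-triggered escalation. It is an illustration under stated assumptions, not the authors' implementation: the names `ensemble_score` and `judge_with_escalation`, the ensemble size `k`, and the `var_threshold` value are all hypothetical, and a judge is modelled as any callable that returns a numeric score for a (prompt, response) pair.

```python
import statistics
from typing import Callable, Tuple

# Hypothetical interface: a judge maps (prompt, response) to a numeric score.
# In practice this would wrap a sampled LLM call; here it is just a callable.
Judge = Callable[[str, str], float]


def ensemble_score(judge: Judge, prompt: str, response: str, k: int = 5) -> Tuple[float, float]:
    """Call the judge k times and return (mean score, score variance).

    The mean is a Monte Carlo average over per-call noise; the variance
    serves as a rough per-response uncertainty signal.
    """
    scores = [judge(prompt, response) for _ in range(k)]
    mean = statistics.fmean(scores)
    variance = statistics.pvariance(scores)
    return mean, variance


def judge_with_escalation(
    small_judge: Judge,
    large_judge: Judge,
    prompt: str,
    response: str,
    k: int = 5,
    var_threshold: float = 1.0,  # assumed value; would need tuning per task
) -> Tuple[float, float, bool]:
    """Score with the small judge first; escalate only when it is uncertain.

    Returns (mean score, variance, escalated flag).
    """
    mean, variance = ensemble_score(small_judge, prompt, response, k)
    if variance > var_threshold:
        # The small judge disagrees with itself, so pay for the larger model.
        mean, variance = ensemble_score(large_judge, prompt, response, k)
        return mean, variance, True
    return mean, variance, False
```

In this sketch the larger model is only invoked for responses on which the small model's repeated scores disagree, which is one plausible way to trade cost against accuracy; the threshold and ensemble size are placeholders rather than values reported in the paper.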
