Towards Provably Unbiased LLM Judges via Bias-Bounded Evaluation
Benjamin Feuer, Lucas Rosenblatt, Oussama Elachqar

TL;DR
This paper introduces average bias-boundedness (A-BB), a formal framework that guarantees a reduction in the harm caused by any measurable bias in an LLM judge, with the aim of more reliable and fair AI evaluations.
Contribution
The paper proposes the A-BB framework to enforce bias reduction guarantees in LLM judges, addressing a key challenge in autonomous AI feedback systems.
Findings
Achieved bias-bounded guarantees on Arena-Hard-Auto with four LLM judges while keeping rankings highly correlated with the originals.
Retained 61-99% ranking correlation across various bias settings.
Most judge-bias combinations exceeded 80% correlation.
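To make "ranking correlation" concrete, here is a minimal sketch of how agreement between an original ranking and a bias-adjusted ranking could be measured. It assumes Spearman rank correlation with no tied scores; the summary above does not say which coefficient the paper uses, and the scores below are hypothetical, not results from the paper.

```python
def ranks(scores):
    """Return 1-based ranks (1 = highest score). Assumes no tied scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    result = [0] * len(scores)
    for position, idx in enumerate(order, start=1):
        result[idx] = position
    return result


def spearman(a, b):
    """Spearman rank correlation for two equal-length score lists without ties."""
    n = len(a)
    ra, rb = ranks(a), ranks(b)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n**2 - 1))


# Hypothetical judge scores for five models (illustrative only).
original = [0.91, 0.85, 0.78, 0.60, 0.55]  # scores from the unmodified judge
adjusted = [0.88, 0.86, 0.70, 0.72, 0.50]  # scores after a bias-bounding step

rho = spearman(original, adjusted)
print(f"rank correlation: {rho:.2f}")
```

In this toy case only the third and fourth models swap places, so the correlation stays high; a value of 1.0 would mean the adjusted ranking is identical to the original.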
Abstract
As AI models progress beyond simple chatbots into more complex workflows, we draw ever closer to the event horizon beyond which AI systems will be utilized in autonomous, self-maintaining feedback loops. Any autonomous AI system will depend on automated, verifiable rewards and feedback; in settings where ground truth is sparse or non-deterministic, one practical source of such rewards is an LLM-as-a-Judge. Although LLM judges continue to improve, the literature has yet to introduce systems capable of enforcing standards with strong guarantees, particularly when bias vectors are unknown or adversarially discovered. To remedy this issue, we propose average bias-boundedness (A-BB), an algorithmic framework which formally guarantees reductions of harm/impact as a result of any measurable bias in an LLM judge. Evaluating on Arena-Hard-Auto with four LLM judges, we achieve (tau=0.5,…
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Ethics and Social Impacts of AI
