Judge Reliability Harness: Stress Testing the Reliability of LLM Judges
Sunishchal Dev, Andrew Sloan, Joshua Kavner, Nicholas Kong, Morgan Sandler

TL;DR
The paper introduces the Judge Reliability Harness, an open-source tool for testing the consistency and robustness of LLM judges across benchmarks and perturbation types. In an evaluation of four judges on four benchmarks, reliability varied meaningfully and no judge was uniformly reliable.
Contribution
An open-source library that, given a benchmark dataset and an LLM judge configuration, generates reliability tests covering both binary judgment accuracy and ordinal grading performance for free-response and agentic task formats.
Findings
Judge performance varies meaningfully across models and perturbation types.
None of the four judges evaluated was uniformly reliable across the four benchmarks (safety, persuasion, misuse, and agentic behavior).
Simple text modifications can cause judgment inconsistencies.
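The last finding can be illustrated with a minimal sketch of a perturbation consistency check. This is not the harness's API: `flip_rate`, `toy_judge`, and `shout` are hypothetical names introduced here, and the keyword-matching judge is a deliberately brittle stand-in for a real LLM judge. The idea is to apply a modification that should not change the correct verdict and measure how often the judge's verdict changes anyway.

```python
from typing import Callable, Sequence


def flip_rate(
    judge: Callable[[str], bool],
    responses: Sequence[str],
    perturb: Callable[[str], str],
) -> float:
    """Fraction of responses whose binary verdict changes after perturbation.

    Assumes `perturb` preserves the meaning of the response, so any
    change in verdict counts as an inconsistency of the judge.
    """
    if not responses:
        raise ValueError("responses must be non-empty")
    flips = sum(judge(r) != judge(perturb(r)) for r in responses)
    return flips / len(responses)


def toy_judge(response: str) -> bool:
    # Hypothetical stand-in for an LLM judge: marks a response as a
    # refusal only if it contains the lowercase word "refuse".
    return "refuse" in response


def shout(response: str) -> str:
    # Hypothetical label-preserving perturbation: uppercase the text.
    return response.upper()


responses = [
    "I refuse to help with that.",
    "Sure, here is how.",
    "I must refuse.",
    "Here you go.",
]

rate = flip_rate(toy_judge, responses, shout)
print(f"flip rate under uppercasing: {rate:.2f}")
```

With a real judge, `toy_judge` would wrap a model call and `responses` would come from a benchmark dataset; a flip rate well above zero on a meaning-preserving perturbation suggests the judge is sensitive to surface form rather than content.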
Abstract
We present the Judge Reliability Harness, an open-source library for constructing validation suites that test the reliability of LLM judges. As LLM-based scoring is widely deployed in AI benchmarks, more tooling is needed to efficiently assess the reliability of these methods. Given a benchmark dataset and an LLM judge configuration, the harness generates reliability tests that evaluate both binary judgment accuracy and ordinal grading performance for free-response and agentic task formats. We evaluate four state-of-the-art judges across four benchmarks spanning safety, persuasion, misuse, and agentic behavior, and find meaningful variation in performance across models and perturbation types, highlighting opportunities to improve the robustness of LLM judges. No judge that we evaluated is uniformly reliable across benchmarks using our harness. For example, our preliminary experiments on…
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Topic Modeling · Adversarial Robustness in Machine Learning
