Judge Reliability Harness: Stress Testing the Reliability of LLM Judges

Sunishchal Dev; Andrew Sloan; Joshua Kavner; Nicholas Kong; Morgan Sandler

arXiv:2603.05399·cs.AI·March 6, 2026

Open Access

TL;DR

The paper introduces the Judge Reliability Harness, an open-source tool for testing the consistency and robustness of LLM judges across benchmarks and perturbation types. In the authors' evaluation, judge performance varied meaningfully, and no judge was uniformly reliable.

Contribution

An open-source framework that, given a benchmark dataset and an LLM judge configuration, generates reliability tests covering both binary judgment accuracy and ordinal grading performance for free-response and agentic task formats.

Findings

01

Judge performance varies meaningfully across models and perturbation types.

02

No evaluated judge was uniformly reliable across all four benchmarks.

03

Simple text modifications can cause judgment inconsistencies.
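
The third finding can be illustrated with a small consistency check. The sketch below is not the harness's API: `toy_judge`, `flip_rate`, and both perturbation functions are hypothetical names invented here, and the toy judge is a deliberately brittle string matcher standing in for an LLM call. It only shows the general idea of applying a label-preserving edit to a response and measuring how often the verdict changes.

```python
from typing import Callable, Dict, List

# Hypothetical judge type: maps a response string to a binary verdict.
Judge = Callable[[str], bool]


def uppercase(text: str) -> str:
    """Label-preserving edit: change letter case only."""
    return text.upper()


def trailing_whitespace(text: str) -> str:
    """Label-preserving edit: append two spaces."""
    return text + "  "


def flip_rate(judge: Judge, responses: List[str],
              perturb: Callable[[str], str]) -> float:
    """Fraction of responses whose verdict changes after perturbation."""
    if not responses:
        return 0.0
    flips = sum(judge(r) != judge(perturb(r)) for r in responses)
    return flips / len(responses)


def toy_judge(text: str) -> bool:
    """Stand-in for an LLM judge (assumption): flags refusals by
    matching an exact lowercase phrase, so it is case-sensitive."""
    return "i cannot help" in text


responses = [
    "i cannot help with that request.",
    "Sure, here is the answer.",
    "Sorry, i cannot help.",
    "Here you go.",
]

report: Dict[str, float] = {
    "uppercase": flip_rate(toy_judge, responses, uppercase),
    "trailing_whitespace": flip_rate(toy_judge, responses, trailing_whitespace),
}
print(report)
```

In this toy setup the case change flips the verdict on the two refusal responses, while trailing whitespace flips none. A real check would replace `toy_judge` with a call to the judge model and would likely repeat each query to separate perturbation effects from sampling noise.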

Abstract

We present the Judge Reliability Harness, an open-source library for constructing validation suites that test the reliability of LLM judges. As LLM-based scoring is widely deployed in AI benchmarks, more tooling is needed to efficiently assess the reliability of these methods. Given a benchmark dataset and an LLM judge configuration, the harness generates reliability tests that evaluate both binary judgment accuracy and ordinal grading performance for free-response and agentic task formats. We evaluate four state-of-the-art judges across four benchmarks spanning safety, persuasion, misuse, and agentic behavior, and find meaningful variation in performance across models and perturbation types, highlighting opportunities to improve the robustness of LLM judges. No judge that we evaluated is uniformly reliable across benchmarks using our harness. For example, our preliminary experiments on…

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Topic Modeling · Adversarial Robustness in Machine Learning