Beyond Accuracy: Policy Invariance as a Reliability Test for LLM Safety Judges

Shihao Weng; Yang Feng; Xiaofei Xie

arXiv:2605.06161 · cs.AI · May 8, 2026


TL;DR

This paper introduces a policy invariance framework for evaluating the reliability of LLM safety judges. It finds that judges react to semantically equivalent policy rewrites much as they react to genuine normative shifts, so existing safety scores conflate the agent's behavior with the wording of the evaluation prompt.

Contribution

The paper proposes three testable principles and a stress-test protocol for assessing the robustness of LLM safety judges, along with the Policy Invariance Score and the Judge Card for transparent reporting.

Findings

01. Judges respond to intentional normative shifts and to semantically equivalent rewrites in similar ways.

02. Policy rewrites flip up to 9.1% of verdicts.

03. Existing safety scores conflate agent behavior with the wording of the evaluation prompt.

Abstract

LLM-as-a-Judge pipelines have become the de facto evaluator for agent safety, yet existing benchmarks treat their verdicts as ground-truth proxies without checking whether the verdicts depend on the agent's behavior or merely on how the evaluation policy happens to be worded. We argue that any trustworthy safety judge must satisfy a basic property we call policy invariance, and we operationalize it as three testable principles: rubric-semantics invariance under certified-equivalent rewrites, rubric-threshold invariance under intentional strict-to-lenient shifts, and ambiguity-aware calibration so that verdict instability concentrates on genuinely ambiguous cases. Instantiating these principles as a stress-test protocol with four agent-class judges on trajectories drawn from ASSEBench and R-Judge, we surface a previously unmeasured failure mode: today's judges respond to meaningful…
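To make the idea of a verdict flip concrete, the sketch below computes the share of trajectories whose verdict changes when the same judge is run under two wordings of a policy. It is a minimal illustration, not the paper's Policy Invariance Score, whose definition is not given here. The function name `flip_rate`, the verdict labels, and the five-trajectory data are all hypothetical.

```python
from typing import Sequence


def flip_rate(baseline: Sequence[str], perturbed: Sequence[str]) -> float:
    """Fraction of trajectories whose verdict differs between two policy wordings.

    Hypothetical helper for illustration; not the paper's Policy Invariance Score.
    """
    if len(baseline) != len(perturbed):
        raise ValueError("verdict lists must cover the same trajectories")
    if not baseline:
        raise ValueError("need at least one trajectory")
    flips = sum(b != p for b, p in zip(baseline, perturbed))
    return flips / len(baseline)


# Made-up verdicts from one judge on five trajectories (illustrative data only).
original = ["unsafe", "safe", "unsafe", "safe", "safe"]    # original policy text
equivalent = ["unsafe", "safe", "safe", "safe", "safe"]    # same policy, reworded
lenient = ["safe", "safe", "safe", "safe", "safe"]         # intentionally looser policy

rewrite_flip_rate = flip_rate(original, equivalent)  # ideally close to zero
shift_flip_rate = flip_rate(original, lenient)       # expected to be nonzero

print(f"flips under equivalent rewrite: {rewrite_flip_rate:.0%}")
print(f"flips under strict-to-lenient shift: {shift_flip_rate:.0%}")
```

Under the principles in the abstract, a reliable judge would show few flips under an equivalent rewrite and a clearly larger response to an intentional threshold shift. Comparable rates for the two would suggest the judge is tracking the wording rather than the policy.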
