Beyond Accuracy: Policy Invariance as a Reliability Test for LLM Safety Judges
Shihao Weng, Yang Feng, Xiaofei Xie

TL;DR
This paper introduces a policy invariance framework for evaluating the reliability of LLM safety judges. It shows that judges' verdicts often change under both normative shifts and structural rewrites of the evaluation policy, so existing safety scores conflate agent behavior with the wording of the evaluator prompt.
Contribution
The paper proposes three testable principles and a stress-test protocol for assessing the robustness of LLM safety judges, and introduces the Policy Invariance Score and the Judge Card to make judge reliability transparent.
Findings
Judges respond similarly to normative shifts and to meaning-preserving rewrites of the policy.
Policy rewrites flip up to 9.1% of verdicts.
Existing safety scores conflate agent behavior with evaluation prompts.
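To make the notion of a "verdict flip" concrete, here is a minimal sketch of how a flip rate between two policy wordings could be computed. It is an illustration only, not the paper's Policy Invariance Score, whose definition is not given in this summary. The function name `flip_rate`, the verdict labels, and the toy data are all assumptions.

```python
from typing import Sequence


def flip_rate(baseline: Sequence[str], rewritten: Sequence[str]) -> float:
    """Fraction of trajectories whose verdict differs between two policy wordings.

    Hypothetical helper: `baseline[i]` and `rewritten[i]` are the judge's
    verdicts on the same trajectory under the original and rewritten policy.
    """
    if len(baseline) != len(rewritten):
        raise ValueError("both verdict lists must cover the same trajectories")
    if not baseline:
        raise ValueError("need at least one trajectory")
    flips = sum(a != b for a, b in zip(baseline, rewritten))
    return flips / len(baseline)


# Toy verdicts (invented for illustration, not taken from the paper).
original_policy = ["safe", "unsafe", "safe", "unsafe", "safe"]
equivalent_rewrite = ["safe", "unsafe", "unsafe", "unsafe", "safe"]  # same meaning, new wording
lenient_shift = ["safe", "safe", "safe", "unsafe", "safe"]  # intentionally more lenient

print(flip_rate(original_policy, equivalent_rewrite))
print(flip_rate(original_policy, lenient_shift))
```

Under this reading, a reliable judge would show a flip rate near zero for the equivalent rewrite and a clearly larger one for the intentional threshold shift. Similar rates in both cases would correspond to the failure mode the findings describe.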
Abstract
LLM-as-a-Judge pipelines have become the de facto evaluator for agent safety, yet existing benchmarks treat their verdicts as ground-truth proxies without checking whether the verdicts depend on the agent's behavior or merely on how the evaluation policy happens to be worded. We argue that any trustworthy safety judge must satisfy a basic property we call policy invariance, and we operationalize it as three testable principles: rubric-semantics invariance under certified-equivalent rewrites, rubric-threshold invariance under intentional strict-to-lenient shifts, and ambiguity-aware calibration so that verdict instability concentrates on genuinely ambiguous cases. Instantiating these principles as a stress-test protocol with four agent-class judges on trajectories drawn from ASSEBench and R-Judge, we surface a previously unmeasured failure mode: today's judges respond to meaningful…
