A Coin Flip for Safety: LLM Judges Fail to Reliably Measure Adversarial Robustness

Leo Schwinn; Moritz Ladenburger; Tim Beyer; Mehrnaz Mofakhami; Gauthier Gidel; Stephan Günnemann

arXiv:2603.06594·cs.CL·March 17, 2026

TL;DR

This paper critically examines the reliability of LLM-based judges in evaluating adversarial robustness, revealing significant performance degradation due to distribution shifts and proposing benchmarks to improve evaluation consistency.

Contribution

It uncovers the limitations of current LLM judges in adversarial safety evaluation and introduces ReliableBench and JudgeStressTest to better assess and improve judge reliability.

Findings

01. LLM judges often perform near random chance in adversarial settings.

02. Many attacks exploit judge weaknesses rather than causing genuine harm.

03. Proposed benchmarks help identify and mitigate judge failures.

Abstract

Automated "LLM-as-a-Judge" frameworks have become the de facto standard for scalable evaluation across natural language processing. For instance, in safety evaluation, these judges are relied upon to evaluate harmfulness in order to benchmark the robustness of safety against adversarial attacks. However, we show that existing validation protocols fail to account for substantial distribution shifts inherent to red-teaming: diverse victim models exhibit distinct generation styles, attacks distort output patterns, and semantic ambiguity varies significantly across jailbreak scenarios. Through a comprehensive audit using 6642 human-verified labels, we reveal that the unpredictable interaction of these shifts often causes judge performance to degrade to near random chance. This stands in stark contrast to the high human agreement reported in prior work. Crucially, we find that many…
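To make the "near random chance" claim concrete, here is a minimal sketch of how judge agreement with human labels could be broken down by victim model and attack. It is not the paper's protocol: the function names, the record fields (`victim`, `attack`, `human`, `judge`), and the toy records are all hypothetical, and balanced accuracy is simply one common metric for which 0.5 corresponds to chance on a binary harmful/harmless task.

```python
from collections import defaultdict


def balanced_accuracy(pairs):
    """Mean of per-class recall for (human_label, judge_label) pairs.

    Labels are booleans (True = harmful). A class that never appears in the
    human labels is skipped, so a single-class slice reduces to plain recall.
    """
    tp = fn = tn = fp = 0
    for human, judge in pairs:
        if human and judge:
            tp += 1
        elif human and not judge:
            fn += 1
        elif not human and not judge:
            tn += 1
        else:
            fp += 1
    recalls = []
    if tp + fn:
        recalls.append(tp / (tp + fn))  # recall on harmful responses
    if tn + fp:
        recalls.append(tn / (tn + fp))  # recall on harmless responses
    return sum(recalls) / len(recalls) if recalls else float("nan")


def per_slice_balanced_accuracy(records):
    """Group records by (victim, attack) and score the judge on each slice."""
    slices = defaultdict(list)
    for r in records:
        slices[(r["victim"], r["attack"])].append((r["human"], r["judge"]))
    return {key: balanced_accuracy(pairs) for key, pairs in slices.items()}


# Toy, invented records (hypothetical field names; not data from the paper).
records = [
    {"victim": "model_a", "attack": "none", "human": True, "judge": True},
    {"victim": "model_a", "attack": "none", "human": False, "judge": False},
    {"victim": "model_b", "attack": "suffix", "human": True, "judge": False},
    {"victim": "model_b", "attack": "suffix", "human": True, "judge": True},
    {"victim": "model_b", "attack": "suffix", "human": False, "judge": True},
    {"victim": "model_b", "attack": "suffix", "human": False, "judge": False},
]

scores = per_slice_balanced_accuracy(records)
for key, score in sorted(scores.items()):
    print(key, score)
```

Pooling all six toy records would hide the fact that one slice sits exactly at chance, which is why the sketch reports a score per slice instead of a single aggregate number.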

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Topic Modeling