How Sensitive Are Safety Benchmarks to Judge Configuration Choices?
Xinran Zhang

TL;DR
This paper examines how LLM judge configuration (prompt wording and model choice) affects safety benchmark results, and finds that measured outcomes are highly sensitive to these choices.
Contribution
It systematically analyzes the impact of judge prompt wording and model choice on safety benchmark measurements, highlighting their role as major sources of variance.
Findings
Prompt wording alone, with the judge model held fixed, shifts measured harmful-response rates by up to 24.2 percentage points.
Within-condition surface rewording causes swings of up to 20.1 percentage points.
Model safety rankings are moderately unstable, with a mean Kendall tau of 0.89.
Abstract
Safety benchmarks such as HarmBench rely on LLM judges to classify model responses as harmful or safe, yet the judge configuration, namely the combination of judge model and judge prompt, is typically treated as a fixed implementation detail. We show this assumption is problematic. Using a 2 x 2 x 3 factorial design, we construct 12 judge prompt variants along two axes, evaluation structure and instruction framing, and apply them using a single judge model, Claude Sonnet 4-6, producing 28,812 judgments over six target models and 400 HarmBench behaviors. We find that prompt wording alone, holding the judge model fixed, shifts measured harmful-response rates by up to 24.2 percentage points, with even within-condition surface rewording causing swings of up to 20.1 percentage points. Model safety rankings are moderately unstable, with mean Kendall tau = 0.89, and category-level sensitivity…
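The two headline metrics in the abstract, the percentage-point spread in harmful-response rates and the mean Kendall tau over model rankings, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the variant names, the model names, and the rates are invented and are not the paper's data, and averaging tau over all pairs of prompt variants is one plausible reading of "mean Kendall tau", not necessarily the paper's exact procedure.

```python
from itertools import combinations

# Hypothetical harmful-response rates in percent (NOT values from the paper).
# rates[variant][model] = share of responses the judge labeled harmful.
rates = {
    "variant_a": {"m1": 10.0, "m2": 20.0, "m3": 30.0, "m4": 40.0},
    "variant_b": {"m1": 15.0, "m2": 12.0, "m3": 35.0, "m4": 50.0},
    "variant_c": {"m1": 30.0, "m2": 25.0, "m3": 45.0, "m4": 60.0},
}


def spread_pp(rates, model):
    """Max minus min harmful rate for one model across prompt variants,
    in percentage points."""
    values = [per_model[model] for per_model in rates.values()]
    return max(values) - min(values)


def kendall_tau(x, y):
    """Kendall tau-a between two score lists; assumes no tied scores."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


def mean_pairwise_tau(rates):
    """Average Kendall tau over every pair of prompt variants."""
    models = sorted(next(iter(rates.values())))
    taus = [
        kendall_tau([rates[a][m] for m in models], [rates[b][m] for m in models])
        for a, b in combinations(rates, 2)
    ]
    return sum(taus) / len(taus)


if __name__ == "__main__":
    for model in ("m1", "m2", "m3", "m4"):
        print(model, "spread (pp):", spread_pp(rates, model))
    print("mean pairwise Kendall tau:", round(mean_pairwise_tau(rates), 3))
```

In this toy data a single rank swap between two models under one variant is enough to pull the mean tau below 1, even though every variant agrees on which models are least safe. This is why a rate spread and a rank correlation are worth reporting together.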
