Single-Configuration Attack Success Rate Is Not Enough: Jailbreak Evaluations Should Report Distributional Attack Success
Carsten Maple, Abhishek Kumar, Riya Tapwal

TL;DR
This paper highlights the inadequacy of reporting only single-configuration attack success rates in jailbreak evaluations, proposing new measures to better characterize attack variability and coverage.
Contribution
It introduces the Variant Sensitivity Measure (VSM) and Union Coverage (UC) to improve the assessment of parameterized jailbreak attack effectiveness.
Findings
VSM quantifies how far the best-case ASR deviates from the mean ASR across variants.
UC measures the share of prompts that receive an unsafe response under at least one variant.
Empirical results show significant variation in attack success across configurations.
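The two measures can be illustrated with a minimal sketch. The summary above does not give the paper's exact formulas, so the definitions here are assumptions: VSM is taken as best-case ASR minus mean ASR over variants, and UC as the fraction of prompts that at least one variant jailbreaks. The function names and the toy `outcomes` matrix are hypothetical and not taken from the paper.

```python
from statistics import mean

def per_variant_asr(outcomes):
    """ASR of each variant; outcomes[v][p] is True if variant v
    elicited an unsafe response on prompt p."""
    return [sum(row) / len(row) for row in outcomes]

def variant_sensitivity(outcomes):
    """Assumed VSM: best-case ASR minus mean ASR across variants."""
    asrs = per_variant_asr(outcomes)
    return max(asrs) - mean(asrs)

def union_coverage(outcomes):
    """Assumed UC: fraction of prompts jailbroken by at least one variant."""
    n_prompts = len(outcomes[0])
    covered = sum(any(row[p] for row in outcomes) for p in range(n_prompts))
    return covered / n_prompts

# Toy data (illustrative only): 3 variants evaluated on 4 prompts.
outcomes = [
    [True,  True,  False, False],  # variant A succeeds on prompts 0 and 1
    [False, False, True,  False],  # variant B succeeds on prompt 2
    [False, False, False, False],  # variant C never succeeds
]

print(per_variant_asr(outcomes))
print(variant_sensitivity(outcomes))
print(union_coverage(outcomes))
```

In this toy case the best single variant reaches only half the prompts, while the variants together reach three of the four, which is the kind of gap a single-configuration ASR would hide.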
Abstract
Many jailbreak attack papers report attack success rates (ASR) for only a limited number of parameter settings, even though many other combinations of settings could be used. Further, when new jailbreak papers are released, they often benchmark their results against single configurations of existing attacks. This position paper argues that such practices are fundamentally insufficient both for characterising the threat posed by parameterised jailbreak attacks and for comparing attacks. Most jailbreak attacks expose multiple internal parameters (system prompt templates, conversation rounds, cipher dispersion, teaching shots), and ASR varies substantially across them. Reporting only the best-case configuration discards two pieces of information that defenders genuinely need: how typical that performance is across the variant space, and how much of the attack surface is missed by…
