Adversarial Humanities Benchmark: Results on Stylistic Robustness in Frontier Model Safety
Marcello Galisai, Susanna Cifani, Francesco Giarrusso, Piercosma Bisconti, Matteo Prandi, Federico Pierucci, Federico Sartore, Daniele Nardi

TL;DR
The paper introduces the Adversarial Humanities Benchmark to evaluate the robustness of model safety refusals against stylistic obfuscation, revealing significant weaknesses in current safety techniques across frontier models.
Contribution
It presents a new benchmark that tests safety robustness against humanities-style transformations, highlighting gaps in current safety methods.
Findings
The original, untransformed attacks record a 3.84% attack success rate (ASR).
Transformed methods achieve ASRs ranging from 36.8% to 65.0%.
The overall ASR is 55.75% across 31 frontier models.
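As a rough illustration of how success rates like these are typically tallied, the sketch below computes a per-method and a pooled attack success rate from judged outcomes. The function name, the record layout, and the toy data are all hypothetical and are not taken from the paper; in particular, pooling every attempt into one overall figure is an assumption, and the benchmark may aggregate differently (for example, by averaging over models or by excluding the original prompts).

```python
from collections import defaultdict


def attack_success_rates(records):
    """Return (per-method ASR, pooled ASR), both in percent.

    `records` is an iterable of (method, success) pairs, where `success`
    is True when a judge marked the model response as a successful attack.
    This layout is an assumption made for illustration only.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for method, success in records:
        totals[method] += 1
        hits[method] += int(success)

    per_method = {m: 100.0 * hits[m] / totals[m] for m in totals}
    # Pooled (micro-averaged) rate over every attempt, regardless of method.
    overall = 100.0 * sum(hits.values()) / sum(totals.values())
    return per_method, overall


# Toy data, NOT the paper's results: 25 original prompts with 1 success,
# and two hypothetical transformations with 10 prompts each.
records = (
    [("original", True)] + [("original", False)] * 24
    + [("method_a", True)] * 4 + [("method_a", False)] * 6
    + [("method_b", True)] * 6 + [("method_b", False)] * 4
)

per_method, overall = attack_success_rates(records)
for method, asr in per_method.items():
    print(f"{method}: {asr:.2f}%")
print(f"overall: {overall:.2f}%")
```

Note that a pooled rate weights each method by its number of attempts, so it need not fall midway between the per-method values.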
Abstract
The Adversarial Humanities Benchmark (AHB) evaluates whether model safety refusals survive a shift away from familiar harmful prompt forms. Starting from harmful tasks drawn from MLCommons AILuminate, the benchmark rewrites the same objectives through humanities-style transformations while preserving intent. This extends the literature on Adversarial Poetry and Adversarial Tales from single jailbreak operators to a broader benchmark family of stylistic obfuscation and goal concealment. In the benchmark results reported here, the original attacks record a 3.84% attack success rate (ASR), while transformed methods range from 36.8% to 65.0%, yielding a 55.75% overall ASR across 31 frontier models. Under a European Union AI Act Code-of-Practice-inspired systemic-risk lens, chemical, biological, radiological and nuclear (CBRN) is the highest bucket. Taken together, this lack of stylistic robustness…
