Adversarial Humanities Benchmark: Results on Stylistic Robustness in Frontier Model Safety

Marcello Galisai; Susanna Cifani; Francesco Giarrusso; Piercosma Bisconti; Matteo Prandi; Federico Pierucci; Federico Sartore; Daniele Nardi

arXiv:2604.18487 · cs.CL · April 21, 2026

TL;DR

The paper introduces the Adversarial Humanities Benchmark to evaluate the robustness of model safety refusals against stylistic obfuscation, revealing significant weaknesses in current safety techniques across frontier models.

Contribution

The paper presents a new benchmark that tests whether safety refusals hold up when harmful prompts are rewritten through humanities-style transformations, exposing gaps in current safety methods.

Findings

01

The original attacks record a 3.84% attack success rate (ASR).

02

Transformed methods achieve attack success rates ranging from 36.8% to 65.0%.

03

The overall attack success rate is 55.75% across 31 frontier models.

Abstract

The Adversarial Humanities Benchmark (AHB) evaluates whether model safety refusals survive a shift away from familiar harmful prompt forms. Starting from harmful tasks drawn from MLCommons AILuminate, the benchmark rewrites the same objectives through humanities-style transformations while preserving intent. This extends literature on Adversarial Poetry and Adversarial Tales from single jailbreak operators to a broader benchmark family of stylistic obfuscation and goal concealment. In the benchmark results reported here, the original attacks record 3.84% attack success rate (ASR), while transformed methods range from 36.8% to 65.0%, yielding 55.75% overall ASR across 31 frontier models. Under a European Union AI Act Code-of-Practice-inspired systemic-risk lens, Chemical, biological, radiological and nuclear (CBRN) is the highest bucket. Taken together, this lack of stylistic robustness…
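As a rough illustration of how the attack success rate (ASR) figures quoted above are typically computed, the following minimal sketch tallies per-method and pooled ASR from a list of judged outcomes. It is not the authors' evaluation code: the function name `attack_success_rate`, the record layout, the method labels, and the toy counts are all assumptions made for this example, and the numbers it produces are unrelated to the paper's results.

```python
from collections import defaultdict


def attack_success_rate(records):
    """Return (per_method, overall) ASR as percentages.

    Each record is a (method, succeeded) pair, where `succeeded` is True
    when a judge labeled the model's response as a successful attack.
    This record layout is hypothetical, chosen only for illustration.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for method, succeeded in records:
        totals[method] += 1
        hits[method] += int(succeeded)
    per_method = {m: 100.0 * hits[m] / totals[m] for m in totals}
    # Pooled ASR: all successes divided by all attempts, across methods.
    overall = 100.0 * sum(hits.values()) / sum(totals.values())
    return per_method, overall


# Toy data (made up): 20 attempts per method.
records = (
    [("original", True)] * 1 + [("original", False)] * 19
    + [("transformed", True)] * 12 + [("transformed", False)] * 8
)

per_method, overall = attack_success_rate(records)
print(per_method)  # {'original': 5.0, 'transformed': 60.0}
print(overall)     # 32.5
```

Note that a pooled figure weights each method by its number of attempts, so it need not equal the simple average of the per-method rates when attempt counts differ.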

Code & Models

Datasets

icaro-lab/ahb
dataset · 86 dl
