From Adversarial Poetry to Adversarial Tales: An Interpretability Research Agenda
Piercosma Bisconti, Marcello Galisai, Matteo Prandi, Federico Pierucci, Olga Sorokoletova, Francesco Giarrusso, Vincenzo Suriani, Marcantonio Bracale Syrnikov, Daniele Nardi

TL;DR
This paper introduces Adversarial Tales, a narrative-based jailbreak technique exposing vulnerabilities in LLM safety mechanisms, and advocates for interpretability research to understand and mitigate such structurally grounded attacks.
Contribution
The paper presents a novel narrative-based attack method and proposes an interpretability research agenda for addressing the LLM safety vulnerabilities it exposes.
Findings
Average attack success rate of 71.3% across 26 models
No model family proved reliably robust against the attack
Structural decomposition can induce models to interpret harmful content as legitimate narrative
Abstract
Safety mechanisms in LLMs remain vulnerable to attacks that reframe harmful requests through culturally coded structures. We introduce Adversarial Tales, a jailbreak technique that embeds harmful content within cyberpunk narratives and prompts models to perform functional analysis inspired by Vladimir Propp's morphology of folktales. By casting the task as structural decomposition, the attack induces models to reconstruct harmful procedures as legitimate narrative interpretation. Across 26 frontier models from nine providers, we observe an average attack success rate of 71.3%, with no model family proving reliably robust. Together with our prior work on Adversarial Poetry, these findings suggest that structurally grounded jailbreaks constitute a broad vulnerability class rather than isolated techniques. The space of culturally coded frames that can mediate harmful intent is vast, likely…
Taxonomy
Topics: Advanced Malware Detection Techniques · Adversarial Robustness in Machine Learning · Information and Cyber Security
