From Adversarial Poetry to Adversarial Tales: An Interpretability Research Agenda

Piercosma Bisconti; Marcello Galisai; Matteo Prandi; Federico Pierucci; Olga Sorokoletova; Francesco Giarrusso; Vincenzo Suriani; Marcantonio Bracale Syrnikov; Daniele Nardi

arXiv:2601.08837 · cs.CL · January 19, 2026

TL;DR

This paper introduces Adversarial Tales, a narrative-based jailbreak technique exposing vulnerabilities in LLM safety mechanisms, and advocates for interpretability research to understand and mitigate such structurally grounded attacks.

Contribution

The paper presents Adversarial Tales, a narrative-based jailbreak method, and proposes an interpretability research agenda for understanding and mitigating structurally grounded vulnerabilities in LLM safety mechanisms.

Findings

01. Average attack success rate of 71.3% across 26 models

02. No model family proved reliably robust against the attack

03. Structural decomposition can induce models to interpret harmful content as legitimate narrative

Abstract

Safety mechanisms in LLMs remain vulnerable to attacks that reframe harmful requests through culturally coded structures. We introduce Adversarial Tales, a jailbreak technique that embeds harmful content within cyberpunk narratives and prompts models to perform functional analysis inspired by Vladimir Propp's morphology of folktales. By casting the task as structural decomposition, the attack induces models to reconstruct harmful procedures as legitimate narrative interpretation. Across 26 frontier models from nine providers, we observe an average attack success rate of 71.3%, with no model family proving reliably robust. Together with our prior work on Adversarial Poetry, these findings suggest that structurally-grounded jailbreaks constitute a broad vulnerability class rather than isolated techniques. The space of culturally coded frames that can mediate harmful intent is vast, likely…

Taxonomy

Topics: Advanced Malware Detection Techniques · Adversarial Robustness in Machine Learning · Information and Cyber Security