Metaphor Is Not All Attention Needs
Olga Sorokoletova, Francesco Giarrusso, Giacomo De Luca, Piercosma Bisconti, Matteo Prandi, Federico Pierucci, Marcello Galisai, Vincenzo Suriani, Daniele Nardi

TL;DR
This paper investigates why literary jailbreaks bypass safety mechanisms in large language models and finds that style-induced shifts in prompt processing, rather than failures to recognize literary formatting, cause these vulnerabilities.
Contribution
It demonstrates that poetic prompts alter model processing in ways that evade safety mechanisms, highlighting the need for style-aware robustness in language models.
Findings
Models distinguish poetic from prose formats with high accuracy.
Poetry induces distinct processing patterns independent of safety labels.
Jailbreak success is linked to stylistic irregularities, not recognition failure.
Abstract
Large language models are increasingly deployed in safety-critical applications, where their ability to resist harmful instructions is essential. Although post-training aims to make models robust against many jailbreak strategies, recent evidence shows that stylistic reformulations, such as poetic transformation, can still bypass safety mechanisms with alarming effectiveness. This raises a central question: why do literary jailbreaks succeed? In this work, we investigate whether their effectiveness depends on specific poetic devices, on a failure to recognize literary formatting, or on deeper changes in how models process stylistically irregular prompts. We address this problem through an interpretability analysis of attention patterns. We perform input-level ablation studies to assess the contribution of individual poetic devices and of their combinations; construct an interpretable vector…
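The input-level ablation mentioned in the abstract can be pictured as enumerating every non-empty subset of poetic devices and applying each subset to a base prompt. The sketch below is only an illustration under stated assumptions: the device names (`line_breaks`, `archaic_opening`, `refrain`), the toy transformations, and the helper `ablation_variants` are hypothetical and are not taken from the paper.

```python
from itertools import combinations

# Hypothetical toy "devices". The paper's actual devices and
# transformations are not specified in this summary.
def add_line_breaks(text):
    """Break the prompt into short verse-like lines of four words."""
    words = text.split()
    return "\n".join(" ".join(words[i:i + 4]) for i in range(0, len(words), 4))

def add_archaic_opening(text):
    """Prepend an archaic apostrophe."""
    return "O, " + text

def add_refrain(text):
    """Append a fixed closing refrain."""
    return text + "\n(so it goes)"

DEVICES = {
    "line_breaks": add_line_breaks,
    "archaic_opening": add_archaic_opening,
    "refrain": add_refrain,
}

def ablation_variants(prompt, devices=DEVICES):
    """Return one prompt variant per non-empty subset of devices.

    Keys are tuples of device names; devices are applied in the
    order in which they appear in `devices`.
    """
    names = list(devices)
    variants = {}
    for k in range(1, len(names) + 1):
        for subset in combinations(names, k):
            text = prompt
            for name in subset:
                text = devices[name](text)
            variants[subset] = text
    return variants

if __name__ == "__main__":
    base = "describe the weather over the quiet harbor at dawn"
    for subset, text in ablation_variants(base).items():
        print(subset)
        print(text)
        print("---")
```

In an actual study, each variant would be sent to the model under test and its responses compared across subsets; that step is omitted here because it depends on the model and evaluation setup.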
