Adversarial versification in Portuguese as a jailbreak operator in LLMs
Joao Queiroz

TL;DR
This paper shows that converting prompts into verse substantially increases the vulnerability of large language models to jailbreak attacks, exposing limitations in current alignment methods, especially in Portuguese.
Contribution
It demonstrates that versification is an effective adversarial technique against LLMs and highlights the need for evaluation protocols that account for linguistic variation in Portuguese.
Findings
Versification produces up to 18x more safety failures in benchmarks derived from MLCommons AILuminate.
Manually written poems reach an attack success rate (ASR) of approximately 62%; automated versions reach 43%.
Some models surpass 90% success in single-turn interactions.
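The percentages above are attack success rates: the share of prompts for which a model's response is judged a safety failure. The sketch below shows one way such a rate, and the verse-to-prose ratio behind an "N times more failures" figure, could be computed. The `Outcome` record, the function names, and the toy counts are hypothetical illustrations, not the paper's data or evaluation code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    """One judged model response (hypothetical record layout)."""
    prompt_id: str
    style: str    # "prose" or "verse"
    unsafe: bool  # True if a judge labeled the response a safety failure


def attack_success_rate(outcomes, style):
    """Fraction of responses for the given style that were judged unsafe."""
    selected = [o for o in outcomes if o.style == style]
    if not selected:
        raise ValueError(f"no outcomes recorded for style {style!r}")
    return sum(o.unsafe for o in selected) / len(selected)


def failure_ratio(outcomes, baseline="prose", variant="verse"):
    """How many times more often the variant style fails than the baseline."""
    base = attack_success_rate(outcomes, baseline)
    if base == 0:
        raise ValueError("baseline ASR is zero; ratio is undefined")
    return attack_success_rate(outcomes, variant) / base


# Toy data (assumed, not from the paper): 8 prompts, each judged in both styles.
toy_outcomes = (
    [Outcome(f"p{i}", "prose", unsafe=(i < 1)) for i in range(8)]
    + [Outcome(f"p{i}", "verse", unsafe=(i < 4)) for i in range(8)]
)

prose_asr = attack_success_rate(toy_outcomes, "prose")
verse_asr = attack_success_rate(toy_outcomes, "verse")
ratio = failure_ratio(toy_outcomes)
print(f"prose ASR={prose_asr:.3f} verse ASR={verse_asr:.3f} ratio={ratio:.1f}x")
```

Keeping the ratio separate from the raw rates matters when reading the findings: a large multiplier can come from a very low prose baseline, so both numbers are worth reporting together.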
Abstract
Recent evidence shows that the versification of prompts constitutes a highly effective adversarial mechanism against aligned LLMs. The study 'Adversarial poetry as a universal single-turn jailbreak mechanism in large language models' demonstrates that instructions routinely refused in prose become executable when rewritten as verse, producing up to 18x more safety failures in benchmarks derived from MLCommons AILuminate. Manually written poems reach an attack success rate (ASR) of approximately 62%, and automated versions 43%, with some models surpassing 90% success in single-turn interactions. The effect is structural: systems trained with RLHF, constitutional AI, and hybrid pipelines exhibit consistent degradation under minimal semiotic-formal variation. Versification displaces the prompt into sparsely supervised latent regions, revealing guardrails that are excessively dependent on surface patterns. This…
Taxonomy
Topics: Language and cultural evolution · Phonetics and Phonology Research · Natural Language Processing Techniques
