StructTransform: A Scalable Attack Surface for Safety-Aligned Large Language Models
Shehel Yoosuf, Temoor Ali, Ahmed Lekssays, Mashael AlSabah, Issa Khalil

TL;DR
This paper introduces structure transformation attacks against the safety alignment of large language models, reports high attack success rates, and shows that current defenses remain vulnerable to them.
Contribution
The authors develop novel structure transformation attack methods, evaluate their effectiveness against state-of-the-art models, and reveal weaknesses in existing safety alignment defenses.
Findings
An adaptive scheme combining structure and content transformations achieves over 96% attack success rate (ASR) with 0% refusals.
Most safety defenses fail completely against these attacks.
Generated malicious content bypasses detection effectively.
Abstract
In this work, we present a series of structure transformation attacks on LLM alignment, where we encode natural language intent using diverse syntax spaces, ranging from simple structure formats and basic query languages (e.g., SQL) to novel spaces and syntaxes created entirely by LLMs. Our extensive evaluation shows that our simplest attacks can achieve close to a 90% success rate, even on strict LLMs (such as Claude 3.5 Sonnet) using SOTA alignment mechanisms. We improve the attack performance further by using an adaptive scheme that combines structure transformations along with existing content transformations, resulting in over 96% ASR with 0% refusals. To generalize our attacks, we explore numerous structure formats, including syntaxes purely generated by LLMs. Our results indicate that such novel syntaxes are easy to generate and result in a high ASR, suggesting that…
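The ASR and refusal figures quoted in the abstract are most naturally read as fractions over a set of judged model responses. The sketch below illustrates that reading only; the three-way labeling (`SUCCESS`, `REFUSAL`, `OTHER`) and the `rates` helper are assumptions made for illustration, since this summary does not state how the authors judge responses.

```python
from enum import Enum


class Verdict(Enum):
    """Hypothetical per-response labels; not taken from the paper."""
    SUCCESS = "success"  # response judged to fulfil the encoded request
    REFUSAL = "refusal"  # model explicitly declined
    OTHER = "other"      # neither, e.g. off-topic or benign output


def rates(verdicts):
    """Return (asr, refusal_rate) as fractions of all judged responses."""
    if not verdicts:
        raise ValueError("need at least one judged response")
    n = len(verdicts)
    asr = sum(v is Verdict.SUCCESS for v in verdicts) / n
    refusal_rate = sum(v is Verdict.REFUSAL for v in verdicts) / n
    return asr, refusal_rate


# Made-up labels for 25 responses: 24 judged successful, 1 neither.
example = [Verdict.SUCCESS] * 24 + [Verdict.OTHER]
asr, refusal_rate = rates(example)
print(f"ASR = {asr:.0%}, refusals = {refusal_rate:.0%}")
```

Under this reading the two rates need not sum to 100%, which is how a result can show a high ASR alongside 0% refusals: the remaining responses are neither successful attacks nor explicit refusals.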
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling
