Plentiful Jailbreaks with String Compositions
Brian R.Y. Huang

TL;DR
This paper introduces a framework of invertible string transformations to generate diverse and effective jailbreak attacks on large language models, revealing persistent vulnerabilities in current models.
Contribution
It unifies encoding-based attacks into a compositional framework and develops an automated method to generate numerous effective jailbreak strings.
Findings
The proposed attack obtains competitive attack success rates on several leading frontier models when evaluated on HarmBench.
Encoding-based attacks remain a significant vulnerability.
Automated composition sampling enhances attack diversity.
Abstract
Large language models (LLMs) remain vulnerable to a slew of adversarial attacks and jailbreaking methods. One common approach employed by white-hat attackers, or red-teamers, is to process model inputs and outputs using string-level obfuscations, which can include leetspeak, rotary ciphers, Base64, ASCII, and more. Our work extends these encoding-based attacks by unifying them in a framework of invertible string transformations. With invertibility, we can devise arbitrary string compositions, defined as sequences of transformations, that we can encode and decode end-to-end programmatically. We devise an automated best-of-n attack that samples from a combinatorially large number of string compositions. Our jailbreaks obtain competitive attack success rates on several leading frontier models when evaluated on HarmBench, highlighting that encoding-based attacks remain a persistent…
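To make the notion of an invertible string composition concrete, the following is a minimal sketch, not the paper's implementation. The `Transform` type, the three example transformations, and the helper names (`compose_encode`, `compose_decode`, `sample_composition`) are all assumptions introduced for illustration. It shows only the property the abstract relies on: if every transformation has an inverse, any sequence of them can be encoded and then decoded programmatically by applying the inverses in reverse order.

```python
import base64
import codecs
import random
from typing import Callable, List, NamedTuple


class Transform(NamedTuple):
    """Hypothetical container pairing a string transformation with its inverse."""
    name: str
    encode: Callable[[str], str]
    decode: Callable[[str], str]


# Three illustrative transformations, each exactly invertible.
ROT13 = Transform(
    "rot13",
    lambda s: codecs.encode(s, "rot13"),
    lambda s: codecs.decode(s, "rot13"),
)
BASE64 = Transform(
    "base64",
    lambda s: base64.b64encode(s.encode("utf-8")).decode("ascii"),
    lambda s: base64.b64decode(s.encode("ascii")).decode("utf-8"),
)
REVERSE = Transform("reverse", lambda s: s[::-1], lambda s: s[::-1])


def compose_encode(text: str, transforms: List[Transform]) -> str:
    """Apply each transformation in order."""
    for t in transforms:
        text = t.encode(text)
    return text


def compose_decode(text: str, transforms: List[Transform]) -> str:
    """Undo a composition by applying the inverses in reverse order."""
    for t in reversed(transforms):
        text = t.decode(text)
    return text


def sample_composition(pool: List[Transform], length: int,
                       rng: random.Random) -> List[Transform]:
    """Draw a random sequence of transformations (with repetition) from the pool."""
    return [rng.choice(pool) for _ in range(length)]


if __name__ == "__main__":
    pool = [ROT13, BASE64, REVERSE]
    composition = sample_composition(pool, length=3, rng=random.Random(0))
    message = "hello world"
    encoded = compose_encode(message, composition)
    # The round trip recovers the original string for any sampled composition.
    assert compose_decode(encoded, composition) == message
    print([t.name for t in composition], encoded)
```

With a pool of k transformations and sequences of length n, this sketch can draw from k^n ordered compositions, which is one way to read the abstract's "combinatorially large" space. A transformation such as leetspeak would need a carefully defined inverse before it could be added to the pool.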
Taxonomy
Topics: Digital and Cyber Forensics
