Plentiful Jailbreaks with String Compositions

Brian R.Y. Huang

arXiv:2411.01084 · cs.CL · December 12, 2024

TL;DR

This paper introduces a framework of invertible string transformations to generate diverse and effective jailbreak attacks on large language models, revealing persistent vulnerabilities in current models.

Contribution

It unifies encoding-based attacks into a compositional framework and develops an automated method to generate numerous effective jailbreak strings.

Findings

01. Competitive attack success rates on several leading frontier models when evaluated on HarmBench.

02. Encoding-based attacks remain a persistent vulnerability.

03. Automated sampling of string compositions increases attack diversity.

Abstract

Large language models (LLMs) remain vulnerable to a slew of adversarial attacks and jailbreaking methods. One common approach employed by white-hat attackers, or red-teamers, is to process model inputs and outputs using string-level obfuscations, which can include leetspeak, rotary ciphers, Base64, ASCII, and more. Our work extends these encoding-based attacks by unifying them in a framework of invertible string transformations. With invertibility, we can devise arbitrary string compositions, defined as sequences of transformations, that we can encode and decode end-to-end programmatically. We devise an automated best-of-n attack that samples from a combinatorially large number of string compositions. Our jailbreaks obtain competitive attack success rates on several leading frontier models when evaluated on HarmBench, highlighting that encoding-based attacks remain a persistent…
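
To make the idea of invertible string compositions concrete, here is a minimal sketch in Python. It is not the paper's implementation: the `Transform` class, the three example transformations, and the helper names (`compose_encode`, `compose_decode`, `sample_composition`) are all assumptions chosen for illustration. It shows only the property the abstract relies on: if every step has an inverse, a sequence of steps can be undone by applying the inverses in reverse order.

```python
import base64
import codecs
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Transform:
    """A hypothetical invertible string transformation (illustrative only)."""
    name: str
    encode: Callable[[str], str]
    decode: Callable[[str], str]


def _b64_encode(s: str) -> str:
    return base64.b64encode(s.encode("utf-8")).decode("ascii")


def _b64_decode(s: str) -> str:
    return base64.b64decode(s.encode("ascii")).decode("utf-8")


# A small example pool; the paper's actual set of transformations may differ.
TRANSFORMS: List[Transform] = [
    Transform("rot13",
              lambda s: codecs.encode(s, "rot_13"),
              lambda s: codecs.decode(s, "rot_13")),
    Transform("base64", _b64_encode, _b64_decode),
    Transform("reverse", lambda s: s[::-1], lambda s: s[::-1]),
]


def compose_encode(text: str, composition: List[Transform]) -> str:
    """Apply each transformation in order."""
    for t in composition:
        text = t.encode(text)
    return text


def compose_decode(text: str, composition: List[Transform]) -> str:
    """Undo a composition by applying the inverses in reverse order."""
    for t in reversed(composition):
        text = t.decode(text)
    return text


def sample_composition(rng: random.Random, length: int) -> List[Transform]:
    """Draw a random sequence of transformations, with repetition allowed."""
    return [rng.choice(TRANSFORMS) for _ in range(length)]
```

With a pool of n transformations and sequences of length k drawn with repetition, this sampler can produce n^k distinct compositions, which is one way to read "combinatorially large" in the abstract. The sketch covers only encoding, decoding, and sampling; it does not model the best-of-n procedure itself.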

Taxonomy

Topics: Digital and Cyber Forensics