Rewrite to Jailbreak: Discover Learnable and Transferable Implicit Harmfulness Instruction
Yuting Huang, Chengyuan Liu, Yifeng Feng, Yiquan Wu, Chao Wu, Fei Wu, Kun Kuang

TL;DR
This paper introduces R2J, a learnable and transferable black-box jailbreak method that rewrites the original instruction instead of relying on explicit jailbreak patterns. It improves attack efficiency and transferability, which highlights safety concerns for LLMs.
Contribution
Proposes R2J, a novel black-box jailbreak approach that automatically rewrites instructions to attack LLMs, demonstrating improved efficiency and transferability over existing methods.
Findings
R2J effectively exploits LLM weaknesses with few queries.
Jailbreaks are transferable across datasets and models.
Rewriting instructions is a learnable, transferable attack strategy.
Abstract
As Large Language Models (LLMs) are widely applied in various domains, the safety of LLMs is increasingly attracting attention to avoid their powerful capabilities being misused. Existing jailbreak methods create a forced instruction-following scenario, or search adversarial prompts with prefix or suffix tokens to achieve a specific representation manually or automatically. However, they suffer from low efficiency and explicit jailbreak patterns, far from the real deployment of mass attacks to LLMs. In this paper, we point out that simply rewriting the original instruction can achieve a jailbreak, and we find that this rewriting approach is learnable and transferable. We propose the Rewrite to Jailbreak (R2J) approach, a transferable black-box jailbreak method to attack LLMs by iteratively exploring the weakness of the LLMs and automatically improving the attacking strategy. The…
Taxonomy
Topics: Legal Education and Practice Innovations
Methods: Softmax · Attention Is All You Need
