An Optimizable Suffix Is Worth A Thousand Templates: Efficient Black-box Jailbreaking without Affirmative Phrases via LLM as Optimizer
Weipeng Jiang, Zhenting Wang, Juan Zhai, Shiqing Ma, Zhengyu Zhao,, Chao Shen

TL;DR
ECLIPSE introduces an efficient black-box jailbreaking approach that uses optimizable suffixes and self-reflection to generate harmful content with high success rates, outperforming existing methods in efficiency and effectiveness.
Contribution
The paper proposes ECLIPSE, a novel black-box jailbreaking method leveraging optimizable suffixes and iterative feedback, eliminating the need for manual templates or white-box access.
Findings
Achieves 0.92 attack success rate across multiple LLMs.
Reduces attack overhead by 83% compared to previous methods.
Surpasses GCG in effectiveness, matching template-based methods.
Abstract
Despite prior safety alignment efforts, mainstream LLMs can still generate harmful and unethical content when subjected to jailbreaking attacks. Existing jailbreaking methods fall into two main categories: template-based and optimization-based methods. The former requires significant manual effort and domain knowledge, while the latter, exemplified by Greedy Coordinate Gradient (GCG), which seeks to maximize the likelihood of harmful LLM outputs through token-level optimization, also encounters several limitations: requiring white-box access, necessitating pre-constructed affirmative phrase, and suffering from low efficiency. In this paper, we present ECLIPSE, a novel and efficient black-box jailbreaking method utilizing optimizable suffixes. Drawing inspiration from LLMs' powerful generation and optimization capabilities, we employ task prompts to translate jailbreaking goals into…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsHandwritten Text Recognition Techniques · Digital Media Forensic Detection · Adversarial Robustness in Machine Learning
MethodsRefunds@Expedia|||How do I get a full refund from Expedia? · 15 Ways to Contact How can i speak to someone at Delta Airlines · Attention Is All You Need · Cosine Annealing · Adam · Weight Decay · Dense Connections · Byte Pair Encoding · Softmax · Linear Layer
