An Optimizable Suffix Is Worth A Thousand Templates: Efficient Black-box Jailbreaking without Affirmative Phrases via LLM as Optimizer

Weipeng Jiang; Zhenting Wang; Juan Zhai; Shiqing Ma; Zhengyu Zhao; Chao Shen

arXiv:2408.11313·cs.AI·April 3, 2025

TL;DR

ECLIPSE introduces an efficient black-box jailbreaking approach that uses optimizable suffixes and self-reflection to generate harmful content with high success rates, outperforming existing methods in efficiency and effectiveness.

Contribution

The paper proposes ECLIPSE, a novel black-box jailbreaking method leveraging optimizable suffixes and iterative feedback, eliminating the need for manual templates or white-box access.

Findings

01 Achieves 0.92 attack success rate across multiple LLMs.

02 Reduces attack overhead by 83% compared to previous methods.

03 Surpasses GCG in effectiveness, matching template-based methods.

Abstract

Despite prior safety alignment efforts, mainstream LLMs can still generate harmful and unethical content when subjected to jailbreaking attacks. Existing jailbreaking methods fall into two main categories: template-based and optimization-based methods. The former requires significant manual effort and domain knowledge, while the latter, exemplified by Greedy Coordinate Gradient (GCG), which seeks to maximize the likelihood of harmful LLM outputs through token-level optimization, also encounters several limitations: requiring white-box access, necessitating pre-constructed affirmative phrases, and suffering from low efficiency. In this paper, we present ECLIPSE, a novel and efficient black-box jailbreaking method utilizing optimizable suffixes. Drawing inspiration from LLMs' powerful generation and optimization capabilities, we employ task prompts to translate jailbreaking goals into…

Code & Models

Repositories

lenijwp/eclipse (PyTorch, official)

Taxonomy

Topics: Handwritten Text Recognition Techniques · Digital Media Forensic Detection · Adversarial Robustness in Machine Learning

Methods: Attention Is All You Need · Cosine Annealing · Adam · Weight Decay · Dense Connections · Byte Pair Encoding · Softmax · Linear Layer