Improved Techniques for Optimization-Based Jailbreaking on Large Language Models
Xiaojun Jia, Tianyu Pang, Chao Du, Yihao Huang, Jindong Gu, Yang Liu, Xiaochun Cao, Min Lin

TL;DR
This paper improves optimization-based jailbreaking of large language models by diversifying the target templates and adapting the coordinate-update strategy of GCG. The resulting method reaches a nearly 100% attack success rate on the reported benchmarks, giving red-teamers a stronger tool for safety testing.
Contribution
The paper proposes novel empirical improvements to GCG, including diverse target templates and adaptive multi-coordinate updates, resulting in a more effective jailbreak method called I-GCG.
Findings
Achieves nearly 100% attack success rate on benchmarks
Outperforms state-of-the-art jailbreaking attacks
Demonstrates effectiveness of diversified templates and adaptive strategies
Abstract
Large language models (LLMs) are being rapidly developed, and a key component of their widespread deployment is their safety-related alignment. Many red-teaming efforts aim to jailbreak LLMs, where among these efforts, the Greedy Coordinate Gradient (GCG) attack's success has led to a growing interest in the study of optimization-based jailbreaking techniques. Although GCG is a significant milestone, its attacking efficiency remains unsatisfactory. In this paper, we present several improved (empirical) techniques for optimization-based jailbreaks like GCG. We first observe that the single target template of "Sure" largely limits the attacking performance of GCG; given this, we propose to apply diverse target templates containing harmful self-suggestion and/or guidance to mislead LLMs. Besides, from the optimization aspects, we propose an automatic multi-coordinate updating strategy in…
Taxonomy
Topics: Digital and Cyber Forensics · Hate Speech and Cyberbullying Detection
