Improved Techniques for Optimization-Based Jailbreaking on Large Language Models

Xiaojun Jia; Tianyu Pang; Chao Du; Yihao Huang; Jindong Gu; Yang Liu; Xiaochun Cao; Min Lin

arXiv:2405.21018 · cs.LG · June 6, 2024 · 3 cites

TL;DR

This paper improves GCG-style optimization-based jailbreaking of large language models. Diverse target templates and an automatic multi-coordinate updating strategy raise attack success rates, giving red-teaming and safety testing a stronger baseline attack.

Contribution

The paper proposes novel empirical improvements to GCG, including diverse target templates and adaptive multi-coordinate updates, resulting in a more effective jailbreak method called I-GCG.

Findings

01 Achieves nearly 100% attack success rate on benchmarks

02 Outperforms state-of-the-art jailbreaking attacks

03 Demonstrates effectiveness of diversified templates and adaptive strategies

Abstract

Large language models (LLMs) are being rapidly developed, and a key component of their widespread deployment is their safety-related alignment. Many red-teaming efforts aim to jailbreak LLMs; among these, the success of the Greedy Coordinate Gradient (GCG) attack has led to growing interest in optimization-based jailbreaking techniques. Although GCG is a significant milestone, its attack efficiency remains unsatisfactory. In this paper, we present several improved (empirical) techniques for optimization-based jailbreaks like GCG. We first observe that the single target template of "Sure" largely limits the attack performance of GCG; given this, we propose applying diverse target templates containing harmful self-suggestion and/or guidance to mislead LLMs. On the optimization side, we propose an automatic multi-coordinate updating strategy in…

Code & Models

Repositories

jiaxiaojunqaq/i-gcg (PyTorch, official)

Taxonomy

Topics: Digital and Cyber Forensics · Hate Speech and Cyberbullying Detection