Distract Large Language Models for Automatic Jailbreak Attack
Zeguan Xiao, Yan Yang, Guanhua Chen, Yun Chen

TL;DR
This paper introduces a novel black-box framework for automatically jailbreaking large language models, revealing vulnerabilities and emphasizing the need for improved defense strategies.
Contribution
It presents an iterative optimization-based method for black-box jailbreak attacks that combines malicious content concealing with memory reframing, and it reports better effectiveness, scalability, and transferability than existing approaches.
Findings
The framework successfully jailbreaks various open-source and proprietary LLMs.
Existing defenses are ineffective against the proposed attack.
The study highlights the urgent need for better LLM safety measures.
Abstract
Extensive efforts have been made before the public release of large language models (LLMs) to align their behaviors with human values. However, even meticulously aligned LLMs remain vulnerable to malicious manipulations such as jailbreaking, leading to unintended behaviors. In this work, we propose a novel black-box jailbreak framework for automated red teaming of LLMs. We design malicious content concealing and memory reframing with an iterative optimization algorithm to jailbreak LLMs, motivated by research on the distractibility and over-confidence phenomena of LLMs. Extensive experiments on jailbreaking both open-source and proprietary LLMs demonstrate the superiority of our framework in terms of effectiveness, scalability, and transferability. We also evaluate the effectiveness of existing jailbreak defense methods against our attack and highlight the crucial need to…
Taxonomy
Topics: Digital and Cyber Forensics · Hate Speech and Cyberbullying Detection · Cybercrime and Law Enforcement Studies
Methods: ALIGN
