AutoDAN-Turbo: A Lifelong Agent for Strategy Self-Exploration to Jailbreak LLMs
Xiaogeng Liu, Peiran Li, Edward Suh, Yevgeniy Vorobeychik, Zhuoqing Mao, Somesh Jha, Patrick McDaniel, Huan Sun, Bo Li, Chaowei Xiao

TL;DR
AutoDAN-Turbo is an automated, black-box approach that discovers and utilizes jailbreak strategies to effectively attack large language models, significantly surpassing existing methods in success rate without human input.
Contribution
The paper introduces AutoDAN-Turbo, a framework that automatically discovers and applies jailbreak strategies, outperforms baseline methods, and can incorporate human-designed strategies in a plug-and-play manner to raise its attack success rate further.
Findings
Achieves a 74.3% higher average attack success rate than baseline methods on public benchmarks.
Reaches 88.5% success on GPT-4-1106-turbo.
Improves to 93.4% success when incorporating human strategies.
Abstract
In this paper, we propose AutoDAN-Turbo, a black-box jailbreak method that can automatically discover as many jailbreak strategies as possible from scratch, without any human intervention or predefined scopes (e.g., specified candidate strategies), and use them for red-teaming. As a result, AutoDAN-Turbo significantly outperforms baseline methods, achieving a 74.3% higher average attack success rate on public benchmarks. Notably, AutoDAN-Turbo achieves an 88.5% attack success rate on GPT-4-1106-turbo. In addition, AutoDAN-Turbo is a unified framework that can incorporate existing human-designed jailbreak strategies in a plug-and-play manner. By integrating human-designed strategies, AutoDAN-Turbo achieves an even higher attack success rate of 93.4% on GPT-4-1106-turbo.
Taxonomy
Topics: Digital and Cyber Forensics · Cybercrime and Law Enforcement Studies · Advanced Malware Detection Techniques
