AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models
Sicheng Zhu, Ruiyi Zhang, Bang An, Gang Wu, Joe Barrow, Zichao Wang, Furong Huang, Ani Nenkova, Tong Sun

TL;DR
AutoDAN is a gradient-based adversarial attack that generates readable, interpretable prompts. These prompts bypass perplexity-based filters and expose vulnerabilities in large language models, which makes the method useful for red-teaming and for understanding how jailbreaks work.
Contribution
AutoDAN introduces an interpretable, gradient-based attack that produces readable prompts, combining the strengths of manual jailbreaks and automatic adversarial attacks. The resulting prompts show improved transferability and versatility.
Findings
AutoDAN generates readable prompts that bypass perplexity filters.
AutoDAN's prompts transfer to black-box models better than unreadable adversarial prompts do when training data is limited.
AutoDAN can automatically leak system prompts using customized objectives.
Abstract
Safety alignment of Large Language Models (LLMs) can be compromised with manual jailbreak attacks and (automatic) adversarial attacks. Recent studies suggest that defending against these attacks is possible: adversarial attacks generate unlimited but unreadable gibberish prompts, detectable by perplexity-based filters; manual jailbreak attacks craft readable prompts, but their limited number due to the necessity of human creativity allows for easy blocking. In this paper, we show that these solutions may be too optimistic. We introduce AutoDAN, an interpretable, gradient-based adversarial attack that merges the strengths of both attack types. Guided by the dual goals of jailbreak and readability, AutoDAN optimizes and generates tokens one by one from left to right, resulting in readable prompts that bypass perplexity filters while maintaining high attack success rates. Notably, these…
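The perplexity-based filtering mentioned in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the threshold, and the per-token log-probabilities below are all hypothetical, and a real filter would obtain the log-probabilities from a language model rather than from hard-coded lists.

```python
import math


def perplexity(token_logprobs):
    """Perplexity: exp of the mean negative log-likelihood per token.

    token_logprobs holds natural-log probabilities, one per token.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def passes_filter(token_logprobs, threshold):
    """Accept a prompt only if its perplexity is at or below the threshold."""
    return perplexity(token_logprobs) <= threshold


# Hypothetical log-probabilities (assumed values, not measured data):
# fluent text gets moderate values, gibberish gets very negative ones.
readable = [-1.2, -0.8, -2.0, -1.5]
gibberish = [-9.0, -11.5, -8.2, -10.3]

THRESHOLD = 100.0  # assumed cutoff, chosen only for illustration

print(passes_filter(readable, THRESHOLD))
print(passes_filter(gibberish, THRESHOLD))
```

Under these assumed numbers the fluent sequence falls below the cutoff and the gibberish sequence is rejected, which is the behavior such filters rely on. The abstract's point is that a prompt optimized for readability would score like the first case and therefore pass.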
Taxonomy
Topics: Topic Modeling · Adversarial Robustness in Machine Learning · Natural Language Processing Techniques
