AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models

Sicheng Zhu; Ruiyi Zhang; Bang An; Gang Wu; Joe Barrow; Zichao Wang; Furong Huang; Ani Nenkova; Tong Sun

arXiv:2310.15140·cs.CR·December 15, 2023·5 cites

TL;DR

AutoDAN is a gradient-based adversarial attack that generates readable, interpretable prompts. These prompts bypass perplexity-based filters, exposing vulnerabilities in large language models and supporting red-teaming work.

Contribution

AutoDAN introduces a novel, interpretable, gradient-based attack that produces readable prompts, combining strengths of manual and automatic jailbreak strategies, and demonstrating improved transferability and versatility.

Findings

01. AutoDAN generates readable prompts that bypass perplexity filters.

02. AutoDAN's prompts transfer to black-box models better than unreadable adversarial prompts when training data is limited.

03. AutoDAN can automatically leak system prompts using a customized objective.

Abstract

Safety alignment of Large Language Models (LLMs) can be compromised with manual jailbreak attacks and (automatic) adversarial attacks. Recent studies suggest that defending against these attacks is possible: adversarial attacks generate unlimited but unreadable gibberish prompts, detectable by perplexity-based filters; manual jailbreak attacks craft readable prompts, but their limited number due to the necessity of human creativity allows for easy blocking. In this paper, we show that these solutions may be too optimistic. We introduce AutoDAN, an interpretable, gradient-based adversarial attack that merges the strengths of both attack types. Guided by the dual goals of jailbreak and readability, AutoDAN optimizes and generates tokens one by one from left to right, resulting in readable prompts that bypass perplexity filters while maintaining high attack success rates. Notably, these…
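
The perplexity-based filter mentioned in the abstract can be illustrated with a minimal sketch. It assumes per-token natural-log probabilities have already been obtained from some language model; the function names, the threshold of 50, and the example probabilities are all hypothetical and are not taken from the paper.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Return exp of the negative mean per-token natural-log probability."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def passes_filter(token_logprobs: Sequence[float], threshold: float) -> bool:
    """Accept a prompt only if its perplexity is at or below the threshold."""
    return perplexity(token_logprobs) <= threshold


# Illustrative, made-up numbers (assumption): a fluent prompt whose tokens
# each have probability 0.25, and a gibberish prompt whose tokens each
# have probability 0.001 under the scoring model.
fluent = [math.log(0.25)] * 8
gibberish = [math.log(0.001)] * 8

THRESHOLD = 50.0  # hypothetical cutoff; a real filter would tune this value

print(passes_filter(fluent, THRESHOLD))
print(passes_filter(gibberish, THRESHOLD))
```

A filter of this kind rejects prompts that the scoring model finds highly improbable, which is why unreadable adversarial suffixes tend to be caught while fluent text, including the readable prompts the abstract describes, tends to pass.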

Code & Models

Repositories

rotaryhammer/code-autodan
pytorch

Datasets

furonghuang-lab/PHTest
dataset · 53 downloads

Taxonomy

Topics: Topic Modeling · Adversarial Robustness in Machine Learning · Natural Language Processing Techniques