Improved Generation of Adversarial Examples Against Safety-aligned LLMs
Qizhang Li, Yiwen Guo, Wangmeng Zuo, Hao Chen

TL;DR
This paper brings ideas from transfer-based attacks into gradient-based adversarial prompt generation, significantly improving attack success rates against safety-aligned LLMs without extra computational cost.
Contribution
The paper adapts transfer-based attack methods, namely the Skip Gradient Method and the Intermediate Level Attack, to gradient-based prompt generation, achieving higher attack success rates against safety-aligned LLMs.
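For readers unfamiliar with the Skip Gradient Method, the sketch below illustrates the idea as it was originally described for residual networks in the image domain: during backpropagation, the derivative of each residual branch is scaled by a factor gamma, so the skip connections dominate the gradient. It is a toy under stated assumptions, not the paper's LLM adaptation (which this summary does not detail); the scalar residual stack and the names forward, input_gradient, and gamma are hypothetical.

```python
import math

def forward(x, weights):
    """Run a scalar residual stack z <- z + w * tanh(z); return all activations."""
    acts = [x]
    for w in weights:
        z = acts[-1]
        acts.append(z + w * math.tanh(z))
    return acts

def input_gradient(x, weights, gamma=1.0):
    """Gradient of the final activation w.r.t. x.

    The derivative of each residual branch is scaled by gamma.
    gamma=1.0 gives the exact gradient; gamma<1.0 down-weights the
    branches relative to the skip connections (the skip-gradient idea).
    """
    acts = forward(x, weights)
    grad = 1.0
    # Walk the blocks from last to first, as backpropagation would.
    for z, w in zip(reversed(acts[:-1]), reversed(weights)):
        branch = w * (1 - math.tanh(z) ** 2)  # d/dz of w * tanh(z)
        grad *= 1.0 + gamma * branch          # skip path contributes the 1.0
    return grad

weights = [0.5, -0.8, 1.2]  # hypothetical branch weights
exact = input_gradient(0.3, weights)
damped = input_gradient(0.3, weights, gamma=0.5)
print(f"exact gradient: {exact:.4f}, skip-weighted gradient: {damped:.4f}")
```

With gamma set to 0 the gradient flows only through the skip connections, and with gamma set to 1 it is the ordinary gradient.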
Findings
87% success rate in inducing target outputs with adversarial suffixes
33% higher match rate compared to baseline GCG
>30% increase in attack success rates for both query-specific and universal prompts
Abstract
Adversarial prompts generated using gradient-based methods exhibit outstanding performance in performing automatic jailbreak attacks against safety-aligned LLMs. Nevertheless, due to the discrete nature of texts, the input gradient of LLMs struggles to precisely reflect the magnitude of loss change that results from token replacements in the prompt, leading to limited attack success rates against safety-aligned LLMs, even in the white-box setting. In this paper, we explore a new perspective on this problem, suggesting that it can be alleviated by leveraging innovations inspired by transfer-based attacks that were originally proposed for attacking black-box image classification models. For the first time, we appropriate the ideologies of effective methods among these transfer-based attacks, i.e., Skip Gradient Method and Intermediate Level Attack, into gradient-based adversarial prompt…
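The abstract's point about discreteness can be made concrete with a small sketch. It compares the first-order (gradient-based) estimate of the loss change from swapping one token against the actual loss change. Everything here is a hypothetical toy, assuming a random embedding table and a simple tanh loss in place of a real LLM; it is not the paper's setup.

```python
import math
import random

def loss(token_ids, emb, target):
    """Toy nonlinear loss over the summed token embeddings."""
    dim = len(target)
    h = [sum(emb[t][j] for t in token_ids) for j in range(dim)]
    return sum((math.tanh(h[j]) - target[j]) ** 2 for j in range(dim))

def embedding_gradient(token_ids, emb, target):
    """Analytic gradient of the loss w.r.t. the embedding at any one position."""
    dim = len(target)
    h = [sum(emb[t][j] for t in token_ids) for j in range(dim)]
    return [2 * (math.tanh(h[j]) - target[j]) * (1 - math.tanh(h[j]) ** 2)
            for j in range(dim)]

def compare_replacements(token_ids, pos, emb, target):
    """Return (candidate, first-order estimate, actual loss change) per candidate."""
    grad = embedding_gradient(token_ids, emb, target)
    base = loss(token_ids, emb, target)
    cur = emb[token_ids[pos]]
    rows = []
    for cand in range(len(emb)):
        if cand == token_ids[pos]:
            continue
        # Linear estimate: gradient dotted with the jump in embedding space.
        est = sum((emb[cand][j] - cur[j]) * grad[j] for j in range(len(grad)))
        swapped = token_ids[:pos] + [cand] + token_ids[pos + 1:]
        actual = loss(swapped, emb, target) - base
        rows.append((cand, est, actual))
    return rows

random.seed(0)
VOCAB, DIM = 8, 4  # hypothetical toy sizes
emb = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(VOCAB)]
target = [0.5, -0.5, 0.0, 0.25]
prompt = [1, 4, 6]
rows = compare_replacements(prompt, 1, emb, target)
for cand, est, actual in rows:
    print(f"token {cand}: estimated {est:+.3f}, actual {actual:+.3f}")
```

Because a token swap is a large jump in embedding space rather than an infinitesimal step, the linear estimate and the actual change generally disagree in magnitude and sometimes in sign, which is the gap the abstract describes.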
Taxonomy
Topics: Security and Verification in Computing · Adversarial Robustness in Machine Learning · Software Testing and Debugging Techniques
