Improved Generation of Adversarial Examples Against Safety-aligned LLMs

Qizhang Li; Yiwen Guo; Wangmeng Zuo; Hao Chen

arXiv:2405.20778·cs.CR·November 4, 2024

TL;DR

This paper combines ideas from transfer-based attacks with gradient-based adversarial prompt generation, significantly improving attack success rates against safety-aligned LLMs at no extra computational cost.

Contribution

It adapts two transfer-based attack methods, the Skip Gradient Method and the Intermediate Level Attack, to gradient-based adversarial prompt generation, achieving higher attack success rates against safety-aligned LLMs.

Findings

01. 87% success rate in inducing target outputs with adversarial suffixes

02. 33% higher match rate compared to baseline GCG

03. >30% increase in attack success rates for both query-specific and universal prompts

Abstract

Adversarial prompts generated using gradient-based methods exhibit outstanding performance in automatic jailbreak attacks against safety-aligned LLMs. Nevertheless, due to the discrete nature of texts, the input gradient of LLMs struggles to precisely reflect the magnitude of loss change that results from token replacements in the prompt, leading to limited attack success rates against safety-aligned LLMs, even in the white-box setting. In this paper, we explore a new perspective on this problem, suggesting that it can be alleviated by leveraging innovations inspired by transfer-based attacks that were originally proposed for attacking black-box image classification models. For the first time, we adapt the ideas of effective methods among these transfer-based attacks, i.e., Skip Gradient Method and Intermediate Level Attack, into gradient-based adversarial prompt…
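
The gap the abstract describes, between what the input gradient predicts and what a token replacement actually does to the loss, can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the four-token vocabulary, the two-dimensional embeddings, and the quadratic `loss` function stand in for a real LLM and its target loss, and the code is not taken from the paper or its repository. It compares the first-order estimate of the loss change (the gradient dotted with the change in embedding) against the true change for each candidate token.

```python
# Toy illustration (hypothetical values, not the paper's code): the input
# gradient gives only a first-order estimate of the loss change caused by a
# discrete token replacement, and that estimate can rank candidates wrongly.

# Made-up vocabulary of 4 tokens with 2-d embeddings.
EMBED = [
    [0.0, 0.0],  # token 0 (the token currently in the prompt)
    [0.5, 0.0],  # token 1
    [2.0, 0.0],  # token 2
    [0.0, 1.0],  # token 3
]


def loss(e):
    """Nonlinear toy loss standing in for the LLM's target loss."""
    return (e[0] - 1.0) ** 2 + 0.5 * e[1]


def grad_loss(e):
    """Analytic gradient of the toy loss with respect to the embedding."""
    return [2.0 * (e[0] - 1.0), 0.5]


def estimated_change(current, candidate):
    """First-order (gradient-based) estimate of the loss change."""
    g = grad_loss(EMBED[current])
    delta = [c - o for c, o in zip(EMBED[candidate], EMBED[current])]
    return sum(gi * di for gi, di in zip(g, delta))


def actual_change(current, candidate):
    """True loss change after performing the replacement."""
    return loss(EMBED[candidate]) - loss(EMBED[current])


current = 0
candidates = [1, 2, 3]

# Token the gradient estimate prefers vs. the token that is truly best.
best_by_gradient = min(candidates, key=lambda t: estimated_change(current, t))
best_by_loss = min(candidates, key=lambda t: actual_change(current, t))

for t in candidates:
    print(t, estimated_change(current, t), actual_change(current, t))
print("gradient picks", best_by_gradient, "but true best is", best_by_loss)
```

In this toy setting the gradient estimate favors token 2 because it lies far along the descent direction, yet the replacement overshoots the minimum and leaves the loss unchanged, while token 1 gives the largest real decrease. Gradient-based attacks such as GCG soften this problem by using the gradient only to shortlist candidates and then evaluating the true loss on them.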

Code & Models

Repositories

qizhangli/Gradient-based-Jailbreak-Attacks
PyTorch · Official

Taxonomy

Topics: Security and Verification in Computing · Adversarial Robustness in Machine Learning · Software Testing and Debugging Techniques