AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbreaking Both Open and Closed LLMs

Zeyi Liao; Huan Sun

arXiv:2404.07921 · cs.CL · November 26, 2024 · 3 cites

Open Access 1 Repo 6 Models

TL;DR

This paper introduces AmpleGCG, a generative model that efficiently produces adversarial suffixes to jailbreak both open and closed large language models, significantly improving attack success rates and transferability.

Contribution

AmpleGCG is a generative model that learns the distribution of adversarial suffixes given a harmful query. This enables rapid, universal, and transferable attacks on a range of open and closed LLMs and outperforms existing attack methods.

Findings

01 Achieves a near-100% attack success rate on Llama-2-7B-chat and Vicuna-7B.

02 Transfers to GPT-3.5 with a 99% attack success rate.

03 Generates 200 suffixes for one harmful query in 4 seconds, increasing attack efficiency.

Abstract

As large language models (LLMs) become increasingly prevalent and integrated into autonomous systems, ensuring their safety is imperative. Despite significant strides toward safety alignment, recent work GCG (Zou et al., 2023) proposes a discrete token optimization algorithm and selects the single suffix with the lowest loss to successfully jailbreak aligned LLMs. In this work, we first discuss the drawbacks of solely picking the suffix with the lowest loss during GCG optimization for jailbreaking and uncover the missed successful suffixes during the intermediate steps. Moreover, we utilize those successful suffixes as training data to learn a generative model, named AmpleGCG, which captures the distribution of adversarial suffixes given a harmful query and enables the rapid generation of hundreds of suffixes for any harmful queries in seconds. AmpleGCG achieves near 100% attack…
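
The abstract's point about lowest-loss selection can be illustrated with a small, self-contained sketch. It uses made-up (step, loss, success) records rather than any real optimization run or model, and the names `Candidate`, `pick_lowest_loss`, and `collect_successes` are hypothetical, not taken from the paper's code. It only shows that keeping the single lowest-loss candidate can discard intermediate candidates that were marked successful.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    step: int      # optimization step at which the candidate was recorded
    loss: float    # loss recorded for the candidate (toy value)
    success: bool  # whether an external judge marked it successful (toy value)


def pick_lowest_loss(candidates):
    """Return the single candidate with the smallest loss."""
    return min(candidates, key=lambda c: c.loss)


def collect_successes(candidates):
    """Return every candidate marked successful, regardless of its loss."""
    return [c for c in candidates if c.success]


# Made-up records for illustration only; these are not results from the paper.
history = [
    Candidate(step=10, loss=1.20, success=False),
    Candidate(step=50, loss=0.80, success=True),
    Candidate(step=120, loss=0.55, success=True),
    Candidate(step=200, loss=0.30, success=False),
]

best = pick_lowest_loss(history)
kept = collect_successes(history)

print(best.step, best.success)  # the lowest-loss candidate and its label
print([c.step for c in kept])   # steps of all candidates marked successful
```

In this toy history the lowest-loss candidate is not marked successful, while two higher-loss candidates from intermediate steps are. That is the kind of case the abstract describes as missed successful suffixes.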

Code & Models

Repositories

osu-nlp-group/amplegcg
PyTorch · Official

Taxonomy

Topics: Adversarial Robustness in Machine Learning

Methods: Attention Is All You Need · Cosine Annealing · Linear Layer · Layer Normalization · Dense Connections · Attention Dropout · Residual Connection · Linear Warmup With Cosine Annealing