Universal and Transferable Adversarial Attacks on Aligned Language Models
Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, Matt Fredrikson

TL;DR
This paper introduces a simple, automatic, and transferable adversarial attack method that effectively induces objectionable content in aligned language models, including black-box and publicly available models, challenging current alignment defenses.
Contribution
The authors propose a novel automatic attack technique that generates transferable suffix prompts to bypass alignment measures in large language models, improving over previous manual and automatic methods.
Findings
Adversarial suffixes can induce objectionable content across multiple models.
The attack method is effective against both open-source and commercial LLMs.
Generated prompts are highly transferable, including to black-box models like ChatGPT and Bard.
Abstract
Because "out-of-the-box" large language models are capable of generating a great deal of objectionable content, recent work has focused on aligning these models in an attempt to prevent undesirable generation. While there has been some success at circumventing these measures -- so-called "jailbreaks" against LLMs -- these attacks have required significant human ingenuity and are brittle in practice. In this paper, we propose a simple and effective attack method that causes aligned language models to generate objectionable behaviors. Specifically, our approach finds a suffix that, when attached to a wide range of queries for an LLM to produce objectionable content, aims to maximize the probability that the model produces an affirmative response (rather than refusing to answer). However, instead of relying on manual engineering, our approach automatically produces these adversarial…
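The objective described in the abstract can be written compactly. The following is a sketch only, not notation taken from this page: the symbols $x^{(i)}$ (the $i$-th query), $s$ (the shared suffix of $k$ tokens), $y^{(i)}$ (an affirmative target response), $p_\theta$ (the model's next-token distribution), $\mathcal{V}$ (the vocabulary), and $\oplus$ (concatenation) are all assumptions introduced for illustration.

```latex
% Assumed notation: one suffix s is shared across m queries x^{(1)}, ..., x^{(m)}.
% Maximizing the probability of an affirmative response is equivalent to
% minimizing its negative log-likelihood, summed over the queries.
\[
  \min_{s \in \mathcal{V}^{k}} \;
  \sum_{i=1}^{m} -\log p_\theta\!\left( y^{(i)} \,\middle|\, x^{(i)} \oplus s \right),
  \qquad
  p_\theta\!\left( y \mid x \oplus s \right)
  = \prod_{t=1}^{|y|} p_\theta\!\left( y_t \mid x \oplus s,\; y_{<t} \right).
\]
```

Under this reading, the sum over queries is what makes the suffix "universal": a single $s$ has to lower the loss for every query at once rather than being tuned to one prompt.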
Code & Models
- 🤗 meta-llama/Llama-Guard-3-11B-Vision (model) · 2.4k dl · ♡ 70
- 🤗 SinclairSchneider/Llama-Guard-3-11B-Vision (model) · 7 dl · ♡ 2
- 🤗 recursivelabsai/model-evaluation-infrastructure (model)
- 🤗 CTCT-CT2/changeway_guardrails (model) · 10 dl · ♡ 2
- 🤗 Repoaner/llama_guard_vision (model) · 1 dl
- 🤗 DavidTKeane/cyberranger-v42 (model) · 51 dl · ♡ 1
Taxonomy
Topics: Topic Modeling · Adversarial Robustness in Machine Learning · Natural Language Processing Techniques
Methods: Pythia
