Universal and Transferable Adversarial Attacks on Aligned Language Models

Andy Zou; Zifan Wang; Nicholas Carlini; Milad Nasr; J. Zico Kolter; Matt Fredrikson

arXiv:2307.15043·cs.CL·December 22, 2023·183 cites

Open Access · 5 Repos · 6 Models · 5 Datasets

TL;DR

This paper introduces a simple, automatic, and transferable adversarial attack method that effectively induces objectionable content in aligned language models, including black-box and publicly available models, challenging current alignment defenses.

Contribution

The authors propose a novel automatic attack technique that generates transferable suffix prompts to bypass alignment measures in large language models, improving over previous manual and automatic methods.

Findings

01. Adversarial suffixes can induce objectionable content across multiple models.

02. The attack method is effective against both open-source and commercial LLMs.

03. Generated prompts are highly transferable, including to black-box models like ChatGPT and Bard.

Abstract

Because "out-of-the-box" large language models are capable of generating a great deal of objectionable content, recent work has focused on aligning these models in an attempt to prevent undesirable generation. While there has been some success at circumventing these measures -- so-called "jailbreaks" against LLMs -- these attacks have required significant human ingenuity and are brittle in practice. In this paper, we propose a simple and effective attack method that causes aligned language models to generate objectionable behaviors. Specifically, our approach finds a suffix that, when attached to a wide range of queries for an LLM to produce objectionable content, aims to maximize the probability that the model produces an affirmative response (rather than refusing to answer). However, instead of relying on manual engineering, our approach automatically produces these adversarial…
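The objective described in the abstract can be written down compactly. The block below is a sketch under stated assumptions, not a quotation of the paper's equations: the symbols (prompt tokens $x_{1:n}$, suffix index set $\mathcal{I}$, affirmative target $x^{*}_{n+1:n+H}$, vocabulary size $V$) are notation introduced here for illustration. It expresses "maximize the probability of an affirmative response" as minimizing the negative log-likelihood of a target opening, with only the suffix tokens free to change.

```latex
% Probability that the model continues the prompt x_{1:n} with the
% affirmative target x^*_{n+1:n+H} (chain rule over target tokens):
p\left(x^{*}_{n+1:n+H} \mid x_{1:n}\right)
  = \prod_{i=1}^{H} p\left(x^{*}_{n+i} \mid x_{1:n},\, x^{*}_{n+1:n+i-1}\right)

% Loss: negative log-likelihood of the affirmative target.
\mathcal{L}\left(x_{1:n}\right) = -\log p\left(x^{*}_{n+1:n+H} \mid x_{1:n}\right)

% Only the suffix positions \mathcal{I} \subset \{1,\dots,n\} are optimized;
% each suffix token ranges over the discrete vocabulary \{1,\dots,V\}.
\min_{x_{\mathcal{I}} \in \{1,\dots,V\}^{|\mathcal{I}|}} \mathcal{L}\left(x_{1:n}\right)
```

Because the search space is discrete, this is a combinatorial problem rather than a standard continuous optimization. A suffix meant to work across "a wide range of queries" would, under the same assumed notation, minimize the sum of such losses over several prompts that share the same suffix tokens.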

Taxonomy

Topics: Topic Modeling · Adversarial Robustness in Machine Learning · Natural Language Processing Techniques

Methods: Pythia