Baseline Defenses for Adversarial Attacks Against Aligned Language Models

Neel Jain; Avi Schwarzschild; Yuxin Wen; Gowthami Somepalli; John Kirchenbauer; Ping-yeh Chiang; Micah Goldblum; Aniruddha Saha; Jonas Geiping; Tom Goldstein

arXiv:2309.00614·cs.LG·September 6, 2023·36 cites

TL;DR

This paper evaluates baseline defense strategies against adversarial attacks on large language models, analyzing their effectiveness, practicality, and the unique security challenges compared to computer vision.

Contribution

It provides a systematic evaluation of detection, preprocessing, and adversarial training defenses for LLMs, highlighting their strengths and limitations in various threat models.

Findings

01. Existing text optimizers are weak and costly, making adaptive attacks challenging.

02. Filtering and preprocessing defenses show promise in improving LLM robustness.

03. Further research is needed to develop stronger optimizers or defenses in the LLM domain.

Abstract

As Large Language Models quickly become ubiquitous, it becomes critical to understand their security vulnerabilities. Recent work shows that text optimizers can produce jailbreaking prompts that bypass moderation and alignment. Drawing from the rich body of work on adversarial machine learning, we approach these attacks with three questions: What threat models are practically useful in this domain? How do baseline defense techniques perform in this new domain? How does LLM security differ from computer vision? We evaluate several baseline defense strategies against leading adversarial attacks on LLMs, discussing the various settings in which each is feasible and effective. Particularly, we look at three types of defenses: detection (perplexity based), input preprocessing (paraphrase and retokenization), and adversarial training. We discuss white-box and gray-box settings and discuss…
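
As a rough illustration of the perplexity-based detection idea named in the abstract, the sketch below flags a prompt whose perplexity exceeds a cutoff. It is not the authors' implementation: the function names, the example log-probabilities, and the threshold of 100 are all hypothetical, and it assumes per-token natural-log probabilities have already been obtained from some language model.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood per token)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def is_suspicious(token_logprobs: Sequence[float], threshold: float) -> bool:
    """Flag a prompt whose perplexity exceeds the (hypothetical) threshold."""
    return perplexity(token_logprobs) > threshold


# Made-up natural-log probabilities, for illustration only.
fluent_prompt = [-1.2, -0.8, -1.5, -0.9]        # reads like ordinary text
optimized_suffix = [-7.5, -9.1, -8.3, -6.8]     # reads like token soup

THRESHOLD = 100.0  # assumption: in practice this would be tuned on benign prompts

print(is_suspicious(fluent_prompt, THRESHOLD))
print(is_suspicious(optimized_suffix, THRESHOLD))
```

The intuition is that suffixes produced by a discrete text optimizer tend to look unlike natural language, so a model assigns them low probability and the resulting perplexity is high. Where to set the cutoff is a trade-off between missed attacks and false positives on benign prompts.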

Code & Models

Repositories

neelsjain/baseline-defenses
pytorch

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Natural Language Processing Techniques