Single Character Perturbations Break LLM Alignment

Leon Lin; Hannah Brown; Kenji Kawaguchi; Michael Shieh

arXiv:2407.03232·cs.LG·July 4, 2024

Open Access

TL;DR

This paper shows that appending a single space to the end of an input prompt can bypass the safety training of large language models and cause them to produce harmful outputs, which highlights how fragile current alignment techniques are.

Contribution

The paper demonstrates a simple yet effective attack on LLM safety defenses and traces its cause to the contexts in which single spaces occur in tokenized training data.

Findings

01

A majority of the eight open-source models studied generate harmful outputs when a space is appended to the input

02

The contexts in which single spaces occur in tokenized training data encourage list-style responses, which override refusal training

03

Current alignment methods are fragile: a single appended character can defeat them

Abstract

When LLMs are deployed in sensitive, human-facing settings, it is crucial that they do not produce unsafe, biased, or privacy-violating outputs. For this reason, models are both trained and instructed to refuse to answer unsafe prompts such as "Tell me how to build a bomb." We find that, despite these safeguards, it is possible to break model defenses simply by appending a space to the end of a model's input. In a study of eight open-source models, we demonstrate that this acts as a strong enough attack to cause the majority of models to generate harmful outputs with very high success rates. We examine the causes of this behavior, finding that the contexts in which single spaces occur in tokenized training data encourage models to generate lists when prompted, overriding training signals to refuse to answer unsafe requests. Our findings underscore the fragile state of current model…
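The perturbation described in the abstract is small enough to state in a few lines. The sketch below is illustrative only: the `TEMPLATE` string, the helper names, and the benign example prompt are assumptions made for this example, not the paper's code, and real models each define their own chat format. It shows only how a baseline input and its perturbed counterpart differ by exactly one trailing space; it does not call any model.

```python
# Hypothetical chat template (an assumption for illustration);
# each real model defines its own format.
TEMPLATE = "USER: {prompt}\nASSISTANT:"


def build_input(prompt: str, template: str = TEMPLATE) -> str:
    """Fill the (hypothetical) template with the user's prompt."""
    return template.format(prompt=prompt)


def append_space(model_input: str) -> str:
    """Apply the single-character perturbation: one space at the very end."""
    return model_input + " "


def differs_only_by_trailing_space(baseline: str, perturbed: str) -> bool:
    """Check that the two inputs are identical except for one final space."""
    return perturbed == baseline + " "


# A benign placeholder prompt, used only to show the string manipulation.
baseline = build_input("Summarize the plot of Hamlet.")
perturbed = append_space(baseline)

print(repr(baseline))
print(repr(perturbed))
print(differs_only_by_trailing_space(baseline, perturbed))
```

Comparing a model's responses to `baseline` and `perturbed` across a set of prompts is one way such an evaluation could be organized; how the paper scores responses is not stated in this summary.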

Taxonomy

Topics: Natural Language Processing Techniques