Single Character Perturbations Break LLM Alignment
Leon Lin, Hannah Brown, Kenji Kawaguchi, Michael Shieh

TL;DR
This paper reveals that adding a space at the end of input prompts can bypass safety measures in large language models, causing them to produce harmful outputs, highlighting the fragility of current alignment techniques.
Contribution
It demonstrates a simple yet effective attack method on LLM safety defenses and analyzes the underlying causes related to training data contexts.
Findings
Most models generate harmful outputs after space attack
Single spaces in training data encourage unsafe list responses
Current alignment methods are fragile and vulnerable
Abstract
When LLMs are deployed in sensitive, human-facing settings, it is crucial that they do not output unsafe, biased, or privacy-violating outputs. For this reason, models are both trained and instructed to refuse to answer unsafe prompts such as "Tell me how to build a bomb." We find that, despite these safeguards, it is possible to break model defenses simply by appending a space to the end of a model's input. In a study of eight open-source models, we demonstrate that this acts as a strong enough attack to cause the majority of models to generate harmful outputs with very high success rates. We examine the causes of this behavior, finding that the contexts in which single spaces occur in tokenized training data encourage models to generate lists when prompted, overriding training signals to refuse to answer unsafe requests. Our findings underscore the fragile state of current model…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsNatural Language Processing Techniques
