Mind the Inconspicuous: Revealing the Hidden Weakness in Aligned LLMs' Refusal Boundaries
Jiahao Yu, Haozheng Luo, Jerry Yao-Chieh Hu, Wenbo Guo, Han Liu, Xinyu Xing

TL;DR
This paper uncovers a hidden vulnerability in aligned LLMs where appending eos tokens can cause context segmentation, significantly boosting jailbreak attack success rates and exposing weaknesses in current alignment and filtering methods.
Contribution
It reveals a novel vulnerability called context segmentation caused by eos tokens and proposes methods to enhance defenses against jailbreak attacks in large language models.
Findings
Appending eos tokens increases attack success rates across multiple jailbreak techniques.
Major API providers do not filter eos tokens, making their models vulnerable.
The vulnerability can be exploited to bypass existing content filtering and alignment defenses.
Abstract
Recent advances in Large Language Models (LLMs) have led to impressive alignment where models learn to distinguish harmful from harmless queries through supervised finetuning (SFT) and reinforcement learning from human feedback (RLHF). In this paper, we reveal a subtle yet impactful weakness in these aligned models. We find that simply appending multiple end of sequence (eos) tokens can cause a phenomenon we call context segmentation, which effectively shifts both harmful and benign inputs closer to the refusal boundary in the hidden space. Building on this observation, we propose a straightforward method to BOOST jailbreak attacks by appending eos tokens. Our systematic evaluation shows that this strategy significantly increases the attack success rate across 8 representative jailbreak techniques and 16 open-source LLMs, ranging from 2B to 72B parameters. Moreover, we develop a novel…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdversarial Robustness in Machine Learning · Digital and Cyber Forensics · Hate Speech and Cyberbullying Detection
