Mind the Inconspicuous: Revealing the Hidden Weakness in Aligned LLMs' Refusal Boundaries

Jiahao Yu; Haozheng Luo; Jerry Yao-Chieh Hu; Wenbo Guo; Han Liu; Xinyu Xing

arXiv:2405.20653·cs.AI·June 18, 2025·1 cites

Mind the Inconspicuous: Revealing the Hidden Weakness in Aligned LLMs' Refusal Boundaries

Jiahao Yu, Haozheng Luo, Jerry Yao-Chieh Hu, Wenbo Guo, Han Liu, Xinyu Xing

PDF

Open Access

TL;DR

This paper uncovers a hidden vulnerability in aligned LLMs where appending eos tokens can cause context segmentation, significantly boosting jailbreak attack success rates and exposing weaknesses in current alignment and filtering methods.

Contribution

It reveals a novel vulnerability called context segmentation caused by eos tokens and proposes methods to enhance defenses against jailbreak attacks in large language models.

Findings

01

Appending eos tokens increases attack success rates across multiple jailbreak techniques.

02

Major API providers do not filter eos tokens, making their models vulnerable.

03

The vulnerability can be exploited to bypass existing content filtering and alignment defenses.

Abstract

Recent advances in Large Language Models (LLMs) have led to impressive alignment where models learn to distinguish harmful from harmless queries through supervised finetuning (SFT) and reinforcement learning from human feedback (RLHF). In this paper, we reveal a subtle yet impactful weakness in these aligned models. We find that simply appending multiple end of sequence (eos) tokens can cause a phenomenon we call context segmentation, which effectively shifts both harmful and benign inputs closer to the refusal boundary in the hidden space. Building on this observation, we propose a straightforward method to BOOST jailbreak attacks by appending eos tokens. Our systematic evaluation shows that this strategy significantly increases the attack success rate across 8 representative jailbreak techniques and 16 open-source LLMs, ranging from 2B to 72B parameters. Moreover, we develop a novel…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsAdversarial Robustness in Machine Learning · Digital and Cyber Forensics · Hate Speech and Cyberbullying Detection