X-Boundary: Establishing Exact Safety Boundary to Shield LLMs from Multi-Turn Jailbreaks without Compromising Usability

Xiaoya Lu; Dongrui Liu; Yi Yu; Luxin Xu; Jing Shao

arXiv:2502.09990 · cs.CR · December 29, 2025

Open Access · 1 Repo

TL;DR

This paper introduces X-Boundary, a novel method to precisely distinguish and erase harmful feature representations in LLMs, effectively defending against multi-turn jailbreaks without sacrificing usability or capabilities.

Contribution

X-Boundary establishes an exact safety boundary in LLMs, improving robustness against jailbreaks while preserving general capabilities and reducing over-refusal issues.

Findings

01 X-Boundary achieves state-of-the-art defense performance against multi-turn jailbreaks.

02 It reduces the over-refusal rate by about 20%.

03 It maintains nearly complete general capability.

Abstract

Despite the rapid development of safety alignment techniques for LLMs, defending against multi-turn jailbreaks is still a challenging task. In this paper, we conduct a comprehensive comparison, revealing that some existing defense methods can improve the robustness of LLMs against multi-turn jailbreaks but compromise usability, i.e., reducing general capabilities or causing the over-refusal problem. From the perspective of mechanism interpretability of LLMs, we discover that these methods fail to establish a boundary that exactly distinguishes safe and harmful feature representations. Therefore, boundary-safe representations close to harmful representations are inevitably disrupted, leading to a decline in usability. To address this issue, we propose X-Boundary to push harmful representations away from boundary-safe representations and obtain an exact distinction boundary. In this way,…
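
The idea of pushing harmful representations away from boundary-safe ones can be illustrated with a toy penalty. The sketch below is only an assumption-laden illustration, not the paper's objective: the function names `cosine_similarity` and `separation_penalty`, the use of cosine similarity, and the two-dimensional toy vectors are all hypothetical choices made here. The penalty is positive while harmful vectors point in similar directions to boundary-safe vectors and drops to zero once they no longer do.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def separation_penalty(harmful_reps, boundary_safe_reps):
    """Mean of max(0, cosine similarity) over all harmful / boundary-safe pairs.

    Hypothetical illustration only: minimizing this value would push harmful
    vectors away from the directions occupied by boundary-safe vectors.
    """
    pairs = [(h, s) for h in harmful_reps for s in boundary_safe_reps]
    if not pairs:
        return 0.0
    total = sum(max(0.0, cosine_similarity(h, s)) for h, s in pairs)
    return total / len(pairs)


# Toy feature vectors (made up for this example).
boundary_safe = [[1.0, 0.0]]

# One harmful vector overlaps the boundary-safe direction, one is orthogonal.
overlapping = separation_penalty([[1.0, 0.0], [0.0, 1.0]], boundary_safe)

# After "pushing away", no harmful vector shares the boundary-safe direction.
separated = separation_penalty([[-1.0, 0.0], [0.0, -1.0]], boundary_safe)

print(overlapping, separated)
```

In a real model the vectors would be hidden-state representations rather than hand-written lists, and any such term would have to be balanced against objectives that preserve safe behavior; the abstract excerpt here does not specify those details.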

Code & Models

Repositories

ai45lab/x-boundary (PyTorch, official)

Taxonomy

Topics: Digital and Cyber Forensics · Software Testing and Debugging Techniques · Adversarial Robustness in Machine Learning