X-Boundary: Establishing Exact Safety Boundary to Shield LLMs from Multi-Turn Jailbreaks without Compromising Usability
Xiaoya Lu, Dongrui Liu, Yi Yu, Luxin Xu, Jing Shao

TL;DR
This paper introduces X-Boundary, a novel method to precisely distinguish and erase harmful feature representations in LLMs, effectively defending against multi-turn jailbreaks without sacrificing usability or capabilities.
Contribution
X-Boundary establishes an exact safety boundary in LLMs, improving robustness against jailbreaks while preserving general capabilities and reducing over-refusal issues.
Findings
Achieves state-of-the-art defense performance against multi-turn jailbreaks.
Reduces the over-refusal rate by about 20%.
Maintains nearly complete general capability.
Abstract
Despite the rapid development of safety alignment techniques for LLMs, defending against multi-turn jailbreaks is still a challenging task. In this paper, we conduct a comprehensive comparison, revealing that some existing defense methods can improve the robustness of LLMs against multi-turn jailbreaks but compromise usability, i.e., reducing general capabilities or causing the over-refusal problem. From the perspective of mechanistic interpretability of LLMs, we discover that these methods fail to establish a boundary that exactly distinguishes safe and harmful feature representations. Therefore, boundary-safe representations close to harmful representations are inevitably disrupted, leading to a decline in usability. To address this issue, we propose X-Boundary to push harmful representations away from boundary-safe representations and obtain an exact distinction boundary. In this way,…
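The idea of pushing harmful representations away from boundary-safe ones can be illustrated with a toy example. The sketch below is not the paper's objective: it assumes a simple hinge penalty on Euclidean distance, and the function name `separation_penalty`, the `margin` parameter, and the 2-D vectors are all hypothetical choices made for illustration.

```python
import math


def separation_penalty(harmful, boundary_safe, margin=1.0):
    """Mean hinge penalty over all (harmful, boundary-safe) pairs.

    Assumption for illustration only: a pair contributes
    max(0, margin - distance), so the penalty reaches zero once every
    harmful vector is at least `margin` away from every boundary-safe
    vector. The paper's actual loss may differ.
    """
    total = 0.0
    for h in harmful:
        for s in boundary_safe:
            total += max(0.0, margin - math.dist(h, s))
    return total / (len(harmful) * len(boundary_safe))


# Toy 2-D "representations" (hypothetical data).
harmful = [[0.0, 0.0], [3.0, 4.0]]
boundary_safe = [[0.6, 0.8]]

# The first harmful vector sits inside the margin and is penalised;
# the second is already far enough away and contributes nothing.
print(separation_penalty(harmful, boundary_safe, margin=2.0))
```

Minimising a penalty of this kind only moves harmful vectors that lie within the margin of a boundary-safe vector, which is one way to picture separating the two groups without disturbing representations that are already far apart.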
Taxonomy
Topics: Digital and Cyber Forensics · Software Testing and Debugging Techniques · Adversarial Robustness in Machine Learning
