Shaping the Safety Boundaries: Understanding and Defending Against Jailbreaks in Large Language Models
Lang Gao, Jiahui Geng, Xiangliang Zhang, Preslav Nakov, Xiuying Chen

TL;DR
This paper analyzes jailbreak methods in Large Language Models, introduces the concept of safety boundaries, and proposes Activation Boundary Defense (ABD) to effectively prevent harmful outputs with minimal impact on performance.
Contribution
It provides a large-scale analysis of jailbreak techniques, introduces the safety boundary concept, and develops a novel adaptive defense method called ABD.
Findings
Jailbreaks shift harmful activations outside the safety boundary.
Low and middle layers are critical in activation shifts.
ABD achieves over 98% defense success with minimal performance impact.
Abstract
Jailbreaking in Large Language Models (LLMs) is a major security concern, as it can deceive LLMs into generating harmful text. Yet there is still insufficient understanding of how jailbreaking works, which makes it hard to develop effective defense strategies. We aim to shed more light on this issue: we conduct a detailed large-scale analysis of seven different jailbreak methods and find that these disagreements stem from insufficient observation samples. In particular, we introduce the concept of a safety boundary, and we find that jailbreaks shift harmful activations outside that safety boundary, where LLMs are less sensitive to harmful information. We also find that the low and middle layers are critical in such shifts, while deeper layers have less impact. Leveraging these insights, we propose a novel defense called Activation Boundary Defense (ABD), which adaptively…
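To make the safety-boundary idea concrete, here is a minimal illustrative sketch in Python. It is not the authors' ABD implementation, and the abstract above does not specify how the boundary is constructed. The sketch assumes a per-dimension interval (mean plus or minus k standard deviations) estimated from activations of plain harmful prompts, treats an activation that leaves this interval as shifted, and pulls it back by clipping. The function names, the interval form, the value of `k`, and the synthetic data are all hypothetical.

```python
import numpy as np

def fit_boundary(harmful_acts, k=3.0):
    """Estimate a per-dimension interval from harmful-prompt activations.

    Assumption: the boundary is modelled as mean +/- k * std per dimension.
    This is an illustrative choice, not taken from the paper.
    """
    mean = harmful_acts.mean(axis=0)
    std = harmful_acts.std(axis=0)
    return mean - k * std, mean + k * std

def outside_boundary(act, lower, upper):
    """Return True if any dimension of the activation leaves the interval."""
    return bool(np.any((act < lower) | (act > upper)))

def pull_inside(act, lower, upper):
    """Constrain an activation to the interval by clipping each dimension."""
    return np.clip(act, lower, upper)

# Synthetic stand-in for one layer's activations on 500 harmful prompts.
rng = np.random.default_rng(0)
harmful = rng.normal(0.0, 1.0, size=(500, 8))
lower, upper = fit_boundary(harmful)

# Stand-in for a jailbreak-shifted activation: pushed past the upper bound.
shifted = upper + 5.0
repaired = pull_inside(shifted, lower, upper)
```

In this toy setup `outside_boundary(shifted, lower, upper)` is true before clipping and false afterwards. In a real model the same check would be applied to hidden states at selected layers, and the findings above suggest the low and middle layers are where it would matter most.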
Taxonomy
Topics: Artificial Intelligence in Law · Ethics and Social Impacts of AI · Digital and Cyber Forensics
