Shaping the Safety Boundaries: Understanding and Defending Against Jailbreaks in Large Language Models

Lang Gao; Jiahui Geng; Xiangliang Zhang; Preslav Nakov; Xiuying Chen

arXiv:2412.17034 · cs.CL · May 22, 2025

TL;DR

This paper analyzes jailbreak methods in Large Language Models, introduces the concept of safety boundaries, and proposes Activation Boundary Defense (ABD) to effectively prevent harmful outputs with minimal impact on performance.

Contribution

It provides a large-scale analysis of jailbreak techniques, introduces the safety boundary concept, and develops a novel adaptive defense method called ABD.

Findings

01

Jailbreaks shift harmful activations outside the safety boundary.

02

Low and middle layers are critical in activation shifts.

03

ABD achieves over 98% defense success with minimal performance impact.
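
As a rough illustration of the first finding, the toy sketch below treats a "safety boundary" as a per-dimension interval fitted to reference activations, then pulls an out-of-boundary activation back inside. This is not the paper's ABD implementation. The box-shaped boundary, the two-dimensional vectors, and the names `fit_boundary`, `is_inside`, and `clamp_to_boundary` are all assumptions made for illustration only.

```python
from typing import List, Tuple

Vector = List[float]


def fit_boundary(samples: List[Vector], margin: float = 0.0) -> Tuple[Vector, Vector]:
    """Fit a per-dimension [min - margin, max + margin] box to reference activations.

    Assumption: the boundary is modeled as an axis-aligned box. This is a
    simplification for illustration, not the boundary definition from the paper.
    """
    dims = range(len(samples[0]))
    lower = [min(s[d] for s in samples) - margin for d in dims]
    upper = [max(s[d] for s in samples) + margin for d in dims]
    return lower, upper


def is_inside(x: Vector, lower: Vector, upper: Vector) -> bool:
    """Return True if every coordinate of x lies within the box."""
    return all(lo <= v <= hi for v, lo, hi in zip(x, lower, upper))


def clamp_to_boundary(x: Vector, lower: Vector, upper: Vector) -> Vector:
    """Project x onto the box by clamping each coordinate independently."""
    return [min(max(v, lo), hi) for v, lo, hi in zip(x, lower, upper)]


# Hypothetical 2-D "activations" of plain harmful prompts (toy numbers).
harmful_refs = [[1.0, 2.0], [1.5, 2.5], [0.8, 1.9]]
lower, upper = fit_boundary(harmful_refs)

# Hypothetical activation of a jailbreak prompt, shifted outside the box.
jailbroken = [3.0, 2.2]
constrained = clamp_to_boundary(jailbroken, lower, upper)
```

In this toy setting, `jailbroken` fails the `is_inside` check while `constrained` passes it. A real defense would act on hidden states of specific layers rather than on hand-written vectors.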

Abstract

Jailbreaking in Large Language Models (LLMs) is a major security concern, as it can deceive LLMs into generating harmful text. Yet there is still insufficient understanding of how jailbreaking works, which makes it hard to develop effective defense strategies. We aim to shed more light on this issue: we conduct a detailed large-scale analysis of seven different jailbreak methods and find that disagreements about how jailbreaking works stem from insufficient observation samples. In particular, we introduce the safety boundary, and we find that jailbreaks shift harmful activations outside that safety boundary, where LLMs are less sensitive to harmful information. We also find that the low and middle layers are critical in such shifts, while deeper layers have less impact. Leveraging these insights, we propose a novel defense called Activation Boundary Defense (ABD), which adaptively…

Taxonomy

Topics: Artificial Intelligence in Law · Ethics and Social Impacts of AI · Digital and Cyber Forensics