Depth Charge: Jailbreak Large Language Models from Deep Safety Attention Heads

Jinman Wu; Yi Xie; Shiqian Zhao; Xiaofeng Chen

arXiv:2603.05772·cs.CR·March 16, 2026

TL;DR

This paper introduces SAHA, an attack framework that targets deep attention heads in large language models and exposes vulnerabilities that make jailbreak attacks more effective at eliciting unsafe content.

Contribution

SAHA is the first attention-head-level jailbreak method that identifies and exploits vulnerabilities in deeper, less aligned attention layers of large language models.

Findings

01. SAHA improves attack success rate by 14% over state-of-the-art methods.

02. Deeper attention layers are more vulnerable to jailbreak attacks.

03. Boundary-aware perturbation effectively probes unsafe content generation.

Abstract

Open-source large language models (OSLLMs) have demonstrated remarkable generative performance. However, because their structure and weights are public, they remain exposed to jailbreak attacks even after alignment. Existing attacks operate primarily at shallow levels, such as the prompt or embedding level, and often fail to expose vulnerabilities rooted in deeper model components, which creates a false sense of security about the effectiveness of defenses. In this paper, we propose Safety Attention Head Attack (SAHA), an attention-head-level jailbreak framework that explores the vulnerability in deeper but insufficiently aligned attention heads. SAHA contains two novel designs. First, we reveal that deeper attention layers introduce more vulnerability to jailbreak attacks. Based on this…

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Advanced Malware Detection Techniques · Hate Speech and Cyberbullying Detection