Safety Alignment Should Be Made More Than Just A Few Attention Heads
Chao Huang, Zefeng Zhang, Juewei Yue, Quangang Li, Chuang Zhang, Tingwen Liu

TL;DR
This paper reveals that safety mechanisms in large language models rely heavily on a few attention heads, making them vulnerable to attacks, and proposes a training method to distribute safety functions more broadly for improved robustness.
Contribution
The paper introduces RDSHA for identifying safety-critical attention heads and AHD, a training strategy to distribute safety behaviors across many heads, enhancing model robustness.
Findings
Ablation of key attention heads compromises safety.
AHD distributes safety behaviors across more heads.
Models with AHD resist jailbreak attacks better.
Abstract
Current safety alignment for large language models (LLMs) continues to present vulnerabilities, given that adversarial prompting can effectively bypass their safety measures. Our investigation shows that these safety mechanisms predominantly depend on a limited subset of attention heads: removing or ablating these heads can severely compromise model safety. To identify and evaluate these safety-critical components, we introduce RDSHA, a targeted ablation method that leverages the model's refusal direction to pinpoint attention heads mostly responsible for safety behaviors. Further analysis shows that existing jailbreak attacks exploit this concentration by selectively bypassing or manipulating these critical attention heads. To address this issue, we propose AHD, a novel training strategy designed to promote the distributed encoding of safety-related behaviors across numerous attention…
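The idea of using a refusal direction to find safety-critical heads can be illustrated with a toy sketch. This is not the paper's RDSHA procedure, whose details are not given in the abstract. It assumes a common difference-of-means estimate of the refusal direction and a simple projection score per head, and it runs on synthetic activations in which heads 2 and 5 are planted as the "safety" heads. All function names, shapes, and constants are hypothetical.

```python
import numpy as np

def refusal_direction(harmful_resid, harmless_resid):
    """Assumed estimator: unit vector from mean harmless to mean harmful activation."""
    diff = harmful_resid.mean(axis=0) - harmless_resid.mean(axis=0)
    return diff / np.linalg.norm(diff)

def score_heads(head_outputs, direction):
    """Mean projection of each head's output onto the direction.

    head_outputs has shape (n_prompts, n_heads, d_model).
    """
    return (head_outputs @ direction).mean(axis=0)

def ablate_heads(head_outputs, heads):
    """Zero-ablate the chosen heads and return a modified copy."""
    out = head_outputs.copy()
    out[:, heads, :] = 0.0
    return out

# Synthetic stand-in for real model activations (purely illustrative).
rng = np.random.default_rng(0)
n_prompts, n_heads, d_model = 64, 8, 16
planted = [2, 5]                      # hypothetical safety-critical heads
u = rng.normal(size=d_model)
u /= np.linalg.norm(u)                # hidden "true" refusal direction

harmless = 0.1 * rng.normal(size=(n_prompts, n_heads, d_model))
harmful = 0.1 * rng.normal(size=(n_prompts, n_heads, d_model))
harmful[:, planted, :] += 3.0 * u     # only the planted heads write the refusal signal

# The toy residual stream is the sum of all head outputs.
direction = refusal_direction(harmful.sum(axis=1), harmless.sum(axis=1))

scores = score_heads(harmful, direction)
top_heads = np.argsort(-scores)[:2]   # two highest-scoring heads

# Refusal signal in the residual stream before and after ablating those heads.
before = (harmful.sum(axis=1) @ direction).mean()
after = (ablate_heads(harmful, top_heads).sum(axis=1) @ direction).mean()
print(sorted(top_heads.tolist()), round(float(before), 2), round(float(after), 2))
```

In this toy setting the refusal signal is concentrated in two heads, so ablating them removes almost all of it. That is the kind of concentration the abstract reports as a vulnerability. On a real model the per-head outputs would have to come from hooks on the attention layers, which this sketch does not attempt.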
