On the Role of Attention Heads in Large Language Model Safety
Zhenhong Zhou, Haiyang Yu, Xinghua Zhang, Rongwu Xu, Fei Huang, Kun, Wang, Yang Liu, Junfeng Fang, Yongbin Li

TL;DR
This paper investigates how individual attention heads in large language models influence safety, introducing new metrics and algorithms to identify critical safety-related heads, revealing their significant impact on model safety responses.
Contribution
The paper introduces the Safety Head ImPortant Score (Ships) and the Safety Attention Head AttRibution Algorithm (Sahara) to quantify and attribute safety-critical attention heads in large language models.
Findings
A single safety head affects model safety responses significantly.
Ablating one safety head increases harmful responses by 16 times.
Safety heads tend to function as feature extractors for safety across models.
Abstract
Large language models (LLMs) achieve state-of-the-art performance on multiple language tasks, yet their safety guardrails can be circumvented, leading to harmful generations. In light of this, recent research on safety mechanisms has emerged, revealing that when safety representations or component are suppressed, the safety capability of LLMs are compromised. However, existing research tends to overlook the safety impact of multi-head attention mechanisms, despite their crucial role in various model functionalities. Hence, in this paper, we aim to explore the connection between standard attention mechanisms and safety capability to fill this gap in the safety-related mechanistic interpretability. We propose a novel metric which tailored for multi-head attention, the Safety Head ImPortant Score (Ships), to assess the individual heads' contributions to model safety. Based on this, we…
Peer Reviews
Decision·ICLR 2025 Oral
1. To my knowledge, no other work has attempted to interpret each attention head’s contributions to LLM safety. I'm glad to see such work. 2. The method for assessing the importance of attention heads to safety is intuitive and reasonable.
1. Other previous works, such as those identifying safety-related parameters through probing [1], could also be discussed. 2. There are inconsistencies in notation usage that need to be addressed. To name a few: - In Eq 2, 7 and 8, $d_k$ denotes the model dimension. However in line 297, $d$ is used instead. - In Appendix A.1, $N = d/n$ should be clarified. - Throughout the paper, $L$ and $n$ are used to denote the number of layers and the number of heads, respectively, but Algorithm 1 uses $\ma
- The paper proposes a novel method for mechanistically locating and ablating heads that are important to safety alignment, with greater granularity and less compute than prior methods. The method of head ablation is well motivated and the experiments are detailed, considering course correction (reverting back to safety) as well. - The paper is well written and organized
1. The helpfulness / utility measurement is done with lm-eval, which mostly consists of single-turn question and answering utility measurement. More comprehensive utility measurement would benefit the paper. 2. Sahara uses heuristic to choose group size, and group size is important to how safety capability is affected. Such size heuristics (more than 3) might not hold for different models with different number of parameters. 3. The paper is overall well-written with some small typos: "Bottom.
1. The paper introduces a novel approach to understanding the safety mechanisms within large language models (LLMs) by presenting the Safety Head Important Score (Ships) and the Safety Attention Head AttRibution Algorithm (Sahara). 2. It effectively shifts the focus from generalized model parameters to specific attention heads that have a direct impact on the model's ability to reject harmful queries. 3. This work addresses a significant gap in the literature by systematically exploring the role
1. Lines 256 and Appendix B.3 indicate that the ASR metric used in this paper employs a keyword-detection method, which is noted in [1] as having limitations that “lead to false positive and false negative cases.” Why is the GPT4-judge method, validated in [1] as a more comprehensive and accurate metric, not used? This method is commonly employed in 2024 LLM safety papers to measure ASR. The inaccuracies of ASR based on keyword detection in assessing successful attacks weaken the experimental da
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling
MethodsAttention Is All You Need · Linear Layer · Softmax · Multi-Head Attention · Balanced Selection
