Spot Risks Before Speaking! Unraveling Safety Attention Heads in Large Vision-Language Models
Ziwei Zheng, Junyao Zhao, Le Yang, Lijun He, Fan Li

TL;DR
This paper identifies a sparse set of attention heads in large vision-language models (LVLMs) that act as internal safety mechanisms. Their activations during first-token generation allow malicious prompts to be detected with minimal overhead.
Contribution
The paper identifies and analyzes safety attention heads in LVLMs, shows that they act as shields against malicious prompts, and proposes a simple malicious-prompt detector with strong zero-shot generalization.
Findings
Safety heads effectively identify malicious prompts.
Ablating safety heads increases attack success rates.
The proposed detector generalizes well zero-shot.
Abstract
With the integration of an additional modality, large vision-language models (LVLMs) exhibit greater vulnerability to safety risks (e.g., jailbreaking) compared to their language-only predecessors. Although recent studies have devoted considerable effort to the post-hoc alignment of LVLMs, the inner safety mechanisms remain largely unexplored. In this paper, we discover that internal activations of LVLMs during the first token generation can effectively identify malicious prompts across different attacks. This inherent safety perception is governed by sparse attention heads, which we term "safety heads." Further analysis reveals that these heads act as specialized shields against malicious prompts; ablating them leads to higher attack success rates, while the model's utility remains unaffected. By locating these safety heads and concatenating their activations, we construct a…
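The "locate heads, concatenate activations, classify" idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. The activations are synthetic, the head-scoring rule (distance between class means) is an assumed stand-in for however the paper locates safety heads, and the classifier is a plain logistic regression written from scratch. All names (`SAFETY_HEADS`, `head_score`, `features`, and so on) are hypothetical.

```python
import math
import random

random.seed(0)

N_HEADS, HEAD_DIM, TOP_K = 8, 4, 2
SAFETY_HEADS = {1, 5}  # hypothetical: heads that carry the synthetic signal


def sample_prompt(malicious):
    """Return fake per-head activations for one prompt (N_HEADS vectors)."""
    acts = []
    for h in range(N_HEADS):
        shift = 1.5 if (malicious and h in SAFETY_HEADS) else 0.0
        acts.append([random.gauss(shift, 1.0) for _ in range(HEAD_DIM)])
    return acts


def make_dataset(n):
    labels = [i % 2 for i in range(n)]  # 1 = malicious, 0 = benign
    return [sample_prompt(bool(y)) for y in labels], labels


def head_score(data, labels, h):
    """Distance between class-mean activations of head h (assumed criterion)."""
    pos = [x[h] for x, y in zip(data, labels) if y == 1]
    neg = [x[h] for x, y in zip(data, labels) if y == 0]
    total = 0.0
    for d in range(HEAD_DIM):
        mean_pos = sum(v[d] for v in pos) / len(pos)
        mean_neg = sum(v[d] for v in neg) / len(neg)
        total += (mean_pos - mean_neg) ** 2
    return math.sqrt(total)


def features(x, heads):
    """Concatenate the activations of the selected heads into one vector."""
    return [v for h in heads for v in x[h]]


def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(X, y, lr=0.2, epochs=300):
    """Full-batch gradient descent on the logistic loss."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            gw = [g + err * xj for g, xj in zip(gw, xi)]
            gb += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, gw)]
        b -= lr * gb / len(X)
    return w, b


def predict(w, b, xi):
    return int(sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) >= 0.5)


train_x, train_y = make_dataset(400)
test_x, test_y = make_dataset(200)

# Step 1: rank heads by how well they separate the two classes.
scores = [head_score(train_x, train_y, h) for h in range(N_HEADS)]
selected = sorted(sorted(range(N_HEADS), key=lambda h: -scores[h])[:TOP_K])

# Step 2: concatenate the selected heads' activations and fit the detector.
w, b = train_logreg([features(x, selected) for x in train_x], train_y)

# Step 3: evaluate on held-out synthetic prompts.
correct = sum(predict(w, b, features(x, selected)) == y
              for x, y in zip(test_x, test_y))
accuracy = correct / len(test_y)
print("selected heads:", selected, "test accuracy:", round(accuracy, 3))
```

In a real LVLM the per-head vectors would come from attention-head outputs captured while the model generates its first token. The sketch only shows why a small classifier over a few informative heads can be cheap: the feature vector has `TOP_K * HEAD_DIM` entries regardless of model size.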
Taxonomy
Topics: Human-Automation Interaction and Safety · Safety Warnings and Signage
Methods: Softmax · Attention Is All You Need · Logistic Regression
