Spot Risks Before Speaking! Unraveling Safety Attention Heads in Large Vision-Language Models

Ziwei Zheng; Junyao Zhao; Le Yang; Lijun He; Fan Li

arXiv:2501.02029·cs.LG·January 7, 2025

Open Access

TL;DR

This paper uncovers specific attention heads in large vision-language models that serve as internal safety mechanisms, enabling effective detection of malicious prompts with minimal overhead, thus enhancing model safety.

Contribution

It identifies and analyzes safety attention heads in LVLMs, demonstrating their role as shields and proposing a simple detector for malicious prompts with strong zero-shot generalization.

Findings

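01 Activations of safety heads during first-token generation identify malicious prompts across different attacks.

02 Ablating safety heads increases attack success rates, while the model's utility remains unaffected.

03 The proposed detector generalizes well zero-shot.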
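
To make the ablation finding concrete, here is a minimal sketch of what "ablating" an attention head can mean. It assumes the common approach of zeroing a head's output before the per-head outputs are concatenated for the attention output projection. The paper's exact ablation procedure is not stated on this page, and the function names and toy values below are hypothetical.

```python
def ablate_heads(head_outputs, heads_to_ablate):
    """Zero the output vectors of the chosen heads, leaving the others untouched."""
    return [
        [0.0] * len(vec) if idx in heads_to_ablate else list(vec)
        for idx, vec in enumerate(head_outputs)
    ]


def merge_heads(head_outputs):
    """Concatenate per-head outputs, as done before the attention output projection."""
    return [value for vec in head_outputs for value in vec]


# Toy example: 3 heads with head_dim = 2 (values are made up for illustration).
head_outputs = [[0.2, -0.1], [1.3, 0.7], [-0.4, 0.5]]

# Ablate head 1 only; heads 0 and 2 pass through unchanged.
ablated = merge_heads(ablate_heads(head_outputs, {1}))
print(ablated)  # [0.2, -0.1, 0.0, 0.0, -0.4, 0.5]
```

In a real model this zeroing would be applied inside the attention layer (for example through a forward hook), and the attack success rate would then be compared with and without the ablation.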

Abstract

With the integration of an additional modality, large vision-language models (LVLMs) exhibit greater vulnerability to safety risks (e.g., jailbreaking) compared to their language-only predecessors. Although recent studies have devoted considerable effort to the post-hoc alignment of LVLMs, the inner safety mechanisms remain largely unexplored. In this paper, we discover that internal activations of LVLMs during the first token generation can effectively identify malicious prompts across different attacks. This inherent safety perception is governed by sparse attention heads, which we term "safety heads." Further analysis reveals that these heads act as specialized shields against malicious prompts; ablating them leads to higher attack success rates, while the model's utility remains unaffected. By locating these safety heads and concatenating their activations, we construct a…
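
The idea of concatenating safety-head activations and feeding them to a simple classifier can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the abstract is truncated here, so the choice of logistic regression is taken from the Methods taxonomy on this page, the head indices in `SAFETY_HEADS` are hypothetical, and the activations are synthetic rather than extracted from an LVLM.

```python
import math
import random

# Hypothetical (layer, head) indices standing in for located safety heads.
SAFETY_HEADS = [(0, 1), (1, 0)]
ALL_HEADS = [(0, 0), (0, 1), (1, 0), (1, 1)]


def concat_head_activations(head_acts, head_ids):
    """Concatenate the activation vectors of the selected heads into one feature vector."""
    features = []
    for key in head_ids:
        features.extend(head_acts[key])
    return features


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic_regression(X, y, lr=0.5, epochs=300):
    """Fit weights and bias with plain batch gradient descent on the logistic loss."""
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, label in zip(X, y):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - label
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_malicious(w, b, x, threshold=0.5):
    """Return True when the predicted probability of 'malicious' reaches the threshold."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= threshold


def synthetic_prompt(malicious, rng, head_dim=4):
    """Fake first-token activations: only the assumed safety heads shift with the label."""
    shift = 1.5 if malicious else -1.5
    return {
        key: [rng.gauss(shift if key in SAFETY_HEADS else 0.0, 0.5) for _ in range(head_dim)]
        for key in ALL_HEADS
    }


rng = random.Random(0)
labels = [i % 2 for i in range(200)]  # 1 = malicious, 0 = benign
X = [concat_head_activations(synthetic_prompt(bool(l), rng), SAFETY_HEADS) for l in labels]

# Train on the first 150 synthetic prompts, evaluate on the remaining 50.
w, b = fit_logistic_regression(X[:150], labels[:150])
correct = sum(predict_malicious(w, b, x) == bool(l) for x, l in zip(X[150:], labels[150:]))
accuracy = correct / 50
print(f"held-out accuracy on synthetic data: {accuracy:.2f}")
```

The synthetic data is separable by construction, so the accuracy printed here says nothing about real-world performance; the point is only the data flow from per-head activations to a concatenated feature vector to a linear classifier.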

Taxonomy

Topics: Human-Automation Interaction and Safety · Safety Warnings and Signage

Methods: Softmax · Attention Is All You Need · Logistic Regression