Self-Aware Safety Augmentation: Leveraging Internal Semantic Understanding to Enhance Safety in Vision-Language Models
Wanying Wang, Zeyu Ma, Han Zheng, Xin Tan, Mingang Chen

TL;DR
This paper introduces Self-Aware Safety Augmentation (SASA), a method that uses internal semantic understanding of vision-language models to improve safety recognition without fine-tuning, based on insights into their internal safety perception and semantic capabilities.
Contribution
The paper analyzes the internal dynamics of safety perception in large vision-language models (LVLMs) and proposes SASA, a training-free technique that enhances safety by projecting semantic representations onto safety layers.
Findings
SASA significantly improves model safety across multiple datasets.
Safety perception often precedes semantic understanding in models.
SASA maintains utility with minimal performance impact.
Abstract
Large vision-language models (LVLMs) are more vulnerable to harmful input than their language-only backbones. We investigate this vulnerability by exploring LVLMs' internal dynamics, framing their inherent safety understanding in terms of three key capabilities: safety perception, semantic understanding, and alignment for linguistic expression. We experimentally pinpoint the primary locations of these capabilities within the model architecture. The results indicate that safety perception often emerges before comprehensive semantic understanding, which reduces safety. Motivated by these findings, we propose Self-Aware Safety Augmentation (SASA), a technique that projects informative semantic representations from intermediate layers onto earlier safety-oriented layers. This approach leverages the model's inherent semantic understanding…
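To make the idea of feeding later-layer semantics back to an earlier safety-oriented layer more concrete, here is a minimal illustrative sketch. It is not the paper's projection operator, which the truncated abstract does not specify. It assumes a simple norm-preserving additive blend of two hidden-state vectors, followed by a logistic linear probe. The names `blend_into_safety_layer`, `probe_score`, and the mixing weight `alpha` are all hypothetical.

```python
import math


def blend_into_safety_layer(h_safety, h_semantic, alpha=1.0):
    """Blend a later-layer (semantic) hidden state into an earlier-layer one.

    ASSUMPTION: an additive blend rescaled to the original norm stands in
    for the paper's projection step; the real operator may differ.
    """
    if len(h_safety) != len(h_semantic):
        raise ValueError("hidden states must have the same dimension")
    mixed = [s + alpha * m for s, m in zip(h_safety, h_semantic)]
    mixed_norm = math.sqrt(sum(x * x for x in mixed))
    if mixed_norm == 0.0:
        return mixed
    # Rescale so the augmented state keeps the magnitude of the original.
    scale = math.sqrt(sum(x * x for x in h_safety)) / mixed_norm
    return [x * scale for x in mixed]


def probe_score(h, w, b=0.0):
    """Logistic linear probe: a hypothetical estimate that an input is unsafe."""
    z = sum(x * y for x, y in zip(h, w)) + b
    return 1.0 / (1.0 + math.exp(-z))


# Toy usage with made-up two-dimensional hidden states and probe weights.
augmented = blend_into_safety_layer([3.0, 4.0], [0.0, 1.0], alpha=1.0)
score = probe_score(augmented, w=[0.1, 0.2])
```

In a real model the two vectors would come from hidden states captured at the relevant layers, and the probe weights would be fitted on labeled safe and unsafe inputs; both are left out here because the source does not describe them.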
