DAVSP: Safety Alignment for Large Vision-Language Models via Deep Aligned Visual Safety Prompt
Yitong Zhang, Jia Li, Liyi Cai, Ge Li

TL;DR
This paper introduces DAVSP, a novel safety alignment method for large vision-language models that uses a visual safety prompt and deep alignment to resist malicious queries while maintaining utility on benign inputs.
Contribution
The paper proposes a safety alignment technique that combines a trainable visual safety prompt with Deep Alignment (supervision in the model's activation space), improving the resistance of LVLMs to malicious queries.
Findings
Effective resistance to malicious queries across five benchmarks
Preserves utility on benign inputs
Exhibits strong cross-model generalization ability
Abstract
Large Vision-Language Models (LVLMs) have achieved impressive progress across various applications but remain vulnerable to malicious queries that exploit the visual modality. Existing alignment approaches typically fail to resist malicious queries effectively while preserving utility on benign ones. To address these challenges, we propose Deep Aligned Visual Safety Prompt (DAVSP), which is built upon two key innovations. First, we introduce the Visual Safety Prompt, which appends a trainable padding region around the input image. It preserves visual features and expands the optimization space. Second, we propose Deep Alignment, a novel approach that trains the visual safety prompt through supervision in the model's activation space. It enhances the inherent ability of LVLMs to perceive malicious queries, achieving deeper alignment than prior work. Extensive experiments across five…
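The Visual Safety Prompt described in the abstract can be illustrated with a minimal sketch. The abstract states only that a trainable padding region is placed around the input image and that the visual features are preserved. Everything else here is an assumption made for illustration: the function names `apply_visual_safety_prompt` and `border_mask`, the pad width, the array layout (height, width, channels), and the use of NumPy. It is not the authors' implementation, and it does not cover Deep Alignment or any training step.

```python
import numpy as np


def apply_visual_safety_prompt(image, prompt, pad):
    """Surround `image` (H, W, C) with a border taken from `prompt`.

    `prompt` must have shape (H + 2*pad, W + 2*pad, C). Only its border is
    kept, so the original image pixels are left unchanged.
    (Hypothetical helper; names and layout are assumptions.)
    """
    h, w, c = image.shape
    expected = (h + 2 * pad, w + 2 * pad, c)
    if prompt.shape != expected:
        raise ValueError(f"prompt shape {prompt.shape} != expected {expected}")
    padded = prompt.copy()
    # Paste the untouched image into the centre of the padded canvas.
    padded[pad:pad + h, pad:pad + w, :] = image
    return padded


def border_mask(h, w, pad):
    """Boolean mask that is True on the padding region and False on the image.

    In a training loop, such a mask could restrict updates to the border.
    """
    mask = np.ones((h + 2 * pad, w + 2 * pad, 1), dtype=bool)
    mask[pad:pad + h, pad:pad + w, :] = False
    return mask


# Toy example: a 4x4 RGB image with an assumed pad width of 2 pixels.
image = np.full((4, 4, 3), 0.5)
prompt = np.zeros((8, 8, 3))  # stand-in for learned border values
padded = apply_visual_safety_prompt(image, prompt, pad=2)
mask = border_mask(4, 4, pad=2)
```

Because the prompt occupies only the border, the image content reaches the model unaltered, and the number of trainable values grows with the pad width rather than being limited to a small perturbation of the image itself.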
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
