Visual Self-Fulfilling Alignment: Shaping Safety-Oriented Personas via Threat-Related Images
Qishun Yang, Shu Yang, Lijie Hu, Di Wang

TL;DR
This paper introduces Visual Self-Fulfilling Alignment (VSFA), a label-free method that fine-tunes vision-language models on threat-related images to promote safety-oriented responses without explicit safety labels.
Contribution
It extends the self-fulfilling mechanism underlying emergent misalignment to the visual modality, so that models develop safety-oriented personas through exposure to threat-related images alone.
Findings
VSFA reduces the attack success rate across multiple VLMs and safety benchmarks.
Fine-tuned models show improved response quality and reduced over-refusal.
The approach preserves the general capabilities of vision-language models.
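For readers unfamiliar with the metric, attack success rate (ASR) is commonly computed as the fraction of adversarial prompts whose response is judged harmful. The sketch below assumes the harmfulness judgments are already available as booleans; this summary does not say which judge the paper uses, and the toy values are purely illustrative, not results from the paper.

```python
from typing import Sequence


def attack_success_rate(judged_harmful: Sequence[bool]) -> float:
    """Return the fraction of adversarial prompts whose response was judged harmful."""
    if not judged_harmful:
        raise ValueError("need at least one judged response")
    return sum(judged_harmful) / len(judged_harmful)


# Illustrative toy judgments for the same four prompts (not data from the paper).
before_finetuning = [True, True, False, True]
after_finetuning = [False, True, False, False]

asr_before = attack_success_rate(before_finetuning)
asr_after = attack_success_rate(after_finetuning)
print(f"ASR before: {asr_before:.2f}, after: {asr_after:.2f}")
```

A lower value after fine-tuning means fewer attacks succeeded on the same prompt set.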
Abstract
Multimodal large language models (MLLMs) face safety misalignment, where visual inputs enable harmful outputs. To address this, existing methods require explicit safety labels or contrastive data; yet, threat-related concepts are concrete and visually depictable, while safety concepts, like helpfulness, are abstract and lack visual referents. Inspired by the Self-Fulfilling mechanism underlying emergent misalignment, we propose Visual Self-Fulfilling Alignment (VSFA). VSFA fine-tunes vision-language models (VLMs) on neutral VQA tasks constructed around threat-related images, without any safety labels. Through repeated exposure to threat-related visual content, models internalize the implicit semantics of vigilance and caution, shaping safety-oriented personas. Experiments across multiple VLMs and safety benchmarks demonstrate that VSFA reduces the attack success rate, improves response…
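The data construction described in the abstract, neutral VQA tasks built around threat-related images with no safety labels, might be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the question templates, the `VQASample` type, and the `answer_fn` callback (standing in for whatever produces descriptive answers, such as an annotator or a captioning model) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Hypothetical neutral question templates: none of them mention safety or harm.
NEUTRAL_QUESTIONS = [
    "What objects are visible in this image?",
    "What colors dominate the scene?",
    "Where does this scene appear to take place?",
]


@dataclass
class VQASample:
    image_path: str  # path to a threat-related image
    question: str    # neutral question about the image content
    answer: str      # descriptive answer; the sample carries no safety label


def build_samples(
    image_paths: Sequence[str],
    answer_fn: Callable[[str, str], str],
) -> List[VQASample]:
    """Pair every image with every neutral question and a descriptive answer."""
    return [
        VQASample(path, question, answer_fn(path, question))
        for path in image_paths
        for question in NEUTRAL_QUESTIONS
    ]


# Usage with a stub answer function and hypothetical file names.
samples = build_samples(
    ["img_001.jpg", "img_002.jpg"],
    lambda path, question: f"Descriptive answer for {path}",
)
print(len(samples))
```

The resulting samples could then be passed to any standard supervised fine-tuning loop for a VLM. The point of the sketch is that the training signal contains only neutral question-answer pairs, with the threat-related content entering solely through the images.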
