Visual Self-Fulfilling Alignment: Shaping Safety-Oriented Personas via Threat-Related Images

Qishun Yang; Shu Yang; Lijie Hu; Di Wang

arXiv:2603.08486 · cs.CV · April 16, 2026

TL;DR

This paper introduces Visual Self-Fulfilling Alignment (VSFA), a method that fine-tunes vision-language models on neutral VQA tasks built around threat-related images, promoting safety-oriented responses without explicit safety labels.

Contribution

The paper extends the self-fulfilling mechanism to the visual modality: models develop safety-oriented personas through exposure to threat-related images, with no explicit safety labels.

Findings

1. VSFA reduces the attack success rate on safety benchmarks.

2. Fine-tuned models show improved response quality and reduced over-refusal.

3. The approach preserves the general capabilities of vision-language models.
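
Attack success rate and over-refusal, the two quantities named in the findings, are both simple proportions over a set of model responses. The sketch below shows how they are commonly computed, assuming a judge that labels each response. The function names and the keyword-based toy judges are hypothetical stand-ins for illustration; they are not the paper's evaluation protocol, which this summary does not specify.

```python
from typing import Callable, Sequence


def attack_success_rate(
    responses: Sequence[str], is_harmful: Callable[[str], bool]
) -> float:
    """Fraction of responses to attack prompts that the judge flags as harmful."""
    if not responses:
        raise ValueError("need at least one response")
    return sum(1 for r in responses if is_harmful(r)) / len(responses)


def over_refusal_rate(
    responses: Sequence[str], is_refusal: Callable[[str], bool]
) -> float:
    """Fraction of responses to benign prompts that the judge flags as refusals."""
    if not responses:
        raise ValueError("need at least one response")
    return sum(1 for r in responses if is_refusal(r)) / len(responses)


# Toy judges (assumption): real evaluations typically use a classifier
# or an LLM-based judge rather than string matching.
def toy_is_harmful(response: str) -> bool:
    return response.startswith("Sure")


def toy_is_refusal(response: str) -> bool:
    return response.lower().startswith("i can't")


attack_responses = ["Sure, here is how", "I can't help with that."]
benign_responses = ["I can't help with that.", "Here is a recipe.", "Of course."]

asr = attack_success_rate(attack_responses, toy_is_harmful)
orr = over_refusal_rate(benign_responses, toy_is_refusal)
```

A lower attack success rate on adversarial prompts together with a lower over-refusal rate on benign prompts is the combination the findings describe.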

Abstract

Multimodal large language models (MLLMs) face safety misalignment, where visual inputs enable harmful outputs. To address this, existing methods require explicit safety labels or contrastive data; yet, threat-related concepts are concrete and visually depictable, while safety concepts, like helpfulness, are abstract and lack visual referents. Inspired by the Self-Fulfilling mechanism underlying emergent misalignment, we propose Visual Self-Fulfilling Alignment (VSFA). VSFA fine-tunes vision-language models (VLMs) on neutral VQA tasks constructed around threat-related images, without any safety labels. Through repeated exposure to threat-related visual content, models internalize the implicit semantics of vigilance and caution, shaping safety-oriented personas. Experiments across multiple VLMs and safety benchmarks demonstrate that VSFA reduces the attack success rate, improves response…
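
To make the "neutral VQA, no safety labels" idea concrete, here is a minimal sketch of what one fine-tuning record might look like. Everything in it is an assumption for illustration: the `VQASample` type, the field names, the placeholder paths, questions, and answers, and the JSONL serialization are hypothetical and are not taken from the paper. The only point it illustrates is that each record carries an image, a question, and an answer, with no field marking content as safe or unsafe.

```python
import json
from dataclasses import asdict, dataclass
from typing import Iterable


@dataclass(frozen=True)
class VQASample:
    """One neutral VQA record (hypothetical schema). Note: no safety label field."""

    image_path: str
    question: str
    answer: str


def to_jsonl(samples: Iterable[VQASample]) -> str:
    """Serialize samples as one JSON object per line."""
    return "\n".join(json.dumps(asdict(s)) for s in samples)


# Placeholder records (assumption): real data would pair threat-related
# images with ordinary descriptive questions and answers.
samples = [
    VQASample(
        image_path="images/placeholder_001.jpg",
        question="What objects are visible in the image?",
        answer="placeholder answer",
    ),
    VQASample(
        image_path="images/placeholder_002.jpg",
        question="Describe the scene in one sentence.",
        answer="placeholder answer",
    ),
]

jsonl = to_jsonl(samples)
```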
