Cross-Modal Safety Alignment: Is textual unlearning all you need?

Trishna Chakraborty; Erfan Shayegani; Zikui Cai; Nael Abu-Ghazaleh; M. Salman Asif; Yue Dong; Amit K. Roy-Chowdhury; Chengyu Song

arXiv:2406.02575 · cs.CL · October 15, 2025 · 2 cites

Open Access

TL;DR

This paper investigates whether unlearning in the textual domain alone can improve the safety of vision-language models, and reports a significant reduction in attack success rates with minimal utility loss.

Contribution

The study shows that textual unlearning in vision-language models effectively reduces attack success rates, offering a simpler alternative to multi-modal safety training.

Findings

01  Textual unlearning reduces the attack success rate to below 8%.

02  Unlearning with multi-modal data offers no benefits and increases computational costs.

03  Safety can be improved without extensive multi-modal dataset collection.
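
For readers unfamiliar with the metric, the attack success rate (ASR) is commonly computed as the fraction of attack prompts that elicit a harmful response. The snippet below is a minimal sketch under that assumption; the function name `attack_success_rate` and the sample outcomes are hypothetical and are not taken from the paper, which may judge success differently.

```python
from typing import Sequence


def attack_success_rate(outcomes: Sequence[bool]) -> float:
    """Return the fraction of attack attempts judged successful."""
    if not outcomes:
        raise ValueError("need at least one attack outcome")
    # True counts as 1 and False as 0, so sum() counts the successes.
    return sum(outcomes) / len(outcomes)


# Hypothetical outcomes: 3 of 50 attack prompts produced a harmful response.
outcomes = [True] * 3 + [False] * 47
asr = attack_success_rate(outcomes)
print(f"ASR = {asr:.1%}")  # prints "ASR = 6.0%"
```

Under this definition, an ASR below 8% means that fewer than 8 in every 100 attack prompts succeed.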

Abstract

Recent studies reveal that integrating new modalities into Large Language Models (LLMs), such as Vision-Language Models (VLMs), creates a new attack surface that bypasses existing safety training techniques like Supervised Fine-tuning (SFT) and Reinforcement Learning with Human Feedback (RLHF). While further SFT and RLHF-based safety training can be conducted in multi-modal settings, collecting multi-modal training datasets poses a significant challenge. Inspired by the structural design of recent multi-modal models, where, regardless of the combination of input modalities, all inputs are ultimately fused into the language space, we aim to explore whether unlearning solely in the textual domain can be effective for cross-modality safety alignment. Our evaluation across six datasets empirically demonstrates the transferability -- textual unlearning in VLMs significantly reduces the…
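
As a rough illustration of what unlearning solely in the textual domain could look like, one generic formulation combines gradient ascent on harmful text examples with ordinary training on benign text. This is a sketch of a common unlearning objective, not the loss confirmed by the abstract; the symbols $D_{\text{harm}}$, $D_{\text{retain}}$, $\ell$, and $\lambda$ are assumptions introduced here.

```latex
\min_{\theta} \;
  -\,\mathbb{E}_{(x,y)\sim D_{\text{harm}}}\big[\ell(f_\theta(x), y)\big]
  \;+\; \lambda\,\mathbb{E}_{(x,y)\sim D_{\text{retain}}}\big[\ell(f_\theta(x), y)\big]
```

Here $f_\theta$ is the language model inside the VLM, $x$ ranges over text-only prompts, the first term pushes the model away from harmful completions, and the second term, weighted by $\lambda > 0$, preserves utility on benign data. Since all input modalities are fused into the language space, the hope is that such a text-only update also affects vision-text inputs.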

Taxonomy

Topics: Risk and Safety Analysis · Safety Systems Engineering in Autonomy · Software Reliability and Analysis Research

Methods: Shrink and Fine-Tune