Narrow Fine-Tuning Erodes Safety Alignment in Vision-Language Agents
Idhant Gulati, Shivam Raval

TL;DR
Fine-tuning vision-language models on narrow harmful datasets causes significant and broad misalignment, which is difficult to fully mitigate, raising concerns about safety in continual learning scenarios.
Contribution
This paper demonstrates that narrow-domain fine-tuning induces broad misalignment in vision-language models and explores mitigation strategies, highlighting challenges in maintaining safety alignment.
Findings
Misalignment scales with LoRA rank and is higher in multimodal evaluation.
Even 10% harmful data in the training mixture causes significant degradation.
Harmful behaviors are low-dimensional, captured in 10 principal components.
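The low-dimensionality finding can be illustrated with a minimal sketch that measures how much variance the top principal components of a set of activation vectors capture. Everything here is an assumption for illustration: the function names are hypothetical, the data is synthetic, and the paper's actual geometric analysis may differ.

```python
import numpy as np


def explained_variance_ratio(activations: np.ndarray) -> np.ndarray:
    """Fraction of total variance captured by each principal component."""
    # Center each feature so the SVD spectrum matches the PCA spectrum.
    centered = activations - activations.mean(axis=0, keepdims=True)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()


def variance_in_top_k(activations: np.ndarray, k: int) -> float:
    """Cumulative variance fraction captured by the first k components."""
    return float(explained_variance_ratio(activations)[:k].sum())


# Synthetic stand-in for model activations (NOT data from the paper):
# 200 vectors in 64 dimensions lying near a 10-dimensional subspace.
rng = np.random.default_rng(0)
n_samples, hidden_dim, true_rank = 200, 64, 10
basis = rng.normal(size=(true_rank, hidden_dim))
coeffs = rng.normal(size=(n_samples, true_rank))
noise = 0.01 * rng.normal(size=(n_samples, hidden_dim))
synthetic = coeffs @ basis + noise

ratios = explained_variance_ratio(synthetic)
top10 = variance_in_top_k(synthetic, 10)
print(f"variance captured by 10 components: {top10:.4f}")
```

If a behavior really is low-dimensional, the cumulative ratio saturates after a handful of components; for activations with no such structure it grows slowly with k.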
Abstract
Lifelong multimodal agents must continuously adapt to new tasks through post-training, but this creates a fundamental tension between acquiring capabilities and preserving safety alignment. We demonstrate that fine-tuning aligned vision-language models on narrow-domain harmful datasets induces severe emergent misalignment that generalizes broadly across unrelated tasks and modalities. Through experiments on Gemma3-4B, we show that misalignment scales monotonically with LoRA rank, and that multimodal evaluation reveals substantially higher misalignment than text-only evaluation, suggesting that unimodal safety benchmarks may underestimate alignment degradation in vision-language models. Critically, even 10% harmful data in the training mixture induces substantial alignment degradation. Geometric analysis reveals that harmful behaviors…
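To make the "10% harmful data in the training mixture" setting concrete, here is a minimal sketch of building a fine-tuning mixture with a fixed harmful fraction. The helper `build_mixture`, the placeholder example strings, and the choice of sampling without replacement are all assumptions for illustration; the source does not describe how the authors constructed their mixtures.

```python
import random


def build_mixture(benign, harmful, harmful_fraction, total, seed=0):
    """Sample a shuffled training mixture with a fixed harmful fraction.

    Hypothetical helper: samples without replacement from each pool.
    """
    if not 0.0 <= harmful_fraction <= 1.0:
        raise ValueError("harmful_fraction must be in [0, 1]")
    n_harmful = round(total * harmful_fraction)
    n_benign = total - n_harmful
    if n_harmful > len(harmful) or n_benign > len(benign):
        raise ValueError("not enough examples in one of the pools")
    rng = random.Random(seed)  # local RNG keeps the mixture reproducible
    mixture = rng.sample(harmful, n_harmful) + rng.sample(benign, n_benign)
    rng.shuffle(mixture)
    return mixture


# Placeholder example pools (stand-ins, not the paper's datasets).
benign_pool = [f"benign-{i}" for i in range(1000)]
harmful_pool = [f"harmful-{i}" for i in range(1000)]

mix = build_mixture(benign_pool, harmful_pool, harmful_fraction=0.10, total=500)
n_harmful_in_mix = sum(example.startswith("harmful") for example in mix)
print(n_harmful_in_mix, len(mix))
```

Sweeping `harmful_fraction` over several values and evaluating after each fine-tuning run is one way such a dose-response relationship could be probed.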
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Adversarial Robustness in Machine Learning
