Rethinking Bottlenecks in Safety Fine-Tuning of Vision Language Models
Yi Ding, Lijun Li, Bing Cao, Jing Shao

TL;DR
This paper introduces the Multi-Image Safety (MIS) dataset to improve safety visual reasoning in vision-language models, significantly enhancing safety performance without sacrificing general capabilities.
Contribution
The paper presents a novel multi-image safety dataset and demonstrates that fine-tuning models with MIS improves safety reasoning and performance in safety-critical tasks.
Findings
Fine-tuning with MIS improves safety reasoning accuracy.
MIS reduces attack success rates on safety benchmarks.
Model performance on general benchmarks is preserved.
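As a rough illustration of the metric behind the second finding, attack success rate (ASR) is commonly computed as the fraction of adversarial prompts whose response is judged unsafe. The sketch below assumes that definition; the function name and the judgment lists are hypothetical and are not taken from the paper or its benchmarks.

```python
from typing import Sequence


def attack_success_rate(is_unsafe: Sequence[bool]) -> float:
    """Return the fraction of attack prompts whose response was judged unsafe."""
    if not is_unsafe:
        raise ValueError("need at least one judged response")
    return sum(is_unsafe) / len(is_unsafe)


# Hypothetical safety judgments for the same five attack prompts,
# before and after safety fine-tuning (illustrative values only).
before = [True, True, False, True, False]
after = [False, False, False, True, False]

print(f"ASR before: {attack_success_rate(before):.0%}")
print(f"ASR after:  {attack_success_rate(after):.0%}")
```

A lower ASR after fine-tuning indicates fewer successful attacks. How each response is judged unsafe (human raters or an automated judge) is left outside this sketch.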
Abstract
Large Vision-Language Models (VLMs) have achieved remarkable performance across a wide range of tasks. However, their deployment in safety-critical domains poses significant challenges. Existing safety fine-tuning methods, which focus on textual or multimodal content, fall short in addressing challenging cases or disrupt the balance between helpfulness and harmlessness. Our evaluation highlights a safety reasoning gap: these methods lack safety visual reasoning ability, leading to such bottlenecks. To address this limitation and enhance both visual perception and reasoning in safety-critical contexts, we propose a novel dataset that integrates multi-image inputs with safety Chain-of-Thought (CoT) labels as fine-grained reasoning logic to improve model performance. Specifically, we introduce the Multi-Image Safety (MIS) dataset, an instruction-following dataset tailored for multi-image…
Peer Reviews
Decision: ICLR 2026 Poster
1. The MIS dataset is the first to focus on multi-image safety scenarios, a necessary and complex domain. The proposed MIRage, which uses safety CoT labels to teach reasoning rather than just refusal, is an effective solution to the problem. The dataset can be a useful resource to the community. 2. The method achieves a good balance of helpfulness and harmlessness. The results in Table 4 (near-0% ASR on their benchmark) and Table 6 (a slight increase in average accuracy on general benchmarks) sh
1. In Finding 1 and Discussion 1, the authors state that “fine-tuning models on such data leads to over-prudence on visual features, causing the model to reject benign visual inputs.” However, the exact experimental configuration for VLGuard-P is unclear. In the original VLGuard post-hoc fine-tuning setup, 5k general helpfulness samples were included. Therefore, two concerns arise: (1) if this experiment did not follow the same configuration, the conclusion may not be directly comparable; or (2)
- The paper addresses an interesting and important topic: safety gaps in multi-image VLMs, where it shows the benefits of CoT reasoning to address these limitations. - It introduces an interesting small dataset (MIS). - The paper studies an interesting safety loophole in VLMs using two neutral images + text; in this scope it is well-framed, and builds on existing work where neutral text and single images can still produce unsafe outputs. - Overall, the experiments and evaluations are thorou
The paper shows several weaknesses hampering generalizability of the approach aimed at addressing "a significant bottleneck in the safety capabilities of existing safeguarding methods" [line 045]: - The focus is solely on two images + text (which appears too narrow to generalize this claim) - Along the same lines, the dataset size also appears limited, especially for the smaller categories (e.g. self-harm, privacy), with only a few hundred images. - The method is only evaluated on internally trained models a
1. The paper systematically diagnoses and analyzes current safety fine-tuning bottlenecks through comprehensive experiments. 2. Authors present MIS, the first multi-image safety dataset, featuring a training split aimed at enhancing models’ safety-related visual perception and reasoning abilities. 3. Models finetuned with MIRage generalize well to previously unseen safety categories.
1. The “Safety CoT” template is interesting but not well formalized or analyzed—it’s unclear how much of the improvement comes from reasoning versus simply from data diversity. The authors should include ablations to isolate the impact of Safety CoT vs. multi-image input vs. fine-tuning data scale. 2. While MIS and MIRage are presented as key contributions, the fine-tuning pipeline itself largely follows a standard SFT paradigm. There is limited method innovation in model architecture or trainin
Taxonomy
Topics: Multimodal Machine Learning Applications · Semantic Web and Ontologies · Natural Language Processing Techniques
Methods: Focus
