Rethinking Bottlenecks in Safety Fine-Tuning of Vision Language Models

Yi Ding; Lijun Li; Bing Cao; Jing Shao

arXiv:2501.18533·cs.CV·February 4, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces the Multi-Image Safety (MIS) dataset to improve safety visual reasoning in vision-language models, significantly enhancing safety performance without sacrificing general capabilities.

Contribution

The paper presents a novel multi-image safety dataset and demonstrates that fine-tuning models with MIS improves safety reasoning and performance in safety-critical tasks.

Findings

1. Fine-tuning with MIS improves safety reasoning accuracy.
2. MIS reduces attack success rates on safety benchmarks.
3. Model performance on general benchmarks is preserved.
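
The attack success rate (ASR) mentioned in the findings is conventionally the fraction of attack prompts for which the model produces an unsafe response. The sketch below illustrates that arithmetic only. It assumes each response has already been judged safe or unsafe by some evaluator; the function name and the example judgements are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def attack_success_rate(unsafe_flags: Sequence[bool]) -> float:
    """Return the fraction of attack prompts judged to have elicited an unsafe response."""
    if not unsafe_flags:
        raise ValueError("need at least one judged response")
    # True counts as 1 and False as 0, so the sum is the number of successful attacks.
    return sum(unsafe_flags) / len(unsafe_flags)


# Hypothetical judgements for five attack prompts (True = unsafe response).
# These values are made up for illustration; they are not results from the paper.
before = [True, True, False, True, False]
after = [False, False, False, True, False]

asr_before = attack_success_rate(before)
asr_after = attack_success_rate(after)
print(f"ASR before: {asr_before:.0%}, after: {asr_after:.0%}")
```

A lower ASR after fine-tuning means fewer attack prompts succeed. On its own it says nothing about helpfulness, which is why the findings also report accuracy on general benchmarks.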

Abstract

Large Vision-Language Models (VLMs) have achieved remarkable performance across a wide range of tasks. However, their deployment in safety-critical domains poses significant challenges. Existing safety fine-tuning methods, which focus on textual or multimodal content, fall short in addressing challenging cases or disrupt the balance between helpfulness and harmlessness. Our evaluation highlights a safety reasoning gap: these methods lack safety visual reasoning ability, leading to such bottlenecks. To address this limitation and enhance both visual perception and reasoning in safety-critical contexts, we propose a novel dataset that integrates multi-image inputs with safety Chain-of-Thought (CoT) labels as fine-grained reasoning logic to improve model performance. Specifically, we introduce the Multi-Image Safety (MIS) dataset, an instruction-following dataset tailored for multi-image…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. The MIS dataset is the first to focus on multi-image safety scenarios, a necessary and complex domain. The proposed MIRage, which uses safety CoT labels to teach reasoning rather than just refusal, is an effective solution to the problem. The dataset can be a useful resource to the community.
2. The method achieves a good balance of helpfulness and harmlessness. The results in Table 4 (near-0% ASR on their benchmark) and Table 6 (a slight increase in average accuracy on general benchmarks) sh

Weaknesses

1. In Finding 1 and Discussion 1, the authors state that “fine-tuning models on such data leads to over-prudence on visual features, causing the model to reject benign visual inputs.” However, the exact experimental configuration for VLGuard-P is unclear. In the original VLGuard post-hoc fine-tuning setup, 5k general helpfulness samples were included. Therefore, two concerns arise: (1) if this experiment did not follow the same configuration, the conclusion may not be directly comparable; or (2)

Reviewer 02 · Rating 6 · Confidence 3

Strengths

- The paper addresses an interesting and important topic: safety gaps in multi-image VLMs, and it shows the benefits of CoT reasoning in addressing these limitations.
- It introduces an interesting small dataset (MIS).
- The paper studies an interesting safety loophole in VLMs using two neutral images + text; in this scope it is well framed, and it builds on existing work showing that neutral text and single images can still produce unsafe outputs.
- Overall, the experiments and evaluations are thorou

Weaknesses

The paper shows several weaknesses that hamper the generalizability of the approach, which aims to address "a significant bottleneck in the safety capabilities of existing safeguarding methods" [line 045]:
- The focus is solely on two images + text, which appears too limited to support this general claim.
- Along this line, the dataset's size also appears limited, especially for the smaller categories (e.g. self-harm, privacy), with only a few hundred images.
- The method is only evaluated on internally trained models a

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. The paper systematically diagnoses and analyzes current safety fine-tuning bottlenecks through comprehensive experiments.
2. Authors present MIS, the first multi-image safety dataset, featuring a training split aimed at enhancing models’ safety-related visual perception and reasoning abilities.
3. Models finetuned with MIRage generalize well to previously unseen safety categories.

Weaknesses

1. The “Safety CoT” template is interesting but not well formalized or analyzed; it’s unclear how much of the improvement comes from reasoning versus simply from data diversity. The authors should include ablations to isolate the impact of Safety CoT vs. multi-image input vs. fine-tuning data scale.
2. While MIS and MIRage are presented as key contributions, the fine-tuning pipeline itself largely follows a standard SFT paradigm. There is limited method innovation in model architecture or trainin

Taxonomy

Topics: Multimodal Machine Learning Applications · Semantic Web and Ontologies · Natural Language Processing Techniques

Methods: Focus