Does the Question Really Matter? Training-Free Data Selection for Vision-Language SFT

Peng Sun; Huawen Shen; Yi Ban; Tianfan Fu; Yanbo Wang; Yuqiang Li

arXiv:2603.09715 · cs.AI · March 11, 2026


TL;DR

This paper introduces CVS, a training-free data selection method for vision-language models. CVS identifies samples that require joint vision-language reasoning by measuring how much a frozen evaluator's assessment of answer validity changes when the question is included versus omitted. The selected subsets improve both training efficiency and downstream performance.

Contribution

CVS offers a novel, training-free approach to select high-quality multimodal samples based on answer validity changes, surpassing prior proxy-based methods.

Findings

01

CVS outperforms full-data training with only 10-15% of data on Vision-Flan.

02

On Vision-Flan, CVS improves over full-data training by 3.5% and 4.8%.

03

CVS reduces computational costs by up to 44.4% compared to existing methods.

Abstract

Visual instruction tuning is crucial for improving vision-language large models (VLLMs). However, many samples can be solved via linguistic patterns or common-sense shortcuts, without genuine cross-modal reasoning, limiting the effectiveness of multimodal learning. Prior data selection methods often rely on costly proxy model training and focus on difficulty or diversity, failing to capture a sample's true contribution to vision-language joint reasoning. In this paper, we propose CVS, a training-free data selection method based on the insight that, for high-quality multimodal samples, introducing the question should substantially alter the model's assessment of answer validity given an image. CVS leverages a frozen VLLM as an evaluator and measures the discrepancy in answer validity with and without conditioning on the question, enabling the identification of samples that require…
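The selection idea described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Sample` type, the `validity` callable (standing in for a frozen VLLM that returns a probability that the answer is valid), the use of an absolute difference as the discrepancy, and the top-fraction selection rule are all assumptions made for the example. `toy_validity` is a hard-coded stand-in so the sketch runs without a model.

```python
from typing import Callable, List, NamedTuple, Optional, Sequence


class Sample(NamedTuple):
    """Hypothetical container for one instruction-tuning sample."""
    image: str      # placeholder for an image path or identifier
    question: str
    answer: str


# Assumed evaluator interface: (image, question or None, answer) -> P(valid) in [0, 1].
Validity = Callable[[str, Optional[str], str], float]


def discrepancy(sample: Sample, validity: Validity) -> float:
    """Change in judged answer validity when the question is added.

    The absolute difference is an assumption; the abstract only says the
    assessment should change substantially for high-quality samples.
    """
    with_question = validity(sample.image, sample.question, sample.answer)
    without_question = validity(sample.image, None, sample.answer)
    return abs(with_question - without_question)


def select_top_fraction(
    samples: Sequence[Sample], validity: Validity, fraction: float
) -> List[Sample]:
    """Keep the given fraction of samples with the largest discrepancy."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ranked = sorted(samples, key=lambda s: discrepancy(s, validity), reverse=True)
    keep = max(1, round(len(ranked) * fraction))
    return ranked[:keep]


def toy_validity(image: str, question: Optional[str], answer: str) -> float:
    """Hard-coded stand-in for a frozen VLLM evaluator (illustrative values only)."""
    table = {
        ("img1", "What color is the car?", "red"): 0.9,
        ("img1", None, "red"): 0.2,
        ("img2", "Is the sky blue?", "yes"): 0.8,
        ("img2", None, "yes"): 0.75,
    }
    return table[(image, question, answer)]
```

In this toy setup, the first sample's answer is judged far more valid once the question is supplied, so it ranks above the second sample, whose answer looks almost equally valid with or without the question.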


Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling