Does the Question Really Matter? Training-Free Data Selection for Vision-Language SFT
Peng Sun, Huawen Shen, Yi Ban, Tianfan Fu, Yanbo Wang, Yuqiang Li

TL;DR
This paper introduces CVS, a training-free data selection method for vision-language SFT. It uses a frozen VLLM to measure how much the assessed validity of an answer, given an image, changes when the question is added, and selects the samples that require joint vision-language reasoning.
Contribution
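CVS is a training-free approach to selecting high-quality multimodal samples. It scores each sample by how much the answer's assessed validity changes once the question is provided, avoiding the proxy model training that prior selection methods often rely on.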
Findings
Trained on only 10-15% of the Vision-Flan data, CVS outperforms full-data training.
The reported improvements over full-data training on Vision-Flan are 3.5% and 4.8%.
CVS reduces computational costs by up to 44.4% compared to existing methods.
Abstract
Visual instruction tuning is crucial for improving vision-language large models (VLLMs). However, many samples can be solved via linguistic patterns or common-sense shortcuts, without genuine cross-modal reasoning, limiting the effectiveness of multimodal learning. Prior data selection methods often rely on costly proxy model training and focus on difficulty or diversity, failing to capture a sample's true contribution to vision-language joint reasoning. In this paper, we propose CVS, a training-free data selection method based on the insight that, for high-quality multimodal samples, introducing the question should substantially alter the model's assessment of answer validity given an image. CVS leverages a frozen VLLM as an evaluator and measures the discrepancy in answer validity with and without conditioning on the question, enabling the identification of samples that require…
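The scoring idea in the abstract can be illustrated with a minimal sketch. It assumes a hypothetical `evaluator` callable that wraps a frozen VLLM and returns the probability that an answer is valid for an image, optionally conditioned on the question. The absolute-difference score, the `Sample` layout, and the top-fraction selection rule are illustrative assumptions, not the paper's exact formulation.

```python
import math
from typing import Callable, List, Optional, Sequence, TypedDict


class Sample(TypedDict):
    image: str      # path or identifier of the image (hypothetical layout)
    question: str
    answer: str


# Hypothetical evaluator: (image, answer, question or None) -> P(answer is valid).
# In practice this would wrap a frozen VLLM; no training is involved.
Evaluator = Callable[[str, str, Optional[str]], float]


def discrepancy_score(sample: Sample, evaluator: Evaluator) -> float:
    """Change in assessed answer validity when the question is added.

    Assumption: the discrepancy is taken as an absolute difference.
    """
    with_q = evaluator(sample["image"], sample["answer"], sample["question"])
    without_q = evaluator(sample["image"], sample["answer"], None)
    return abs(with_q - without_q)


def select_top_fraction(
    samples: Sequence[Sample], evaluator: Evaluator, fraction: float
) -> List[Sample]:
    """Keep the `fraction` of samples with the largest discrepancy scores."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    k = max(1, math.ceil(fraction * len(samples)))
    ranked = sorted(
        samples, key=lambda s: discrepancy_score(s, evaluator), reverse=True
    )
    return ranked[:k]
```

Under these assumptions, a sample whose answer looks equally valid with or without the question scores near zero and is dropped first, while a sample whose answer only makes sense once the question is known ranks high. Calling `select_top_fraction(samples, evaluator, 0.10)` would keep the top 10% of a dataset.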
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
