VisNec: Measuring and Leveraging Visual Necessity for Multimodal Instruction Tuning

Mingkang Dong; Hongyi Cai; Jie Li; Sifan Zhou; Bin Ren; Kunyu Peng; Yuqian Fu

arXiv:2603.01195·cs.CV·March 3, 2026

Open Access

TL;DR

VisNec introduces a data selection method that identifies visually necessary samples for multimodal instruction tuning, significantly reducing data size while maintaining or improving performance across multiple benchmarks.

Contribution

The paper presents VisNec, a novel framework for measuring visual necessity in training samples, improving data efficiency and robustness in multimodal instruction tuning.

Findings

01 Training on 15% of data selected by VisNec achieves full-data performance.

02 Selected data surpasses full-data training on smaller datasets.

03 VisNec enhances data efficiency and robustness in multimodal models.

Abstract

The effectiveness of multimodal instruction tuning depends not only on dataset scale, but critically on whether training samples genuinely require visual reasoning. However, existing instruction datasets often contain a substantial portion of visually redundant samples (solvable from text alone), as well as multimodally misaligned supervision that can degrade learning. To address this, we propose VisNec (Visual Necessity Score), a principled data selection framework that measures the marginal contribution of visual input during instruction tuning. By comparing predictive loss with and without visual context, VisNec identifies whether a training instance is vision-critical, redundant, or misaligned. To preserve task diversity, we combine VisNec with semantic clustering and select high-necessity samples within each cluster. Across 10 downstream benchmarks, training on only 15% of the…
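The selection idea described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the abstract does not give the exact scoring formula, so the sketch assumes the score is simply the text-only loss minus the loss with the image, and the `margin` threshold, the function names, and the `keep_fraction` parameter are all hypothetical. Per-sample losses and cluster assignments are assumed to have been computed elsewhere.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def necessity_score(loss_text_only: float, loss_with_image: float) -> float:
    """Assumed score: how much the image lowers the predictive loss."""
    return loss_text_only - loss_with_image


def categorize(score: float, margin: float = 0.1) -> str:
    """Bucket a sample by its score; `margin` is a hypothetical threshold."""
    if score > margin:
        return "vision-critical"  # the image clearly helps
    if score < -margin:
        return "misaligned"       # the image makes prediction worse
    return "redundant"            # the text alone is about as good


def select_per_cluster(
    scores: Sequence[float],
    cluster_ids: Sequence[int],
    keep_fraction: float = 0.15,
) -> List[int]:
    """Keep the highest-scoring samples inside each cluster."""
    by_cluster: Dict[int, List[int]] = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        by_cluster[cid].append(idx)

    selected: List[int] = []
    for members in by_cluster.values():
        # Keep at least one sample so no cluster disappears entirely.
        k = max(1, round(keep_fraction * len(members)))
        ranked = sorted(members, key=lambda i: scores[i], reverse=True)
        selected.extend(ranked[:k])
    return sorted(selected)


# Toy usage with made-up losses for six samples in two clusters.
text_only = [2.0, 1.0, 1.5, 1.0, 3.0, 1.25]
with_image = [0.5, 1.0, 2.5, 0.75, 1.0, 1.0]
scores = [necessity_score(t, v) for t, v in zip(text_only, with_image)]
clusters = [0, 0, 0, 1, 1, 1]
chosen = select_per_cluster(scores, clusters, keep_fraction=0.34)
```

Selecting within each cluster rather than globally is what keeps low-frequency task types represented, which matches the abstract's stated goal of preserving task diversity.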

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning