ICONS: Influence Consensus for Vision-Language Data Selection
Xindi Wu, Mengzhou Xia, Rulin Shao, Zhiwei Deng, Pang Wei Koh, Olga Russakovsky

TL;DR
ICONS is a gradient-based data selection method that identifies valuable vision-language training examples across tasks, reducing data size while maintaining high performance and generalization.
Contribution
Introduces ICONS, a novel influence consensus approach leveraging training dynamics and majority voting for robust, scalable, and cross-task data selection in vision-language models.
Findings
Models trained on a selected 20% of the data retain over 98% of full-data performance.
Selected data generalizes well to unseen tasks and architectures.
Compact selected subsets are released for efficient model development.
Abstract
Training vision-language models via instruction tuning relies on large data mixtures spanning diverse tasks and domains, yet these mixtures frequently include redundant information that increases computational costs without proportional gains. Existing methods typically rely on task-agnostic heuristics to estimate data importance, limiting their effectiveness across tasks. We introduce ICONS, a gradient-based Influence CONsensus approach for vision-language data Selection. Our method leverages first-order training dynamics to estimate each example's influence on validation performance, then aggregates these estimates across tasks via majority voting. This cross-task consensus identifies consistently valuable data points while mitigating score calibration and outlier sensitivity, enabling robust and scalable data selection for diverse multitask mixtures. Models trained on our selected…
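The two stages described in the abstract (per-task influence estimation, then cross-task majority voting) can be illustrated with a minimal sketch. This is not the authors' implementation: the cosine-similarity influence proxy, the function names `influence_scores` and `consensus_select`, the per-task quantile threshold, and the random "gradients" are all assumptions made for illustration.

```python
import numpy as np

def influence_scores(train_grads, val_grads_by_task, eps=1e-12):
    """First-order influence proxy (an assumption of this sketch): cosine
    similarity between each training-example gradient and the mean
    validation gradient of each task. Returns shape (n_tasks, n_train)."""
    norms = np.linalg.norm(train_grads, axis=1, keepdims=True)
    train_unit = train_grads / (norms + eps)
    scores = []
    for val_grads in val_grads_by_task:
        target = val_grads.mean(axis=0)
        target = target / (np.linalg.norm(target) + eps)
        scores.append(train_unit @ target)
    return np.stack(scores)

def consensus_select(scores, vote_quantile, budget):
    """Each task votes for the examples at or above its own quantile
    threshold; examples are then ranked by their total vote count."""
    thresholds = np.quantile(scores, vote_quantile, axis=1, keepdims=True)
    votes = (scores >= thresholds).sum(axis=0)
    # Stable sort on negated votes: ties keep their original index order.
    order = np.argsort(-votes, kind="stable")
    return order[:budget], votes

# Toy data standing in for real gradient features (hypothetical sizes).
rng = np.random.default_rng(0)
train_grads = rng.normal(size=(100, 16))                        # 100 training examples
val_grads_by_task = [rng.normal(size=(8, 16)) for _ in range(3)]  # 3 target tasks

scores = influence_scores(train_grads, val_grads_by_task)
selected, votes = consensus_select(scores, vote_quantile=0.8, budget=20)
```

Because each task thresholds its own scores before voting, raw influence magnitudes are never compared across tasks, which is one way to read the abstract's point about score calibration and outlier sensitivity.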
Taxonomy
Topics: Natural Language Processing Techniques · Semantic Web and Ontologies · Advanced Image and Video Retrieval Techniques
