Picking the Cream of the Crop: Visual-Centric Data Selection with Collaborative Agents
Zhenyu Liu, Yunxin Li, Baotian Hu, Wenhan Luo, Yaowei Wang, Min Zhang

TL;DR
ViSA is a visual-centric data selection method built on collaborating agents. It filters high-quality, relevant image-instruction pairs from large datasets, improving multimodal model training efficiency and achieving comparable or better results with only 2.5% of the data.
Contribution
This paper presents a novel visual-centric data selection approach using collaborative agents to improve data quality for training multimodal models, reducing data requirements while maintaining performance.
Findings
Outperforms or matches state-of-the-art on seven benchmarks.
Uses only 2.5% of the original data for training.
Ablation studies validate the contribution of each component.
Abstract
To improve the ability of Multimodal Large Language Models (MLLMs) to process images and complex instructions, researchers predominantly curate large-scale visual instruction tuning datasets, which are either sourced from existing vision tasks or synthetically generated using LLMs and image descriptions. However, these datasets often suffer from critical flaws, including misaligned instruction-image pairs and low-quality images. Such issues hinder training efficiency and limit performance improvements, as models waste resources on noisy or irrelevant data with minimal benefit to overall capability. To address this issue, we propose a Visual-Centric Selection approach via Agents Collaboration (ViSA), which centers on image quality assessment and image-instruction relevance evaluation. Specifically, our approach consists of 1) an image information quantification method via…
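The abstract names two selection signals, image quality and image-instruction relevance, and the findings report training on 2.5% of the data. The sketch below is only a rough illustration of score-based subset selection under those two signals. It is not the authors' method: the `Sample` fields, the product used to combine the two scores, and the `select_top_fraction` helper are all assumptions made for this example, and the scores are taken as already computed.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    """One image-instruction pair with two precomputed scores (hypothetical fields)."""
    sample_id: str
    image_quality: float  # assumed score in [0, 1]; higher means a better image
    relevance: float      # assumed score in [0, 1]; higher means better image-instruction alignment


def select_top_fraction(samples: List[Sample], fraction: float = 0.025) -> List[Sample]:
    """Return the highest-scoring `fraction` of samples.

    The combined score is the product of the two scores. This combination
    rule is an assumption for illustration, not taken from the paper.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    if not samples:
        return []
    ranked = sorted(
        samples,
        key=lambda s: s.image_quality * s.relevance,
        reverse=True,
    )
    # Keep at least one sample so that tiny datasets do not yield an empty subset.
    budget = max(1, round(fraction * len(samples)))
    return ranked[:budget]


if __name__ == "__main__":
    # Toy dataset: image quality rises with the index, relevance is constant.
    data = [Sample(f"s{i}", image_quality=i / 400, relevance=1.0) for i in range(400)]
    subset = select_top_fraction(data, fraction=0.025)
    print(len(subset), subset[0].sample_id)
```

With a 400-sample toy dataset and a 2.5% budget, the helper keeps 10 samples. A product of scores means a sample must do well on both signals to rank highly; a weighted sum or a per-signal threshold would be equally plausible choices.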
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Natural Language Processing Techniques
