Concept-skill Transferability-based Data Selection for Large Vision-Language Models
Jaewoo Lee, Boyang Li, Sung Ju Hwang

TL;DR
This paper introduces COINCIDE, a scalable data selection method for efficient finetuning of large vision-language models, focusing on diversity and transferability to reduce training costs while maintaining high performance.
Contribution
COINCIDE uses a small model to cluster and select diverse, transferable VL data, significantly reducing training data and time without sacrificing model performance.
Findings
Achieves performance comparable to full-data finetuning using only 20% of the LLaVA-1.5 dataset.
Reduces wall-clock training time by 70% on LLaVA-1.5.
Outperforms 8 baselines on two VL datasets.
Abstract
Instruction tuning, or supervised finetuning on extensive task-specific data, is necessary for Large Vision-Language Models (LVLMs) to generalize well across a broad range of vision-language (VL) tasks. However, training on large VL datasets can become prohibitively expensive. In this work, we introduce COINCIDE, an effective and scalable data selection technique that uses a small model as a reference model to select visual instruction tuning data for efficient finetuning of a target LVLM, focusing on diversity and transferability. Specifically, we cluster the training data using internal activations from a small model, which identifies VL concept-skill compositions needed by a target LVLM. We then sample data from these diverse clusters by considering their density and transferability, or the ability to transfer well to other concept-skill compositions. This approach ensures the…
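The cluster-then-sample idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The function names (`kmeans`, `select_subset`), the temperature `tau`, the transferability proxy (mean cosine similarity between cluster centroids), the density proxy (mean Gaussian-kernel similarity of members to their centroid), and the uniform sampling within clusters are all assumptions made for illustration. The feature matrix stands in for internal activations of a small reference model.

```python
import numpy as np

def kmeans(x, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(1)
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[j] = members.mean(0)
    d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return d2.argmin(1), centers

def select_subset(features, k, budget, tau=1.0, seed=0):
    """Pick `budget` row indices, favoring transferable, less dense clusters."""
    rng = np.random.default_rng(seed)
    labels, centers = kmeans(features, k, seed=seed)
    counts = np.bincount(labels, minlength=k)

    # Assumed transferability proxy: mean cosine similarity to other centroids.
    unit = centers / (np.linalg.norm(centers, axis=1, keepdims=True) + 1e-8)
    sim = unit @ unit.T
    transfer = (sim.sum(1) - np.diag(sim)) / max(k - 1, 1)

    # Assumed density proxy: mean Gaussian-kernel similarity to own centroid.
    d2_own = ((features - centers[labels]) ** 2).sum(-1)
    bandwidth = np.median(d2_own) + 1e-8
    kern = np.exp(-d2_own / bandwidth)
    density = np.bincount(labels, weights=kern, minlength=k) / np.maximum(counts, 1)

    # Softmax over clusters: higher transferability and lower density
    # receive a larger share of the budget. Empty clusters get zero mass.
    logits = np.where(counts > 0,
                      transfer / (tau * np.maximum(density, 1e-8)),
                      -np.inf)
    cluster_p = np.exp(logits - logits.max())
    cluster_p /= cluster_p.sum()

    # Spread each cluster's probability uniformly over its members,
    # then draw the subset without replacement.
    item_p = cluster_p[labels] / counts[labels]
    item_p /= item_p.sum()
    return rng.choice(len(features), size=budget, replace=False, p=item_p)

# Usage with synthetic stand-in features (300 samples, 8 dimensions).
rng = np.random.default_rng(1)
feats = np.concatenate([rng.normal(c, 1.0, size=(100, 8)) for c in (-5.0, 0.0, 5.0)])
chosen = select_subset(feats, k=3, budget=60)  # 20% of the data
```

In this sketch the per-cluster share is a softmax of transferability divided by density, so dense clusters of near-duplicate samples contribute fewer examples while clusters whose centroids resemble many others contribute more. Real activations, the number of clusters, and the exact scoring would need to follow the paper.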
Taxonomy
Topics: Multimodal Machine Learning Applications · Web Data Mining and Analysis · Advanced Image and Video Retrieval Techniques
