Concept-skill Transferability-based Data Selection for Large Vision-Language Models

Jaewoo Lee; Boyang Li; Sung Ju Hwang

arXiv:2406.10995 · cs.CV · October 3, 2024

TL;DR

This paper introduces COINCIDE, a scalable data selection method for efficient finetuning of large vision-language models, focusing on diversity and transferability to reduce training costs while maintaining high performance.

Contribution

COINCIDE uses a small model to cluster and select diverse, transferable VL data, significantly reducing training data and time without sacrificing model performance.

Findings

01. Achieves performance comparable to full-data finetuning on LLaVA-1.5 while using only 20% of the data.

02. Reduces training time on LLaVA-1.5 by 70%.

03. Outperforms eight baselines on two VL datasets.

Abstract

Instruction tuning, or supervised finetuning on extensive task-specific data, is necessary for Large Vision-Language Models (LVLMs) to generalize well across a broad range of vision-language (VL) tasks. However, training on large VL datasets can become prohibitively expensive. In this work, we introduce COINCIDE, an effective and scalable data selection technique that uses a small model as a reference model to select visual instruction tuning data for efficient finetuning of a target LVLM, focusing on diversity and transferability. Specifically, we cluster the training data using internal activations from a small model, which identifies VL concept-skill compositions needed by a target LVLM. We then sample data from these diverse clusters by considering their density and transferability, or the ability to transfer well to other concept-skill compositions. This approach ensures the…
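
As a rough illustration of the pipeline the abstract describes (cluster small-model activations, then sample from each cluster according to density and transferability), here is a minimal NumPy sketch. Everything specific in it is an assumption made for illustration, not the authors' implementation: the matrix `features` stands in for small-model activations, transferability is approximated by a centroid's mean cosine similarity to the other centroids, density by the mean member-to-centroid distance, and the softmax weighting with temperature `tau` is a hypothetical choice. Within-cluster sampling is plain random sampling. The official repository is the reference for the actual method.

```python
import numpy as np


def _assign(x, centroids):
    # Squared Euclidean distance from every point to every centroid, shape (n, k).
    d = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=1)


def kmeans(x, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns (labels, centroids)."""
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = _assign(x, centroids)
        for j in range(k):
            members = x[labels == j]
            if len(members):  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return _assign(x, centroids), centroids


def cluster_budgets(x, labels, centroids, total, tau=0.1):
    """Per-cluster sample counts (assumed weighting, requires k >= 2)."""
    k = len(centroids)
    sizes = np.bincount(labels, minlength=k)
    # Transferability proxy: mean cosine similarity to the other centroids.
    unit = centroids / (np.linalg.norm(centroids, axis=1, keepdims=True) + 1e-12)
    sim = unit @ unit.T
    transfer = (sim.sum(axis=1) - np.diag(sim)) / (k - 1)
    score = np.full(k, -np.inf)  # empty clusters get zero weight
    for j in range(k):
        if sizes[j]:
            # Density proxy: mean distance to the centroid (larger = sparser).
            spread = np.linalg.norm(x[labels == j] - centroids[j], axis=1).mean()
            score[j] = transfer[j] / (tau * (spread + 1e-12))
    w = np.exp(score - score.max())  # numerically stable softmax
    w /= w.sum()
    # Floor keeps the total within budget; no cluster is asked for more than it has.
    return np.minimum(sizes, np.floor(total * w).astype(int))


def select_indices(x, k, total, tau=0.1, seed=0):
    """Return indices of the selected subset (random sampling within clusters)."""
    x = np.asarray(x, dtype=float)
    labels, centroids = kmeans(x, k, seed=seed)
    budgets = cluster_budgets(x, labels, centroids, total, tau)
    rng = np.random.default_rng(seed)
    picked = [
        rng.choice(np.flatnonzero(labels == j), size=int(b), replace=False)
        for j, b in enumerate(budgets)
        if b > 0
    ]
    return np.concatenate(picked) if picked else np.array([], dtype=int)


# Synthetic stand-in for small-model activations: 200 samples, 8 dimensions.
features = np.random.default_rng(0).normal(size=(200, 8))
subset = select_indices(features, k=5, total=40)
```

Here `total=40` mirrors the 20% budget mentioned above (40 of 200 samples). Because budgets are floored and capped at cluster size, the sketch may return slightly fewer than `total` indices.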

Code & Models

Repositories

g-jwlee/coincide_code (PyTorch, official)

Videos

Concept-skill Transferability-based Data Selection for Large Vision-Language Models · Underline

Taxonomy

Topics: Multimodal Machine Learning Applications · Web Data Mining and Analysis · Advanced Image and Video Retrieval Techniques