C3L: Content Correlated Vision-Language Instruction Tuning Data Generation via Contrastive Learning
Ji Ma, Wei Suo, Peng Wang, Yanning Zhang

TL;DR
This paper introduces C3L, a contrastive learning approach that improves the relevance and quality of vision-language instruction tuning data by enhancing content correlation between images and generated instructions.
Contribution
The paper proposes a novel content relevance module and contrastive learning framework to generate higher-quality VLIT data, addressing prior issues of content relevance and exposure bias.
Findings
Enhanced content relevance between images and instructions.
Improved VLIT data quality demonstrated on four benchmarks.
Effective contrastive learning module boosts data generation capability.
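For readers unfamiliar with contrastive learning, the sketch below shows a generic image-to-text InfoNCE loss in plain Python. It is not the paper's contrastive module, whose details are not given in this summary. It only illustrates the general idea of scoring a matched image/instruction pair above mismatched ones. The function name `info_nce_loss`, the toy embeddings, and the temperature value are all assumptions made for illustration.

```python
import math


def _normalize(vec):
    """Scale a vector to unit L2 norm so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce_loss(image_embs, text_embs, temperature=0.07):
    """Generic one-directional (image -> text) InfoNCE loss.

    Assumption: image_embs[i] and text_embs[i] form the positive pair,
    and every other text in the batch acts as a negative for image i.
    This is an illustrative stand-in, not the C3L objective.
    """
    imgs = [_normalize(v) for v in image_embs]
    txts = [_normalize(v) for v in text_embs]
    total = 0.0
    for i, img in enumerate(imgs):
        # Cosine similarity of image i to every text, scaled by temperature.
        logits = [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        # Numerically stable log-sum-exp over all candidate texts.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Negative log-probability of picking the matched text.
        total += log_denom - logits[i]
    return total / len(imgs)


# Toy embeddings (hypothetical): two images and two instruction texts.
images = [[1.0, 0.0], [0.0, 1.0]]
matched_texts = [[0.9, 0.1], [0.1, 0.9]]
swapped_texts = [matched_texts[1], matched_texts[0]]

loss_matched = info_nce_loss(images, matched_texts)
loss_swapped = info_nce_loss(images, swapped_texts)
```

In this toy setup the loss is lower when each instruction is paired with its own image than when the pairs are swapped, which is the sense in which a contrastive objective rewards content-correlated pairs.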
Abstract
Vision-Language Instruction Tuning (VLIT) is a critical training phase for Large Vision-Language Models (LVLMs). As open-source LVLMs have grown more capable, researchers have increasingly used them to generate VLIT data and have achieved significant progress. However, such data generation approaches are bottlenecked by two challenges: 1) Because multi-modal models tend to be influenced by prior language knowledge, directly using LVLMs to generate VLIT data inevitably leads to low content relevance between the generated data and the images. 2) To improve the models' ability to generate VLIT data, previous methods have added a training phase to boost generative capacity. This process hurts the models' generalization to unseen inputs (i.e., the "exposure bias" problem). In this paper, we propose a new Content Correlated VLIT…
Taxonomy
Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Natural Language Processing Techniques
Methods: Contrastive Learning
