C3L: Content Correlated Vision-Language Instruction Tuning Data Generation via Contrastive Learning

Ji Ma; Wei Suo; Peng Wang; Yanning Zhang

arXiv:2405.12752·cs.CV·July 3, 2024

Open Access

TL;DR

This paper introduces C3L, a contrastive learning approach that improves the relevance and quality of vision-language instruction tuning data by enhancing content correlation between images and generated instructions.

Contribution

The paper proposes a novel content relevance module and a contrastive learning framework to generate higher-quality VLIT data, addressing two problems in prior approaches: low content relevance between generated data and images, and exposure bias.

Findings

1. Enhanced content relevance between images and instructions.

2. Improved VLIT data quality demonstrated on four benchmarks.

3. Effective contrastive learning module boosts data generation capability.

Abstract

Vision-Language Instruction Tuning (VLIT) is a critical training phase for Large Vision-Language Models (LVLMs). With the improving capabilities of open-source LVLMs, researchers have increasingly turned to generating VLIT data with open-source LVLMs and have achieved significant progress. However, such data generation approaches are bottlenecked by the following challenges: 1) Since multi-modal models tend to be influenced by prior language knowledge, directly using LVLMs to generate VLIT data inevitably leads to low content relevance between the generated data and the images. 2) To improve the ability of the models to generate VLIT data, previous methods have incorporated an additional training phase to boost the generative capacity. This process hurts the generalization of the models to unseen inputs (i.e., the "exposure bias" problem). In this paper, we propose a new Content Correlated VLIT…

Taxonomy

Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Natural Language Processing Techniques

Methods: Contrastive Learning