Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models

Qirui Jiao; Daoyuan Chen; Yilun Huang; Bolin Ding; Yaliang Li; Ying Shen

arXiv:2408.04594·cs.CV·December 20, 2024

TL;DR

This paper introduces Img-Diff, a contrastive data synthesis method that creates detailed image difference datasets to improve multimodal large language models' fine-grained image recognition capabilities.

Contribution

The paper presents a novel automated data synthesis approach for generating contrastive image pairs and difference descriptions, enhancing MLLMs' performance on image understanding tasks.

Findings

01. Significant performance improvements on image difference tasks.

02. Outperforms GPT-4V and Gemini on the MMVP benchmark.

03. Provides a scalable, high-quality dataset for fine-grained image recognition.

Abstract

High-performance Multimodal Large Language Models (MLLMs) are heavily dependent on data quality. To advance fine-grained image recognition within MLLMs, we introduce a novel data synthesis method inspired by contrastive learning and image difference captioning. Our key idea involves challenging the model to discern both matching and distinct elements by scrutinizing object differences in detailed regions across similar images. We begin by generating pairs of similar images that emphasize object variations. Following this, we employ a Difference Area Generator to pinpoint object differences, and subsequently, a Difference Captions Generator to articulate these differences. This process results in a high-quality dataset of "object replacement" samples, termed Img-Diff, which can be scaled as needed due to its automated nature. We leverage this generated dataset to fine-tune…
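
The three stages named in the abstract (similar-pair generation, difference-area detection, difference captioning) can be illustrated with a toy sketch. This is not the authors' implementation: images are plain grids of integers, and `make_pair`, `find_difference_area`, `caption_difference`, and `synthesize` are hypothetical stand-ins for components that, in the paper, operate on real images. It only shows how the stages chain together and how a pair with no detectable difference is dropped.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Image = List[List[int]]          # toy image: rows of pixel values (assumption)
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), inclusive


@dataclass
class DiffSample:
    box: Box      # region where the two images differ
    caption: str  # text describing the difference


def make_pair(base: Image, box: Box, new_value: int) -> Tuple[Image, Image]:
    """Stage 1 stand-in: copy the image and overwrite one region,
    mimicking an "object replacement" between two similar images."""
    top, left, bottom, right = box
    edited = [row[:] for row in base]  # copy so the base image is untouched
    for r in range(top, bottom + 1):
        for c in range(left, right + 1):
            edited[r][c] = new_value
    return base, edited


def find_difference_area(a: Image, b: Image) -> Optional[Box]:
    """Stage 2 stand-in: bounding box of all differing pixels, or None."""
    coords = [(r, c)
              for r, row in enumerate(a)
              for c, value in enumerate(row)
              if value != b[r][c]]
    if not coords:
        return None
    rows = [r for r, _ in coords]
    cols = [c for _, c in coords]
    return (min(rows), min(cols), max(rows), max(cols))


def caption_difference(a: Image, b: Image, box: Box) -> str:
    """Stage 3 stand-in: a template caption instead of a generated one."""
    top, left, bottom, right = box
    changed = sum(1
                  for r in range(top, bottom + 1)
                  for c in range(left, right + 1)
                  if a[r][c] != b[r][c])
    return f"{changed} pixels differ inside region {box}."


def synthesize(base: Image, box: Box, new_value: int) -> Optional[DiffSample]:
    """Chain the three stages; return None when no difference is found."""
    first, second = make_pair(base, box, new_value)
    area = find_difference_area(first, second)
    if area is None:
        return None
    return DiffSample(box=area, caption=caption_difference(first, second, area))


if __name__ == "__main__":
    blank = [[0] * 4 for _ in range(4)]
    print(synthesize(blank, (1, 1, 2, 2), 9))
```

Keeping each stage as a separate function mirrors the modular description in the abstract, so any one stand-in could be swapped for a real model without touching the others.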

Code & Models

Datasets

datajuicer/Img-Diff
dataset · 90 dl

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling

Methods: Contrastive Learning