Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models
Qirui Jiao, Daoyuan Chen, Yilun Huang, Bolin Ding, Yaliang Li, Ying Shen

TL;DR
This paper introduces Img-Diff, a contrastive data synthesis method that creates detailed image difference datasets to improve multimodal large language models' fine-grained image recognition capabilities.
Contribution
The paper presents a novel automated data synthesis approach for generating contrastive image pairs and difference descriptions, enhancing MLLMs' performance on image understanding tasks.
Findings
Fine-tuning MLLMs on Img-Diff yields significant performance improvements on image difference tasks.
The fine-tuned models outperform GPT-4V and Gemini on the MMVP benchmark.
Provides a scalable, high-quality dataset for fine-grained image recognition.
Abstract
High-performance Multimodal Large Language Models (MLLMs) are heavily dependent on data quality. To advance fine-grained image recognition within MLLMs, we introduce a novel data synthesis method inspired by contrastive learning and image difference captioning. Our key idea involves challenging the model to discern both matching and distinct elements by scrutinizing object differences in detailed regions across similar images. We begin by generating pairs of similar images that emphasize object variations. Following this, we employ a Difference Area Generator to pinpoint object differences, and subsequently, a Difference Captions Generator to articulate these differences. This process results in a high-quality dataset of "object replacement" samples, termed Img-Diff, which can be scaled as needed due to its automated nature. We leverage this generated dataset to fine-tune…
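The two-stage idea in the abstract (first locate the regions where two similar images differ, then describe each region) can be illustrated with a toy sketch. This is not the paper's Difference Area Generator or Difference Captions Generator: the functions `difference_boxes` and `caption_difference`, the block size, and the threshold are all hypothetical, and plain pixel differences stand in for whatever learned components the authors use.

```python
from typing import List, Tuple

Image = List[List[int]]          # grayscale image as rows of pixel intensities
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), bottom/right exclusive


def difference_boxes(img_a: Image, img_b: Image,
                     block: int = 2, threshold: float = 10.0) -> List[Box]:
    """Toy stand-in for a difference-area step (hypothetical, not the paper's).

    Splits both images into block x block tiles and returns the tiles whose
    mean absolute pixel difference exceeds `threshold`.
    """
    if len(img_a) != len(img_b) or any(
        len(row_a) != len(row_b) for row_a, row_b in zip(img_a, img_b)
    ):
        raise ValueError("images must have the same shape")
    height, width = len(img_a), len(img_a[0])
    boxes: List[Box] = []
    for top in range(0, height, block):
        for left in range(0, width, block):
            bottom = min(top + block, height)
            right = min(left + block, width)
            diffs = [
                abs(img_a[y][x] - img_b[y][x])
                for y in range(top, bottom)
                for x in range(left, right)
            ]
            if sum(diffs) / len(diffs) > threshold:
                boxes.append((top, left, bottom, right))
    return boxes


def caption_difference(box: Box) -> str:
    """Placeholder for a captioning model: it only reports where the change is."""
    top, left, bottom, right = box
    return f"The images differ in rows {top}-{bottom - 1}, columns {left}-{right - 1}."


# A 4x4 image pair that differs only in the bottom-right 2x2 block.
image_a = [[0] * 4 for _ in range(4)]
image_b = [row[:] for row in image_a]
for y in (2, 3):
    for x in (2, 3):
        image_b[y][x] = 200

regions = difference_boxes(image_a, image_b)
captions = [caption_difference(box) for box in regions]
```

In a real pipeline the caption step would describe what changed (for example, which object replaced which), not only where; the placeholder here keeps the sketch self-contained.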
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
Methods: Contrastive Learning
