CompCap: Improving Multimodal Large Language Models with Composite Captions

Xiaohui Chen; Satya Narayan Shukla; Mahmoud Azab; Aashu Singh; Qifan Wang; David Yang; ShengYun Peng; Hanchao Yu; Shen Yan; Xuewen Zhang; Baosheng He

arXiv:2412.05243 · cs.CV · December 9, 2024

TL;DR

This paper introduces CompCap, a framework for generating high-quality captions for composite images, and shows that fine-tuning MLLMs on the resulting CompCap-118K dataset improves their understanding of composite images across multiple benchmarks.

Contribution

The paper presents a novel method to synthesize composite image captions and creates a large dataset, CompCap-118K, to enhance MLLMs' ability to interpret composite images.

Findings

01. CompCap-118K improves MLLMs' understanding of composite images.

02. Fine-tuning with CompCap-118K yields average gains of 1.7% to 2.9% on benchmarks.

03. The approach bridges the gap between question-answer datasets and high-quality caption datasets for composite images.

Abstract

How well can Multimodal Large Language Models (MLLMs) understand composite images? Composite images (CIs) are synthetic visuals created by merging multiple visual elements, such as charts, posters, or screenshots, rather than being captured directly by a camera. While CIs are prevalent in real-world applications, recent MLLM developments have primarily focused on interpreting natural images (NIs). Our research reveals that current MLLMs face significant challenges in accurately understanding CIs, often struggling to extract information or perform complex reasoning based on these images. We find that existing training data for CIs are mostly formatted for question-answer tasks (e.g., in datasets like ChartQA and ScienceQA), while high-quality image-caption datasets, critical for robust vision-language alignment, are only available for NIs. To bridge this gap, we introduce Composite…

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications