CompCap: Improving Multimodal Large Language Models with Composite Captions
Xiaohui Chen, Satya Narayan Shukla, Mahmoud Azab, Aashu Singh, Qifan Wang, David Yang, ShengYun Peng, Hanchao Yu, Shen Yan, Xuewen Zhang, Baosheng He

TL;DR
This paper introduces CompCap, a framework for generating high-quality captions for composite images, and shows that fine-tuning MLLMs on the resulting CompCap-118K dataset improves their understanding of composite images across multiple benchmarks.
Contribution
The paper presents a novel method to synthesize composite image captions and creates a large dataset, CompCap-118K, to enhance MLLMs' ability to interpret composite images.
Findings
CompCap-118K improves MLLMs' understanding of composite images.
Fine-tuning on CompCap-118K yields average gains of 1.7% to 2.9% across the evaluated benchmarks.
The approach bridges the gap between question-answer datasets and high-quality caption datasets for composite images.
Abstract
How well can Multimodal Large Language Models (MLLMs) understand composite images? Composite images (CIs) are synthetic visuals created by merging multiple visual elements, such as charts, posters, or screenshots, rather than being captured directly by a camera. While CIs are prevalent in real-world applications, recent MLLM developments have primarily focused on interpreting natural images (NIs). Our research reveals that current MLLMs face significant challenges in accurately understanding CIs, often struggling to extract information or perform complex reasoning based on these images. We find that existing training data for CIs are mostly formatted for question-answer tasks (e.g., in datasets like ChartQA and ScienceQA), while high-quality image-caption datasets, critical for robust vision-language alignment, are only available for NIs. To bridge this gap, we introduce Composite…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
