CoMM: A Coherent Interleaved Image-Text Dataset for Multimodal Understanding and Generation

Wei Chen; Lin Li; Yongqi Yang; Bin Wen; Fan Yang; Tingting Gao; Yu Wu; Long Chen

arXiv:2406.10462·cs.CV·April 3, 2025

TL;DR

CoMM is a high-quality dataset designed to improve the coherence, consistency, and alignment of interleaved image-text sequences, enabling better multimodal understanding and generation in large language models.

Contribution

The paper introduces CoMM, a novel dataset with a multi-perspective filtering strategy to enhance training data quality for multimodal interleaved content generation.

Findings

01. CoMM significantly improves MLLMs' in-context learning capabilities.

02. New evaluation tasks effectively measure interleaved generation abilities.

03. High-quality dataset enhances coherence and semantic alignment in multimodal outputs.

Abstract

Interleaved image-text generation has emerged as a crucial multimodal task, aiming at creating sequences of interleaved visual and textual content given a query. Despite notable advancements in recent multimodal large language models (MLLMs), generating integrated image-text sequences that exhibit narrative coherence and entity and style consistency remains challenging due to poor training data quality. To address this gap, we introduce CoMM, a high-quality Coherent interleaved image-text MultiModal dataset designed to enhance the coherence, consistency, and alignment of generated multimodal content. Initially, CoMM harnesses raw data from diverse sources, focusing on instructional content and visual storytelling, establishing a foundation for coherent and consistent content. To further refine the data quality, we devise a multi-perspective filter strategy that leverages advanced…

Code & Models

Repositories

hkust-longgroup/comm
pytorch

Datasets

weisuxi/CoMM
dataset · 169 downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques