COHERENCE: Benchmarking Fine-Grained Image-Text Alignment in Interleaved Multimodal Contexts

Bingli Wang; Huanze Tang; Haijun Lv; Zhishan Lin; Lixin Gu; Lei Feng; Qipeng Guo; Kai Chen

arXiv:2604.27389·cs.CV·May 14, 2026

TL;DR

COHERENCE is a benchmark that evaluates how well multimodal large language models (MLLMs) understand and align interleaved image-text content across diverse domains, a setting that existing single-image and multi-image benchmarks do not systematically cover.

Contribution

The paper introduces COHERENCE, a comprehensive benchmark with 6,161 questions for assessing fine-grained image-text alignment in interleaved contexts, along with detailed error analysis.

Findings

01 MLLMs struggle with fine-grained interleaved image-text understanding.

02 The benchmark reveals specific failure modes in current models.

03 Error analysis guides future improvements in multimodal reasoning.

Abstract

In recent years, Multimodal Large Language Models (MLLMs) have achieved remarkable progress on a wide range of multimodal benchmarks. Despite these advances, most existing benchmarks focus mainly on single-image or multi-image comprehension. In real-world scenarios such as document reading, information is often presented as interleaved multimodal contexts. This requires MLLMs not only to recognize the content of individual images, but also to identify relevant textual and visual evidence, establish fine-grained alignments between them, and reason over these aligned signals in interleaved contexts. However, there is still a lack of systematic benchmarks for quantifying the fine-grained understanding ability of MLLMs in interleaved image-text contexts. To fill this gap, we propose COHERENCE, a benchmark designed to evaluate the ability of MLLMs to recover…

Code & Models

Datasets

BingliW/COHERENCE
dataset · 95 dl
