Same Content, Different Answers: Cross-Modal Inconsistency in MLLMs
Angela van Sprang, Laurens Samson, Ana Lucic, Erman Acar, Sennay Ghebreab, Yuki M. Asano

TL;DR
This paper introduces new benchmarks to evaluate cross-modal inconsistency in multimodal large language models, revealing significant modality gaps and factors affecting model reasoning across vision and language.
Contribution
The paper presents the REST and REST+ benchmarks for systematically evaluating cross-modal inconsistency in MLLMs, exposing the models' limitations and the factors that influence that inconsistency.
Findings
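State-of-the-art MLLMs cannot reason consistently over the same content in different modalities, and the degree of inconsistency varies substantially across the 15 models evaluated.
Neither rendering text as an image nor rendering an image as text resolves the inconsistency.
Even when OCR is correct, visual characteristics such as text color and resolution (but not font) affect model performance.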
Abstract
We introduce two new benchmarks REST and REST+ (Render-Equivalence Stress Tests) to enable systematic evaluation of cross-modal inconsistency in multimodal large language models (MLLMs). MLLMs are trained to represent vision and language in the same embedding space, yet they cannot perform the same tasks in both modalities. Our benchmarks contain samples with the same semantic information in three modalities (image, text, mixed) and we show that state-of-the-art MLLMs cannot consistently reason over these different modalities. We evaluate 15 MLLMs and find that the degree of modality inconsistency varies substantially, even when accounting for problems with text recognition (OCR). Neither rendering text as image nor rendering an image as text solves the inconsistency. Even if OCR is correct, we find that visual characteristics (text colour and resolution, but not font) and the number of…
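The idea of checking whether a model gives the same answer to the same content in three modalities can be illustrated with a small sketch. This is not the paper's metric or code: the function name `consistency_rate`, the `(sample_id, modality, answer)` record layout, and the exact-match comparison are all assumptions made for illustration.

```python
from collections import defaultdict

# The three modalities named in the abstract.
MODALITIES = ("text", "image", "mixed")


def consistency_rate(predictions):
    """Fraction of samples answered identically in all three modalities.

    `predictions` is an iterable of (sample_id, modality, answer) triples.
    This layout and the exact-match rule are illustrative assumptions,
    not the benchmark's actual scoring procedure.
    """
    by_sample = defaultdict(dict)
    for sample_id, modality, answer in predictions:
        # Normalise lightly so "42 " and "42" count as the same answer.
        by_sample[sample_id][modality] = answer.strip().lower()

    # Only score samples that have an answer in every modality.
    complete = [a for a in by_sample.values() if all(m in a for m in MODALITIES)]
    if not complete:
        return 0.0

    consistent = sum(1 for a in complete if len({a[m] for m in MODALITIES}) == 1)
    return consistent / len(complete)


# Hypothetical model outputs: q1 is consistent, q2 differs in the image modality.
example = [
    ("q1", "text", "42"), ("q1", "image", "42"), ("q1", "mixed", "42"),
    ("q2", "text", "7"), ("q2", "image", "9"), ("q2", "mixed", "7"),
]
rate = consistency_rate(example)
```

Note that consistency in this sketch is independent of correctness: a model that gives the same wrong answer in all three modalities still counts as consistent for that sample.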
