Multimodal Inconsistency Reasoning (MMIR): A New Benchmark for Multimodal Reasoning Models

Qianqi Yan; Yue Fan; Hongquan Li; Shan Jiang; Yang Zhao; Xinze Guan; Ching-Chen Kuo; Xin Eric Wang

arXiv:2502.16033 · cs.CL · June 12, 2025

TL;DR

This paper introduces MMIR, a benchmark to evaluate multimodal large language models' ability to detect and reason about semantic inconsistencies in complex, layout-rich visual-textual content, revealing current models' limitations.

Contribution

The paper presents MMIR, a new, challenging benchmark of 534 samples for assessing MLLMs' inconsistency reasoning, and provides a comprehensive evaluation and analysis of state-of-the-art models.

Findings

01. Models with dedicated reasoning capabilities outperform others.

02. Open-source models are particularly vulnerable to inconsistencies.

03. Single-modality prompting yields marginal improvements.

Abstract

Existing Multimodal Large Language Models (MLLMs) are predominantly trained and tested on consistent visual-textual inputs, leaving open the question of whether they can handle inconsistencies in real-world, layout-rich content. To bridge this gap, we propose the Multimodal Inconsistency Reasoning (MMIR) benchmark to assess MLLMs' ability to detect and reason about semantic mismatches in artifacts such as webpages, presentation slides, and posters. MMIR comprises 534 challenging samples, each containing synthetically injected errors across five reasoning-heavy categories: Factual Contradiction, Identity Misattribution, Contextual Mismatch, Quantitative Discrepancy, and Temporal/Spatial Incoherence. We evaluate six state-of-the-art MLLMs, showing that models with dedicated multimodal reasoning capabilities, such as o1, substantially outperform their counterparts while open-source models…
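As a rough illustration of the benchmark structure described in the abstract, the sketch below models a sample tagged with one of the five error categories and computes per-category accuracy. Only the five category names come from the abstract. The `Sample` fields, the element-id format, the prediction format, and the exact-match scoring rule are all assumptions made for illustration; they are not the MMIR schema or the paper's metric.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    # The five error categories named in the abstract.
    FACTUAL_CONTRADICTION = "Factual Contradiction"
    IDENTITY_MISATTRIBUTION = "Identity Misattribution"
    CONTEXTUAL_MISMATCH = "Contextual Mismatch"
    QUANTITATIVE_DISCREPANCY = "Quantitative Discrepancy"
    TEMPORAL_SPATIAL_INCOHERENCE = "Temporal/Spatial Incoherence"


@dataclass(frozen=True)
class Sample:
    # Hypothetical record layout; the real MMIR schema may differ.
    sample_id: str
    artifact_type: str  # e.g. "webpage", "slide", "poster"
    category: Category
    inconsistent_element_ids: frozenset  # element(s) carrying the injected error


def accuracy_by_category(samples, predictions):
    """Exact-match accuracy per category (an assumed scoring rule).

    `predictions` maps sample_id -> iterable of predicted element ids.
    A sample with no prediction counts as wrong.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for s in samples:
        total[s.category] += 1
        predicted = frozenset(predictions.get(s.sample_id, ()))
        if predicted == s.inconsistent_element_ids:
            correct[s.category] += 1
    return {cat: correct[cat] / total[cat] for cat in total}


# Toy data, invented purely to exercise the function.
samples = [
    Sample("s1", "webpage", Category.FACTUAL_CONTRADICTION, frozenset({"e3"})),
    Sample("s2", "poster", Category.FACTUAL_CONTRADICTION, frozenset({"e1", "e4"})),
    Sample("s3", "slide", Category.QUANTITATIVE_DISCREPANCY, frozenset({"e2"})),
]
predictions = {"s1": {"e3"}, "s2": {"e1"}, "s3": {"e2"}}
scores = accuracy_by_category(samples, predictions)
```

With this toy input, the pairwise sample `s2` is scored as wrong because only one of its two inconsistent elements was predicted. A more lenient rule (for example, partial credit per element) would be an equally plausible choice under these assumptions.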

Code & Models

Datasets

rippleripple/MMIR
dataset · 63 downloads

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Advanced Graph Neural Networks