MME-Reasoning: A Comprehensive Benchmark for Logical Reasoning in MLLMs

Jiakang Yuan; Tianshuo Peng; Yilei Jiang; Yiting Lu; Renrui Zhang; Kaituo Feng; Chaoyou Fu; Tao Chen; Lei Bai; Bo Zhang; Xiangyu Yue

arXiv:2505.21327 · cs.AI · May 28, 2025

TL;DR

This paper introduces MME-Reasoning, a comprehensive benchmark for evaluating the logical reasoning abilities of multimodal large language models across inductive, deductive, and abductive reasoning types, revealing current models' limitations.

Contribution

It presents a new benchmark that explicitly categorizes reasoning types and thoroughly evaluates MLLMs, exposing their performance gaps and imbalances in logical reasoning tasks.

Findings

01. State-of-the-art MLLMs show limited reasoning performance.

02. Models exhibit notable imbalances across reasoning types.

03. Existing approaches such as 'thinking mode' and rule-based RL have limited effectiveness.
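
The imbalance across reasoning types noted in the findings can be quantified with a per-type accuracy breakdown. The following is a minimal sketch, not the paper's evaluation code: the record fields `reasoning_type` and `correct`, the helper names, and the toy results are all assumptions made for illustration.

```python
from collections import defaultdict


def accuracy_by_type(records):
    """Return {reasoning_type: accuracy} for an iterable of result records.

    Each record is assumed (hypothetically) to be a dict with a
    'reasoning_type' string and a boolean 'correct' flag.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for rec in records:
        totals[rec["reasoning_type"]] += 1
        hits[rec["reasoning_type"]] += int(rec["correct"])
    return {t: hits[t] / totals[t] for t in totals}


def imbalance(per_type_accuracy):
    """Gap between the best and worst reasoning type (one simple measure)."""
    values = per_type_accuracy.values()
    return max(values) - min(values)


# Made-up toy results, NOT numbers from the paper.
results = [
    {"reasoning_type": "deductive", "correct": True},
    {"reasoning_type": "deductive", "correct": True},
    {"reasoning_type": "inductive", "correct": True},
    {"reasoning_type": "inductive", "correct": False},
    {"reasoning_type": "abductive", "correct": False},
    {"reasoning_type": "abductive", "correct": False},
]

per_type = accuracy_by_type(results)
gap = imbalance(per_type)
print(per_type, gap)
```

The max-minus-min gap is only one possible summary; a benchmark report could equally show the three per-type accuracies side by side.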

Abstract

Logical reasoning is a fundamental aspect of human intelligence and an essential capability for multimodal large language models (MLLMs). Despite the significant advancement in multimodal reasoning, existing benchmarks fail to comprehensively evaluate their reasoning abilities due to the lack of explicit categorization for logical reasoning types and an unclear understanding of reasoning. To address these issues, we introduce MME-Reasoning, a comprehensive benchmark designed to evaluate the reasoning ability of MLLMs, which covers all three types of reasoning (i.e., inductive, deductive, and abductive) in its questions. We carefully curate the data to ensure that each question effectively evaluates reasoning ability rather than perceptual skills or knowledge breadth, and extend the evaluation protocols to cover the evaluation of diverse questions. Our evaluation reveals substantial…

Peer Reviews

Decision · ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 6 · Confidence 2

Strengths

1. The paper looks at an interesting aspect of LLMs' reasoning capabilities. Unlike standard reasoning benchmarks that evaluate models on hard, domain-specific knowledge, the proposed benchmark tries to evaluate the logical reasoning capabilities of LLMs on their own. This distinction can eliminate, to some degree, the knowledge bias that different LLMs have and focus on their logical inference capacities.
2. The paper summarizes a list of insights and findings from evaluating different mod

Weaknesses

1. Since logical inference is an ability common to all human beings, I wonder whether it also makes sense to assess the performance of non-expert humans on this benchmark and compare it with that of the models. The table shows that the models lag behind human experts who are PhD students; I wonder how the models' performance compares with other groups of people. Do the models already achieve on-par performance, or do they still lag behind?

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. The benchmark follows clear design principles (comprehensiveness, going beyond perception, minimizing knowledge dependence, and diverse evaluation formats), and the construction pipeline is well described.
2. Data are sourced from textbooks, logic workbooks, online resources, exams, existing benchmarks, and author-designed/synthetic problems, with manual filtering to remove items that primarily rely on perception or complex domain knowledge.
3. A broad set of recent, representative MLLMs (clo

Weaknesses

1. The core goal of MME-Reasoning is to systematically assess MLLMs’ logical reasoning, explicitly covering inductive, deductive, and abductive types, while striving to decouple perception and domain knowledge from reasoning. However, many similar benchmarks already exist, e.g.:
- Can MLLMs Reason in Multimodality? EMMA: An Enhanced MultiModal Reasoning Benchmark
- VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models
- MM-IQ: Benchmarking Human-Like

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. Curation and metadata richness: The dataset includes multiple dimensions of annotation (reasoning type, difficulty, capability tag, question type), which allows finer-grained analysis of model behavior rather than only single accuracy numbers. This is critical for both the benchmark itself and the evaluated models.
2. Focus on reasoning rather than perception and knowledge: The paper tries to filter out questions that are mostly about image recognition or domain-knowledge recall and push the focus t

Weaknesses

1. Knowledge vs reasoning vs perception boundary: While the authors try to reduce dependence on domain knowledge, the distinction between “reasoning” and “heavy factual knowledge” is somewhat fuzzy. Some tasks may still lean on domain knowledge (e.g., biology diagrams, chemistry processes), which adds potential confounds. Likewise, decoupling reasoning and perception is equally difficult, and I hope to see more evidence of both to prove that the benchmark focuses on reasoning.
2. The novelty may

Code & Models

Datasets

InternScience/MME-Reasoning
dataset · 167 dl

Taxonomy

Topics: Natural Language Processing Techniques · Semantic Web and Ontologies · Multi-Agent Systems and Negotiation