MME-Reasoning: A Comprehensive Benchmark for Logical Reasoning in MLLMs
Jiakang Yuan, Tianshuo Peng, Yilei Jiang, Yiting Lu, Renrui Zhang, Kaituo Feng, Chaoyou Fu, Tao Chen, Lei Bai, Bo Zhang, Xiangyu Yue

TL;DR
This paper introduces MME-Reasoning, a comprehensive benchmark for evaluating the logical reasoning abilities of multimodal large language models across inductive, deductive, and abductive reasoning types, revealing current models' limitations.
Contribution
It presents a new benchmark that explicitly categorizes reasoning types and thoroughly evaluates MLLMs, exposing their performance gaps and imbalances in logical reasoning tasks.
Findings
State-of-the-art MLLMs show limited reasoning performance.
Models exhibit notable imbalances across reasoning types.
Existing approaches such as 'thinking mode' and rule-based RL have limited effectiveness.
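The imbalance finding above can be made concrete with a per-type accuracy breakdown. The following is a minimal sketch, not the authors' evaluation code: the record fields `reasoning_type` and `correct`, the helper names, and the toy data are all assumptions made for illustration, and the numbers are not results from the paper.

```python
from collections import defaultdict


def accuracy_by_type(records):
    """Return {reasoning_type: accuracy} for a list of graded records.

    Each record is assumed (hypothetically) to be a dict with a
    'reasoning_type' string and a boolean 'correct' flag.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in records:
        rtype = record["reasoning_type"]
        totals[rtype] += 1
        correct[rtype] += int(record["correct"])
    return {rtype: correct[rtype] / totals[rtype] for rtype in totals}


def imbalance(per_type):
    """Gap between the best and worst per-type accuracy."""
    return max(per_type.values()) - min(per_type.values())


# Made-up toy records for illustration only; not data from the benchmark.
records = (
    [{"reasoning_type": "deductive", "correct": c} for c in (True, True, True, False)]
    + [{"reasoning_type": "inductive", "correct": c} for c in (True, True, False, False)]
    + [{"reasoning_type": "abductive", "correct": c} for c in (True, False, False, False)]
)

per_type = accuracy_by_type(records)
gap = imbalance(per_type)
print(per_type, gap)
```

Reporting the gap alongside overall accuracy is one simple way to surface a model that looks adequate on average but is weak on one reasoning type.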
Abstract
Logical reasoning is a fundamental aspect of human intelligence and an essential capability for multimodal large language models (MLLMs). Despite the significant advancement in multimodal reasoning, existing benchmarks fail to comprehensively evaluate their reasoning abilities due to the lack of explicit categorization for logical reasoning types and an unclear understanding of reasoning. To address these issues, we introduce MME-Reasoning, a comprehensive benchmark designed to evaluate the reasoning ability of MLLMs, which covers all three types of reasoning (i.e., inductive, deductive, and abductive) in its questions. We carefully curate the data to ensure that each question effectively evaluates reasoning ability rather than perceptual skills or knowledge breadth, and extend the evaluation protocols to cover the evaluation of diverse questions. Our evaluation reveals substantial…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. The paper looks at an interesting aspect of LLMs' reasoning capabilities. Unlike standard reasoning benchmarks that evaluate models on hard, domain-specific knowledge, the proposed benchmark tries to evaluate the logical reasoning capabilities of LLMs on their own. This distinction can eliminate, to some degree, the knowledge bias that different LLMs have and focus on their logical inference capacities.
2. The paper summarizes a list of insights and findings from evaluating different mod
1. Since logical inference is an ability common to all humans, I wonder whether it would also make sense to assess the performance of non-expert humans on this benchmark and compare it with that of the models. The table shows that the models lag behind human experts who are PhD students; how does the models' performance compare with that of other groups of people? Do the models already achieve on-par performance, or do they still lag behind?
1. The benchmark follows clear design principles (comprehensiveness, going beyond perception, minimizing knowledge dependence, and diverse evaluation formats), and the construction pipeline is well described.
2. Data are sourced from textbooks, logic workbooks, online resources, exams, existing benchmarks, and author-designed/synthetic problems, with manual filtering to remove items that primarily rely on perception or complex domain knowledge.
3. A broad set of recent, representative MLLMs (clo
1. The core goal of MME-Reasoning is to systematically assess MLLMs' logical reasoning (explicitly covering inductive, deductive, and abductive types) while striving to decouple perception and domain knowledge from reasoning. However, many similar benchmarks already exist, e.g.:
- Can MLLMs Reason in Multimodality? EMMA: An Enhanced MultiModal Reasoning Benchmark
- VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models
- MM-IQ: Benchmarking Human-Like
1. Curation and metadata richness: The dataset includes multiple dimensions of annotation (reasoning type, difficulty, capability tag, question type), which allows finer-grained analysis of model behavior rather than a single accuracy number. This is critical for both the benchmark itself and the evaluated models.
2. Focus on reasoning rather than perception and knowledge: The paper tries to filter out questions that are mostly about image recognition or domain-knowledge recall and push the focus t
1. Knowledge vs. reasoning vs. perception boundary: While the authors try to reduce dependence on domain knowledge, the distinction between "reasoning" and "heavy factual knowledge" is somewhat fuzzy. Some tasks may still lean on domain knowledge (e.g., biology diagrams, chemistry processes), which adds potential confounds. Likewise, decoupling reasoning from perception is equally difficult, and I hope to see more evidence on both fronts showing that the benchmark focuses on reasoning.
2. The novelty may
Taxonomy
Topics: Natural Language Processing Techniques · Semantic Web and Ontologies · Multi-Agent Systems and Negotiation
