MMLU-Reason: Benchmarking Multi-Task Multi-modal Language Understanding and Reasoning
Guiyao Tie, Xueyang Zhou, Tianhe Gu, Ruihang Zhang, Chaoran Hu, Sizhe Zhang, Mengqu Sun, Yan Zhang, Pan Zhou, Lichao Sun

TL;DR
The paper introduces MMLU-Reason, a benchmark for evaluating reasoning in multi-modal large language models (MLLMs), and uses it to expose their strengths and limitations on complex, multi-hop, and symbolic reasoning tasks.
Contribution
It presents a challenging 1,083-question dataset and a Reasoning Trace Evaluation Pipeline (RTEP) that assesses the quality of a model's reasoning, not just the accuracy of its final answer.
Findings
MLLMs with intermediate thinking traces (MLLMs-T) outperform non-thinking models on reasoning tasks
Top models still exhibit reasoning pathologies such as inconsistency and overthinking
The benchmark exposes gaps between answer accuracy and reasoning quality
Abstract
Recent advances in Multi-Modal Large Language Models (MLLMs) have enabled unified processing of language, vision, and structured inputs, opening the door to complex tasks such as logical deduction, spatial reasoning, and scientific analysis. Despite their promise, the reasoning capabilities of MLLMs, particularly those augmented with intermediate thinking traces (MLLMs-T), remain poorly understood and lack standardized evaluation benchmarks. Existing work focuses primarily on perception or final answer correctness, offering limited insight into how models reason or fail across modalities. To address this gap, we introduce MMLU-Reason, a new benchmark designed to rigorously evaluate multi-modal reasoning with explicit thinking. MMLU-Reason comprises 1) a high-difficulty dataset of 1,083 questions spanning six diverse reasoning types with symbolic depth and multi-hop demands and…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. Comprehensive and High-Difficulty Benchmark Design: MMMR covers six diverse reasoning domains with carefully curated, high-complexity tasks, ensuring broad and deep evaluation of multi-modal reasoning. Its inclusion of symbolic, spatial, and scientific reasoning makes it a strong diagnostic tool beyond perception-based benchmarks. 2. Innovative Evaluation Pipeline for Reasoning Quality: The proposed Reasoning Trace Evaluation Pipeline (RTEP) introduces structured metrics—RTQ, RTA, and RSC—to
1. As a key evaluation component of this paper, the description of how RTA, RTQ, and RSC scores are determined lacks sufficient detail and example prompts. The statement “each rated on a 0–10 scale. Scores combine rule-based checklists and semantic checks. Leveraging GPT-4o as an automated evaluator” is overly brief and does not clearly specify the evaluation criteria. 2. The paper does not provide enough information on the data collection process, data sources, or filtering standards, raising
1. The introduction of MMMR as a new benchmark for multi-modal reasoning is a significant contribution. By focusing on reasoning depth rather than just accuracy, the authors address an important gap in the evaluation of Multi-Modal Large Language Models (MLLMs). 2. The Reasoning Trace Evaluation Pipeline (RTEP) is a novel addition, offering insights into the reasoning process, providing valuable metrics for understanding the coherence, consistency, and relevance of thinking traces, which is a ke
1. There is a lack of clear discussion on how to address or mitigate the reasoning pathologies such as overthinking and inconsistency, which are frequently observed in the models evaluated. 2. Issues with Figure Layout: The layout of the figures in the paper is problematic. The arrangement of the diagrams/figures does not appear to be optimal and affects the overall presentation. It would be beneficial to revise the figure placements and spacing to ensure clarity and improve the visual flow of
1. **Pioneering a Critical Evaluation Paradigm:** The paper's strength is its focus on evaluating the **reasoning process** rather than just the final answer correctness. It correctly identifies this as a major gap in existing MLLM benchmarks and proposes a structured way to address it. This shifts the conversation from "Did the model get it right?" to "How did the model arrive at its conclusion, and was the process sound?" 2. **Conceptual Innovation of the Reasoning Trace Evaluation Pipeline
1. **Misleading Title:** The benchmark contains 1,083 questions in total (977 in the test set). In the context of modern LLM benchmarks (e.g., MMLU with >14k questions, BIG-bench with >200 tasks), this size is small. A small, high-quality, deeply annotated dataset is valuable, but it is still small. 2. **Limited Scope of Analysis:** The most novel and interesting analyses are presented as case studies on a very small number of models. * **Thinking Quality Analysis (Table 4):** The detail
1. It shifts the evaluation from what the answer is to how the model arrived at it. This provides a much deeper understanding of model failures. 2. The RTEP framework (with metrics like RTQ, RTA, and RSC) offers a structured way to quantify reasoning quality beyond simple accuracy. 3. The benchmark's 1,083 questions are specifically designed to be challenging and require multi-step, multi-modal reasoning across six distinct and complex domains. 4. It categorizes the types of errors they make (e.
Experimental Limitations: The authors explicitly state (Section 2.3) that "Due to API restrictions, statistical significance tests were limited for closed-source models," which is a minor weakness in the completeness of the experimental analysis.
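For context on the significance-testing point above, here is a minimal sketch of one common option, a one-sided paired bootstrap over per-question correctness. It is not the authors' procedure; the function name `paired_bootstrap`, its parameters, and the toy inputs are assumptions made for illustration. A test of this kind needs only one correct/incorrect label per question from each model.

```python
import random


def paired_bootstrap(correct_a, correct_b, n_resamples=2000, seed=0):
    """One-sided paired bootstrap on per-question correctness.

    correct_a, correct_b: equal-length sequences of booleans (or 0/1),
    aligned so that index i refers to the same question for both models.
    Returns (observed accuracy difference A - B, fraction of resamples
    in which A is NOT better than B). Hypothetical helper, for illustration.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both models must be scored on the same questions")
    n = len(correct_a)
    if n == 0:
        raise ValueError("need at least one question")

    rng = random.Random(seed)  # fixed seed so the result is reproducible
    # Per-question difference: +1 if only A is right, -1 if only B is right.
    diffs = [int(a) - int(b) for a, b in zip(correct_a, correct_b)]
    observed = sum(diffs) / n

    not_better = 0
    for _ in range(n_resamples):
        # Resample n questions with replacement and sum their differences.
        total = sum(diffs[rng.randrange(n)] for _ in range(n))
        if total <= 0:
            not_better += 1
    return observed, not_better / n_resamples


if __name__ == "__main__":
    # Toy labels only; these are not results from the paper.
    model_a = [True, True, False, True, True, False, True, True]
    model_b = [True, False, False, True, False, False, True, False]
    diff, p = paired_bootstrap(model_a, model_b)
    print(f"accuracy difference: {diff:.3f}, bootstrap p-value: {p:.3f}")
```

A small returned fraction suggests that model A's advantage is unlikely to be an artifact of which questions happened to be sampled. With a test set of only 977 questions, small accuracy gaps between models may not clear this bar.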
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
