MMReason: An Open-Ended Multi-Modal Multi-Step Reasoning Benchmark for MLLMs Toward AGI

Huanjin Yao; Jiaxing Huang; Yawen Qiu; Michael K. Chen; Wenzheng Liu; Wei Zhang; Wenjie Zeng; Xikun Zhang; Jingyi Zhang; Yuxin Song; Wenhao Wu; Dacheng Tao

arXiv:2506.23563·cs.AI·July 1, 2025

TL;DR

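MMReason is a benchmark for evaluating long-chain reasoning in multimodal large language models using diverse, challenging, open-ended questions. It targets three limitations of existing benchmarks: limited difficulty and diversity, susceptibility to guessing and memorization, and inadequate assessment of intermediate reasoning steps.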

Contribution

The paper introduces MMReason, a new benchmark of diverse, multi-disciplinary questions spanning multiple difficulty levels, together with multi-model filtering and detailed step-by-step annotations for robust reasoning assessment.

Findings

1. Benchmarking reveals that current MLLMs have limited reasoning capabilities.
2. The dataset's multi-step questions challenge models beyond simple pattern recognition.
3. The scoring mechanism provides a reliable assessment of intermediate reasoning steps.

Abstract

Reasoning plays a crucial role in advancing Multimodal Large Language Models (MLLMs) toward Artificial General Intelligence. However, existing MLLM benchmarks often fall short in precisely and comprehensively evaluating long-chain reasoning abilities from three key aspects: (1) lack of difficulty and diversity, (2) susceptibility to guessability and memorization, (3) inadequate assessment of intermediate reasoning steps. To fill this gap, we introduce MMReason, a new benchmark designed to precisely and comprehensively evaluate MLLM long-chain reasoning capability with diverse, open-ended, challenging questions. First, we curate challenging questions requiring multi-step reasoning from various fields (i.e., 6 disciplines) and multiple difficulty levels (i.e., from pre-university to university, and from foundational to competition tiers). Second, these questions are reformulated into an…

Taxonomy

Topics: Text Readability and Simplification · Topic Modeling · Natural Language Processing Techniques