ErrorRadar: Benchmarking Complex Mathematical Reasoning of Multimodal Large Language Models Via Error Detection

Yibo Yan; Shen Wang; Jiahao Huo; Hang Li; Boyan Li; Jiamin Su; Xiong Gao; Yi-Fan Zhang; Tianlong Xu; Zhendong Chu; Aoxiao Zhong; Kun Wang; Hui Xiong; Philip S. Yu; Xuming Hu; Qingsong Wen

arXiv:2410.04509·cs.CL·April 21, 2026·2 cites

TL;DR

ErrorRadar is a new benchmark designed to evaluate multimodal large language models' ability to detect and categorize errors in complex mathematical reasoning tasks, highlighting current performance gaps.

Contribution

The paper introduces ErrorRadar, the first benchmark for multimodal error detection in mathematical reasoning, with a comprehensive dataset and evaluation framework.

Findings

01. GPT-4o performs about 10% below human evaluators.

02. Current MLLMs face significant challenges in complex error detection tasks.

03. ErrorRadar reveals substantial room for improvement in multimodal mathematical reasoning.

Abstract

As the field of Multimodal Large Language Models (MLLMs) continues to evolve, their potential to revolutionize artificial intelligence is particularly promising, especially in addressing mathematical reasoning tasks. Current mathematical benchmarks predominantly focus on evaluating MLLMs' problem-solving ability, yet there is a crucial gap in addressing more complex scenarios such as error detection, for enhancing reasoning capability in complicated settings. To fill this gap, we formally formulate the new task: multimodal error detection, and introduce ErrorRadar, the first benchmark designed to assess MLLMs' capabilities in such a task. ErrorRadar evaluates two sub-tasks: error step identification and error categorization, providing a comprehensive framework for evaluating MLLMs' complex mathematical reasoning ability. It consists of 2,500 high-quality multimodal K-12 mathematical…
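
As a rough illustration of how the two sub-tasks named in the abstract (error step identification and error categorization) might be scored, the sketch below computes plain accuracy for each. The record fields, the function name, the placeholder category labels, and the choice of accuracy as the metric are all assumptions made for illustration; the abstract as shown here does not specify them, and this is not the authors' evaluation code.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass
class ErrorDetectionRecord:
    """One evaluated problem (hypothetical schema, not the benchmark's own)."""
    gold_step: int       # index of the first incorrect step, per annotation
    pred_step: int       # step index the model flagged as incorrect
    gold_category: str   # annotated error category (placeholder label)
    pred_category: str   # error category predicted by the model


def subtask_accuracies(records: Sequence[ErrorDetectionRecord]) -> Tuple[float, float]:
    """Return (step identification accuracy, categorization accuracy)."""
    if not records:
        raise ValueError("need at least one record")
    n = len(records)
    step_hits = sum(r.pred_step == r.gold_step for r in records)
    category_hits = sum(r.pred_category == r.gold_category for r in records)
    return step_hits / n, category_hits / n


# Toy data with placeholder category labels "A" and "B" (assumed, not from the paper).
examples = [
    ErrorDetectionRecord(gold_step=2, pred_step=2, gold_category="A", pred_category="A"),
    ErrorDetectionRecord(gold_step=3, pred_step=3, gold_category="B", pred_category="A"),
    ErrorDetectionRecord(gold_step=1, pred_step=4, gold_category="A", pred_category="A"),
    ErrorDetectionRecord(gold_step=5, pred_step=5, gold_category="B", pred_category="A"),
]
step_acc, category_acc = subtask_accuracies(examples)
print(f"step accuracy: {step_acc:.2f}, category accuracy: {category_acc:.2f}")
```

Scoring the two sub-tasks separately keeps a model that locates the faulty step but mislabels its type distinguishable from one that fails at both.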
