MathReal: We Keep It Real! A Real Scene Benchmark for Evaluating Math Reasoning in Multimodal Large Language Models

Jun Feng; Zixin Wang; Zhentao Zhang; Yue Guo; Zhihan Zhou; Xiuyi Chen; Zhenyang Li; Dawei Yin

arXiv:2508.06009·cs.CV·August 11, 2025

TL;DR

MathReal is a new benchmark dataset with real-world educational images to evaluate multimodal large language models' math reasoning, revealing their challenges and guiding future improvements.

Contribution

The paper introduces MathReal, a realistic, diverse dataset of 2,000 educational images, and provides a systematic evaluation of current multimodal models' math reasoning in real-world scenarios.

Findings

01. Existing MLLMs struggle with real-world educational images.

02. Performance varies across image quality and question difficulty.

03. The analysis offers insights into recognition, comprehension, and reasoning errors.

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in visual mathematical reasoning across various existing benchmarks. However, these benchmarks are predominantly based on clean or processed multimodal inputs, without incorporating the images provided by real-world Kindergarten through 12th grade (K-12) educational users. To address this gap, we introduce MathReal, a meticulously curated dataset comprising 2,000 mathematical questions with images captured by handheld mobile devices in authentic scenarios. Each question is an image, containing the question text and visual element. We systematically classify the real images into three primary categories: image quality degradation, perspective variation, and irrelevant content interference, which are further delineated into 14 subcategories. Additionally, MathReal spans five core knowledge and ability…
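The three primary categories named in the abstract suggest a simple per-category breakdown of model accuracy. The following is a minimal sketch of such a breakdown, not the authors' evaluation code. The record layout (`categories`, `correct`), the function name `accuracy_by_category`, and the assumption that one image may carry several category tags are all hypothetical, and the example records are made up.

```python
from collections import defaultdict

# The three primary categories named in the abstract.
PRIMARY_CATEGORIES = (
    "image quality degradation",
    "perspective variation",
    "irrelevant content interference",
)


def accuracy_by_category(records):
    """Return {category: accuracy} over the records tagged with that category.

    Each record is a hypothetical dict (an assumed layout, not the dataset's
    actual schema) with:
      "categories": list of primary category names present in the image
      "correct":    bool, whether the model's answer was judged correct
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for rec in records:
        for cat in rec["categories"]:
            if cat not in PRIMARY_CATEGORIES:
                raise ValueError(f"unknown category: {cat!r}")
            totals[cat] += 1
            hits[cat] += int(rec["correct"])
    # Only categories that occur at least once appear in the result.
    return {cat: hits[cat] / totals[cat] for cat in totals}


# Made-up records for illustration only; these are not MathReal data.
example = [
    {"categories": ["image quality degradation"], "correct": True},
    {"categories": ["image quality degradation", "perspective variation"],
     "correct": False},
    {"categories": ["irrelevant content interference"], "correct": True},
]
result = accuracy_by_category(example)
```

Because a record with several tags is counted once under each tag, the per-category totals can sum to more than the number of records; that is a design choice of this sketch, not something the source specifies.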

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 6 · Confidence 3

Strengths

It is nice that a portion of this dataset includes pairs of realistic images and cleaned up images, to help researchers better pinpoint weaknesses of MLLMs in a somewhat controlled manner. I also liked the inclusion of metadata around fine-grained subtypes of image noise (e.g. blur, rotation), and annotator-written descriptions of math problem figures (so that the content of this benchmark can also be evaluated in an image-less, text-only manner). The authors also provide really extensive benchm

Weaknesses

I would have liked to see more details around the provenance of this dataset. I know that exact provenance may be difficult to give due to anonymity constraints, but a few details are missing. For instance, where are these students from (country, region, how many schools)? What languages are present in the data; Figure 1 suggests math images contain Chinese but QA is in English, while Figure 3 says the data is bilingual, but doesn’t specify language? (And if the data is bilingual, does a crosslin

Reviewer 02 · Rating 2 · Confidence 3

Strengths

* **Originality and Significance:** The paper's primary strength lies in its novel contribution of a "real-world" benchmark. While numerous multimodal math benchmarks exist (e.g., MathVista, MathVerse), they predominantly use clean, synthetic, or post-processed images.
* **Quality:** The authors employ a rigorous multi-stage process involving: (1) automated and multi-model (GPT-4o, Doubao, Qwen) filtering to ensure data relevance; (2) a three-stage, fully manual annotation process on a dedi

Weaknesses

* **Dataset Language and Generalizability:** A significant limitation, which is only mentioned deep in the appendix (Section C.2, line 913), is that all questions are in Chinese. This should be stated clearly in the abstract and introduction. While a high-quality Chinese benchmark is valuable, this linguistic constraint limits the dataset's immediate utility as a general, global benchmark for evaluating MLLMs, many of which have an English-centric pre-training corpus. The performance of these

Reviewer 03 · Rating 2 · Confidence 4

Strengths

1. The paper tackles a practically meaningful problem: evaluating MLLMs under realistic conditions where image quality and layout are imperfect.
2. The dataset is systematically annotated with visual degradation categories, educational levels, and question types, providing a structured way to analyze failure modes.
3. The experimental coverage is extensive, including 40 models and both open- and closed-source ones, with a consistent evaluation protocol.
4. The error taxonomy (OCR, perception, rea

Weaknesses

1. The paper overclaims novelty by stating MathReal is “the first real-world benchmark” for visual math reasoning. The ACM MM 2025 paper https://dl.acm.org/doi/10.1145/3746027.3758240 introduces similar “real-scene” or “in-the-wild” multimodal math datasets that already explore authentic photo-based or user-captured math scenarios. These should be cited and compared directly.
2. The dataset scale (2,000 images) is relatively small and may not justify the “benchmark” positioning without stron

Reviewer 04 · Rating 4 · Confidence 4

Strengths

1. The paper clearly points out that existing mathematical multimodal reasoning benchmarks mainly focus on 'clean image' scenarios and lack tests in real educational settings (problems photographed by K–12 students on their phones), making the motivation reasonable and practically significant. It emphasizes that noise in real images (such as blurriness, perspective changes, and handwritten interference) is indeed a current weakness of MLLMs.
2. Covers 40 MLLMs, including both open-source and clo

Weaknesses

1. All samples come from a K–12 context, so the benchmark's ability to generalize to international or higher-level tasks is limited. The benchmark lacks classifications for subjects and fields, as well as an analysis of how the impact of image noise varies across them.
2. Although the distribution of error types is illustrated, there is a lack of quantitative analysis (for example, the contribution of different noise types to OCR errors). Which specific noise types make the greatest difference?
3. Compared with existi

Code & Models

Datasets

junfeng0288/MathReal
dataset· 99 dl

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Educational Tools and Methods