Unmasking Reasoning Processes: A Process-aware Benchmark for Evaluating Structural Mathematical Reasoning in LLMs

Xiang Zheng; Weiqi Zhai; Wei Wang; Boyu Yang; Wenbo Li; Ruixiang Luo; Haoxiang Sun; Yucheng Wang; Zhengze Li; Meng Wang; Yuetian Du; Guojie Lin; Yaxuan Wang; Xiaoxiao Xu; Yanhu Mo; Xuan Ren; Hu Wei; and Bing Zhao

arXiv:2602.00564 · cs.AI · February 27, 2026

Open Access

TL;DR

This paper introduces ReasoningMath-Plus, a new benchmark with annotated reasoning processes to better evaluate structural mathematical reasoning in large language models, revealing limitations of current answer-only metrics.

Contribution

It presents a novel dataset with reasoning skeletons, a deterministic scoring function, and a process reward model to assess reasoning beyond final answers.

Findings

01. Models achieve high answer accuracy but low process-based scores.

02. Answer-only metrics overestimate models' reasoning capabilities.

03. The benchmark highlights the gap between final answers and reasoning quality.

Abstract

Recent large language models (LLMs) achieve near-saturation accuracy on many established mathematical reasoning benchmarks, raising concerns about whether these benchmarks can still diagnose genuine reasoning competence. This saturation largely stems from the dominance of template-based computation and shallow arithmetic decomposition in existing datasets, which underrepresent reasoning skills such as multi-constraint coordination, constructive logical synthesis, and spatial inference. To address this gap, we introduce ReasoningMath-Plus, a benchmark of 150 carefully curated problems explicitly designed to evaluate structural reasoning. Each problem emphasizes reasoning under interacting constraints, constructive solution formation, or non-trivial structural insight, and is annotated with a minimal reasoning skeleton to support fine-grained process-level evaluation. Alongside the dataset, we introduce…

Taxonomy

Topics: Topic Modeling · Intelligent Tutoring Systems and Adaptive Learning · Machine Learning in Materials Science