Evaluating Mathematical Reasoning Beyond Accuracy
Shijie Xia, Xuefeng Li, Yixin Liu, Tongshuang Wu, Pengfei Liu

TL;DR
This paper introduces ReasonEval, a methodology for evaluating the quality of the reasoning steps that large language models produce on mathematical tasks, rather than only final-answer accuracy.
Contribution
The paper presents ReasonEval, an automatic evaluation framework for reasoning quality that outperforms baseline methods and generalizes well across mathematical reasoning tasks.
Findings
ReasonEval outperforms baseline methods on meta-evaluation datasets.
Higher final-answer accuracy does not always mean better reasoning quality.
ReasonEval can effectively guide data selection and improve reasoning assessment.
Abstract
The leaderboard of Large Language Models (LLMs) in mathematical tasks has been continuously updated. However, the majority of evaluations focus solely on the final results, neglecting the quality of the intermediate steps. This oversight can mask underlying problems, such as logical errors or unnecessary steps in the reasoning process. To measure reasoning beyond final-answer accuracy, we introduce ReasonEval, a new methodology for evaluating the quality of reasoning steps. ReasonEval employs validity and redundancy to characterize the reasoning quality, as well as accompanying LLMs to assess them automatically. We explore different design options for the LLM-based evaluators and empirically demonstrate that ReasonEval, when instantiated with base models possessing strong mathematical knowledge and trained with high-quality labeled data, consistently outperforms baseline methods in the…
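As an illustration of how step-level validity and redundancy scores might be combined into a solution-level judgment, here is a minimal Python sketch. It is not taken from the text above: the `StepScore` type, the min/max aggregation rule, and the thresholds in `keep_for_training` are all assumptions chosen for illustration. The step scores are supplied by hand, whereas in practice they would come from an LLM-based evaluator.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class StepScore:
    """Hypothetical per-step scores, assumed to lie in [0, 1]."""
    validity: float    # higher = step is more likely logically correct
    redundancy: float  # higher = step is more likely unnecessary


def solution_scores(steps: List[StepScore]) -> Tuple[float, float]:
    """Aggregate step scores into (validity, redundancy) for a whole solution.

    Assumed rule: a solution is only as valid as its weakest step (min),
    and as redundant as its most redundant step (max).
    """
    if not steps:
        raise ValueError("a solution needs at least one step")
    validity = min(s.validity for s in steps)
    redundancy = max(s.redundancy for s in steps)
    return validity, redundancy


def keep_for_training(
    steps: List[StepScore],
    min_validity: float = 0.5,    # hypothetical threshold
    max_redundancy: float = 0.5,  # hypothetical threshold
) -> bool:
    """Data-selection filter: keep solutions that are valid and not redundant."""
    validity, redundancy = solution_scores(steps)
    return validity >= min_validity and redundancy <= max_redundancy


# A solution whose final answer may be correct but that contains one
# shaky step and one unnecessary step.
flawed = [StepScore(0.9, 0.1), StepScore(0.4, 0.2), StepScore(0.95, 0.7)]
clean = [StepScore(0.9, 0.1), StepScore(0.8, 0.2)]

flawed_scores = solution_scores(flawed)
clean_scores = solution_scores(clean)
```

Under these assumptions, `flawed` is rejected by `keep_for_training` even if its final answer is right, which mirrors the finding that final-answer accuracy and reasoning quality can diverge.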
Taxonomy
Topics: Mathematics Education and Teaching Techniques
Methods: Focus · Balanced Selection
