Evaluating Mathematical Reasoning Beyond Accuracy

Shijie Xia; Xuefeng Li; Yixin Liu; Tongshuang Wu; Pengfei Liu

arXiv:2404.05692 · cs.CL · January 15, 2025 · 1 citation

TL;DR

This paper introduces ReasonEval, a methodology for evaluating the quality of the reasoning steps that large language models produce on mathematical tasks, going beyond final-answer accuracy.

Contribution

The paper presents ReasonEval, an automatic evaluation framework for reasoning quality that outperforms baselines and demonstrates strong generalization in mathematical reasoning tasks.

Findings

1. ReasonEval outperforms baseline methods on meta-evaluation datasets.
2. Higher final-answer accuracy does not always mean better reasoning quality.
3. ReasonEval can effectively guide data selection and improve reasoning assessment.

Abstract

The leaderboard of Large Language Models (LLMs) in mathematical tasks has been continuously updated. However, the majority of evaluations focus solely on the final results, neglecting the quality of the intermediate steps. This oversight can mask underlying problems, such as logical errors or unnecessary steps in the reasoning process. To measure reasoning beyond final-answer accuracy, we introduce ReasonEval, a new methodology for evaluating the quality of reasoning steps. ReasonEval employs validity and redundancy to characterize the reasoning quality, as well as accompanying LLMs to assess them automatically. We explore different design options for the LLM-based evaluators and empirically demonstrate that ReasonEval, when instantiated with base models possessing strong mathematical knowledge and trained with high-quality labeled data, consistently outperforms baseline methods in the…
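As a rough illustration of how step-level validity and redundancy could be combined into a solution-level judgment, the sketch below assumes an evaluator that assigns each reasoning step a probability over three hypothetical labels (positive, neutral, negative). The label names, the `StepProbs` type, and the min/max aggregation rule are assumptions made for this example; the summary above does not specify them, so this should not be read as the authors' implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class StepProbs:
    """Hypothetical evaluator output for one reasoning step."""
    positive: float  # step is correct and contributes to the solution
    neutral: float   # step is correct but unnecessary
    negative: float  # step contains an error


def step_scores(p: StepProbs) -> Tuple[float, float]:
    """Return (validity, redundancy) for a single step.

    Assumption: a step is valid if it is not erroneous, and
    redundant if it is correct but unnecessary.
    """
    validity = p.positive + p.neutral
    redundancy = p.neutral
    return validity, redundancy


def solution_scores(steps: List[StepProbs]) -> Tuple[float, float]:
    """Aggregate step scores into solution-level scores.

    Assumption: a solution is only as valid as its weakest step (min)
    and as redundant as its most redundant step (max).
    """
    if not steps:
        raise ValueError("a solution needs at least one step")
    per_step = [step_scores(p) for p in steps]
    validity = min(v for v, _ in per_step)
    redundancy = max(r for _, r in per_step)
    return validity, redundancy


# Made-up probabilities for a three-step solution.
example = [
    StepProbs(positive=0.90, neutral=0.05, negative=0.05),
    StepProbs(positive=0.20, neutral=0.70, negative=0.10),
    StepProbs(positive=0.10, neutral=0.10, negative=0.80),
]
validity, redundancy = solution_scores(example)
```

Under these assumptions, a solution whose final answer happens to be correct can still receive a low validity score if any single step is likely to be wrong, which is the kind of gap between accuracy and reasoning quality the abstract describes.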

Code & Models

Repositories

gair-nlp/reasoneval (PyTorch, official)

Taxonomy

Topics: Mathematics Education and Teaching Techniques

Methods: Focus · Balanced Selection