ReasonAgain: Using Extractable Symbolic Programs to Evaluate Mathematical Reasoning
Xiaodong Yu, Ben Zhou, Hao Cheng, Dan Roth

TL;DR
This paper introduces ReasonAgain, a novel approach using extractable symbolic programs for automated evaluation of mathematical reasoning in language models, revealing their reasoning fragility.
Contribution
It proposes extracting symbolic programs from datasets to evaluate models' reasoning across varied inputs, highlighting limitations of current static evaluation methods.
Findings
Models show significant accuracy drops with program-based evaluation.
Extracted programs encapsulate proper reasoning for math questions.
Evaluation reveals fragility in state-of-the-art LLMs' reasoning.
Abstract
Existing math datasets evaluate the reasoning abilities of large language models (LLMs) by using either the final answer or the intermediate reasoning steps derived from static examples. However, the former approach fails to surface a model's use of shortcuts and wrong reasoning, while the latter poses challenges in accommodating alternative solutions. In this work, we seek to use symbolic programs as a means of automatically evaluating whether a model can consistently produce correct final answers across various inputs to the program. We begin by extracting programs for popular math datasets (GSM8K and MATH) using GPT4-o. Executable programs that are verified against the original input-output pairs are found to encapsulate the proper reasoning required to solve the original text questions. We then prompt GPT4-o to generate new questions using alternative input-output pairs based the…
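The pipeline described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the toy question, the `program`, `verify`, `perturb`, and `consistent` names, the input ranges, and the reading of "consistently" as "correct on every variant". In the paper the programs and the rewritten questions are produced by GPT4-o; in this sketch the program is written by hand and a model is just a callable that receives the variant's inputs directly.

```python
import random
from typing import Callable, Dict, List

Inputs = Dict[str, int]

# Hypothetical extracted program for a made-up question (not from the paper):
# "Pencils cost `price` dollars each. Sam buys `count` pencils and pays
#  with `paid` dollars. How much change does Sam get?"
def program(price: int, count: int, paid: int) -> int:
    return paid - price * count

def verify(prog: Callable[..., int], inputs: Inputs, expected: int) -> bool:
    """Keep a program only if it reproduces the original answer."""
    return prog(**inputs) == expected

def perturb(rng: random.Random, n: int) -> List[Inputs]:
    """Sample n alternative inputs; ranges are arbitrary assumptions."""
    variants = []
    for _ in range(n):
        price = rng.randint(1, 9)
        count = rng.randint(1, 9)
        # Ensure the payment covers the cost so the change is non-negative.
        paid = rng.randint(price * count, 100)
        variants.append({"price": price, "count": count, "paid": paid})
    return variants

def consistent(model: Callable[[Inputs], int],
               prog: Callable[..., int],
               variants: List[Inputs]) -> bool:
    """A model passes only if it matches the program on every variant."""
    return all(model(v) == prog(**v) for v in variants)

# Stand-in "models": one that actually computes, one that memorized
# the answer to the original question.
def reasoning_model(v: Inputs) -> int:
    return v["paid"] - v["price"] * v["count"]

def memorizing_model(v: Inputs) -> int:
    return 4  # answer to the original question only
```

With the original pair `{"price": 2, "count": 3, "paid": 10}` and answer 4, both stand-in models are correct on the static example, but only `reasoning_model` stays correct once the inputs change. This is the kind of gap that final-answer evaluation on a single static example cannot surface.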
Taxonomy
Topics: Mathematics Education and Teaching Techniques · Intelligent Tutoring Systems and Adaptive Learning · Evolutionary Algorithms and Applications
