Right Is Not Enough: The Pitfalls of Outcome Supervision in Training LLMs for Math Reasoning
Jiaxing Guo, Wenjie Yang, Shengzhong Zhang, Tongshan Xu, Lun Du, Da Zheng, Zengfeng Huang

TL;DR
This paper highlights the limitations of outcome-based supervision in training math-solving LLMs, introduces a new dataset to expose reasoning flaws, and proposes a step-by-step verification method to improve reasoning accuracy.
Contribution
It introduces the MathOlympiadEval dataset, whose fine-grained annotations support detailed analysis of reasoning quality, and ParaStepVerifier, a new method for verifying the individual reasoning steps of LLM-generated mathematical solutions.
Findings
ParaStepVerifier significantly improves detection of flawed reasoning.
Existing automated judges struggle to identify reasoning errors.
Models often produce correct answers through unsound reasoning.
Abstract
Outcome-rewarded Large Language Models (LLMs) have demonstrated remarkable success in mathematical problem-solving. However, this success often masks a critical issue: models frequently achieve correct answers through fundamentally unsound reasoning processes, a phenomenon indicative of reward hacking. We introduce MathOlympiadEval, a new dataset with fine-grained annotations, which reveals a significant gap between LLMs' answer correctness and their low process correctness. Existing automated methods like LLM-as-a-judge struggle to reliably detect these reasoning flaws. To address this, we propose ParaStepVerifier, a novel methodology for meticulous, step-by-step verification of mathematical solutions. ParaStepVerifier identifies incorrect reasoning steps. Empirical results demonstrate that ParaStepVerifier substantially improves the accuracy of identifying flawed solutions compared to…
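The gap between answer correctness and process correctness described in the abstract can be illustrated with a small sketch. This is not the paper's code or ParaStepVerifier itself. The `Solution` record, its per-step labels, and the helper names are hypothetical, and the sketch assumes a solution counts as process-correct only when its final answer is right and no step is flagged as flawed.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Solution:
    """Hypothetical record: one model solution with per-step soundness labels."""
    answer_correct: bool     # does the final answer match the reference?
    step_labels: List[bool]  # True where a step was judged sound (assumed annotation)


def first_flawed_step(sol: Solution) -> Optional[int]:
    """Return the index of the first step judged unsound, or None if all are sound."""
    for i, ok in enumerate(sol.step_labels):
        if not ok:
            return i
    return None


def accuracy_gap(solutions: List[Solution]) -> Tuple[float, float]:
    """Return (answer accuracy, process accuracy) over a list of solutions."""
    if not solutions:
        raise ValueError("need at least one solution")
    n = len(solutions)
    answer_acc = sum(s.answer_correct for s in solutions) / n
    # Assumption: process-correct means right answer AND no flawed step.
    process_acc = sum(
        s.answer_correct and first_flawed_step(s) is None for s in solutions
    ) / n
    return answer_acc, process_acc


# Toy labels, invented for illustration only.
solutions = [
    Solution(True, [True, True]),
    Solution(True, [True, False, True]),  # right answer, unsound step
    Solution(False, [False]),
    Solution(True, [True]),
]
answer_acc, process_acc = accuracy_gap(solutions)
```

On these toy labels the second solution reaches the right answer through a flawed step, so it counts toward answer accuracy but not toward process accuracy. That is the kind of case outcome-only supervision cannot distinguish.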
Taxonomy
Topics: Topic Modeling · Mathematics, Computing, and Information Processing · Intelligent Tutoring Systems and Adaptive Learning
