Right Is Not Enough: The Pitfalls of Outcome Supervision in Training LLMs for Math Reasoning

Jiaxing Guo; Wenjie Yang; Shengzhong Zhang; Tongshan Xu; Lun Du; Da Zheng; Zengfeng Huang

arXiv:2506.06877·cs.CL·June 25, 2025

TL;DR

This paper highlights the limitations of outcome-based supervision in training math-solving LLMs, introduces a new dataset that exposes reasoning flaws behind correct answers, and proposes a step-by-step verification method that detects flawed reasoning more accurately.

Contribution

The paper introduces MathOlympiadEval, a dataset with fine-grained annotations for detailed analysis, and ParaStepVerifier, a new method for verifying the individual reasoning steps of LLM-generated mathematical solutions.

Findings

01 ParaStepVerifier significantly improves detection of flawed reasoning.

02 Existing automated judges struggle to identify reasoning errors.

03 Models often produce correct answers through unsound reasoning.

Abstract

Outcome-rewarded Large Language Models (LLMs) have demonstrated remarkable success in mathematical problem-solving. However, this success often masks a critical issue: models frequently achieve correct answers through fundamentally unsound reasoning processes, a phenomenon indicative of reward hacking. We introduce MathOlympiadEval, a new dataset with fine-grained annotations, which reveals a significant gap between LLMs' answer correctness and their low process correctness. Existing automated methods like LLM-as-a-judge struggle to reliably detect these reasoning flaws. To address this, we propose ParaStepVerifier, a novel methodology for meticulous, step-by-step verification of mathematical solutions. ParaStepVerifier identifies incorrect reasoning steps. Empirical results demonstrate that ParaStepVerifier substantially improves the accuracy of identifying flawed solutions compared to…
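
The gap between answer correctness and process correctness described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `AnnotatedSolution` record, its field names, the toy data, and the rule that a solution counts as process-correct only when its final answer is right and every annotated step is sound. It is not the paper's annotation schema or its evaluation code.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class AnnotatedSolution:
    """Hypothetical record: one model solution with per-step annotations."""
    answer_correct: bool      # final answer matches the reference answer
    step_labels: List[bool]   # True = step judged sound, False = flawed step


def correctness_gap(solutions: List[AnnotatedSolution]) -> Tuple[float, float]:
    """Return (answer accuracy, process accuracy) over a set of solutions.

    Assumption: a solution is process-correct only if its final answer is
    correct AND every annotated step is sound.
    """
    n = len(solutions)
    answer_acc = sum(s.answer_correct for s in solutions) / n
    process_acc = sum(
        s.answer_correct and all(s.step_labels) for s in solutions
    ) / n
    return answer_acc, process_acc


# Toy data, invented for illustration only.
solutions = [
    AnnotatedSolution(True, [True, True, True]),   # right answer, sound steps
    AnnotatedSolution(True, [True, False, True]),  # right answer, one flawed step
    AnnotatedSolution(True, [False, True]),        # right answer, flawed start
    AnnotatedSolution(False, [True, False]),       # wrong answer
]

answer_acc, process_acc = correctness_gap(solutions)
print(f"answer accuracy:  {answer_acc:.2f}")   # 0.75
print(f"process accuracy: {process_acc:.2f}")  # 0.25
```

On this toy set, three of four solutions reach the right answer but only one does so through sound steps, which is the kind of discrepancy an outcome-only reward cannot see.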


Taxonomy

Topics: Topic Modeling · Mathematics, Computing, and Information Processing · Intelligent Tutoring Systems and Adaptive Learning