Beyond the First Error: Process Reward Models for Reflective Mathematical Reasoning

Zhaohui Yang; Chenghua He; Xiaowen Shi; Linjing Li; Qiyue Yin; Shihong Deng; Daxin Jiang

arXiv:2505.14391 · cs.AI · May 21, 2025

TL;DR

This paper introduces a novel annotation method for process reward models that better captures long chain-of-thought reasoning, including self-correction, leading to improved performance in mathematical reasoning tasks.

Contribution

The paper proposes a new data annotation technique for process reward models that accounts for self-correction and reflection in long reasoning chains, and uses it to train a PRM on 1.7 million samples.

Findings

01. The proposed PRM outperforms existing models on multiple metrics.

02. The annotation method improves data efficiency and performance.

03. The approach demonstrates stability and generalizability.

Abstract

Many studies focus on data annotation techniques for training effective PRMs. However, current methods encounter a significant issue when applied to long CoT reasoning processes: they tend to focus solely on the first incorrect step and all preceding steps, assuming that all subsequent steps are incorrect. These methods overlook the unique self-correction and reflection mechanisms inherent in long CoT, where correct reasoning steps may still occur after initial reasoning mistakes. To address this issue, we propose a novel data annotation method for PRMs specifically designed to score the long CoT reasoning process. Given that under the reflection pattern, correct and incorrect steps often alternate, we introduce the concepts of Error Propagation and Error Cessation, enhancing PRMs' ability to identify both effective self-correction behaviors and reasoning based on erroneous steps.…
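The contrast the abstract draws can be illustrated with a small sketch. This is not the authors' annotation pipeline. It is one hypothetical reading of the idea, assuming each step comes with a per-step correctness judgment (`step_is_correct`) and a list of the earlier steps it relies on (`builds_on`). Both inputs, and the function names `first_error_labels` and `reflective_labels`, are illustrative assumptions that do not appear in the paper.

```python
from typing import List


def first_error_labels(step_is_correct: List[bool]) -> List[int]:
    """First-error convention criticized in the abstract:
    every step from the first incorrect one onward is labeled 0."""
    labels = []
    seen_error = False
    for ok in step_is_correct:
        if not ok:
            seen_error = True
        labels.append(0 if seen_error else 1)
    return labels


def reflective_labels(
    step_is_correct: List[bool], builds_on: List[List[int]]
) -> List[int]:
    """Hypothetical labeling that lets correct and incorrect steps alternate.

    Assumed reading of the two concepts (not the paper's definitions):
    - error propagation: a step that relies on a 0-labeled step is labeled 0,
      even if it looks valid on its own;
    - error cessation: a correct step that relies only on 1-labeled steps
      (or on nothing) is labeled 1, even if it comes after an error.
    """
    labels: List[int] = []
    for i, ok in enumerate(step_is_correct):
        # builds_on[i] may only reference earlier steps (indices < i).
        inherits_error = any(labels[j] == 0 for j in builds_on[i])
        labels.append(1 if ok and not inherits_error else 0)
    return labels


# Toy chain: step 1 is a mistake, step 2 builds on it,
# step 3 reflects and restarts from step 0, step 4 continues from step 3.
step_is_correct = [True, False, True, True, True]
builds_on = [[], [0], [1], [0], [3]]

first_error = first_error_labels(step_is_correct)
reflective = reflective_labels(step_is_correct, builds_on)
```

On this toy chain the first-error convention yields `[1, 0, 0, 0, 0]`, discarding the self-correction in steps 3 and 4, while the dependency-aware sketch yields `[1, 0, 0, 1, 1]`. How the paper actually determines correctness and dependencies between steps is not stated in the abstract.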

Taxonomy

Topics: Intelligent Tutoring Systems and Adaptive Learning

Methods: Focus