The Lessons of Developing Process Reward Models in Mathematical Reasoning

Zhenru Zhang; Chujie Zheng; Yangzhen Wu; Beichen Zhang; Runji Lin; Bowen Yu; Dayiheng Liu; Jingren Zhou; Junyang Lin

arXiv:2501.07301 · cs.CL · June 6, 2025

TL;DR

This paper investigates the development of Process Reward Models for mathematical reasoning in LLMs, identifying challenges in data annotation and evaluation, and proposing improved methods that enhance performance and generalization.

Contribution

The paper introduces a consensus filtering mechanism for training data and a more comprehensive evaluation framework, improving PRM performance and data efficiency over MC estimation-based data synthesis.
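The summary above does not spell out how consensus filtering works. As a minimal sketch, assume each training sample carries step-level labels from two sources (MC estimation and an LLM judge) and that a sample is kept only when both sources agree on where the first error occurs. The field names `mc_labels` and `judge_labels` and both helper functions are hypothetical, chosen only for illustration.

```python
from typing import Dict, List, Optional, Sequence


def first_error_step(step_labels: Sequence[bool]) -> Optional[int]:
    """Return the index of the first step labeled incorrect, or None if every step is labeled correct."""
    for i, ok in enumerate(step_labels):
        if not ok:
            return i
    return None


def consensus_filter(samples: List[Dict]) -> List[Dict]:
    """Keep samples whose two label sources agree on the first error location.

    Assumption: agreement on the first erroneous step (or on 'no error')
    is the consensus criterion; the page itself does not state this.
    """
    return [
        s for s in samples
        if first_error_step(s["mc_labels"]) == first_error_step(s["judge_labels"])
    ]


samples = [
    {"id": "a", "mc_labels": [True, True, False], "judge_labels": [True, True, False]},
    {"id": "b", "mc_labels": [True, False, False], "judge_labels": [True, True, False]},
    {"id": "c", "mc_labels": [True, True], "judge_labels": [True, True]},
]
kept_ids = [s["id"] for s in consensus_filter(samples)]
```

In this toy data, sample `b` is dropped because the two sources place the first error at different steps. Filtering trades dataset size for label agreement, which is one plausible reading of the data-efficiency claim.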

Findings

01. MC estimation-based data synthesis underperforms compared to LLM-as-a-judge and human annotation.

02. Conventional BoN evaluation strategies can be biased, inflating scores and misaligning with PRM objectives.

03. The proposed methods achieve state-of-the-art PRM performance and provide practical guidelines for future research.
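To make the Best-of-N (BoN) setting concrete, here is a minimal sketch of how a PRM could rank N candidate responses. It assumes the PRM has already produced one score per reasoning step and that a response-level score is derived by aggregating them. The aggregation functions and the toy scores are illustrative assumptions, not the paper's evaluation code.

```python
from typing import Callable, List, Sequence


def aggregate_min(step_scores: Sequence[float]) -> float:
    """Score a response by its weakest step."""
    return min(step_scores)


def aggregate_last(step_scores: Sequence[float]) -> float:
    """Score a response by its final step only (outcome-like scoring)."""
    return step_scores[-1]


def best_of_n(
    candidates: List[Sequence[float]],
    aggregate: Callable[[Sequence[float]], float] = aggregate_min,
) -> int:
    """Return the index of the candidate with the highest aggregated score."""
    scores = [aggregate(c) for c in candidates]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical per-step PRM scores for three sampled responses.
candidates = [
    [0.9, 0.8, 0.2],
    [0.7, 0.7, 0.7],
    [0.95, 0.4, 0.9],
]
best_min = best_of_n(candidates, aggregate_min)
best_last = best_of_n(candidates, aggregate_last)
```

With these toy scores, the min rule picks the second response, while the last-step rule picks the third even though its middle step scores poorly. This gap is one way a response-level BoN metric could reward outcome-style scoring over step-level error detection.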

Abstract

Process Reward Models (PRMs) emerge as a promising approach for process supervision in mathematical reasoning of Large Language Models (LLMs), which aim to identify and mitigate intermediate errors in the reasoning processes. However, the development of effective PRMs faces significant challenges, particularly in data annotation and evaluation methodologies. In this paper, through extensive experiments, we demonstrate that commonly used Monte Carlo (MC) estimation-based data synthesis for PRMs typically yields inferior performance and generalization compared to LLM-as-a-judge and human annotation methods. MC estimation relies on completion models to evaluate current-step correctness, leading to inaccurate step verification. Furthermore, we identify potential biases in conventional Best-of-N (BoN) evaluation strategies for PRMs: (1) The unreliable policy models generate responses with…
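The MC estimation idea described in the abstract can be sketched as follows, under stated assumptions. A step is labeled by rolling out completions from the reasoning prefix that ends at that step and measuring how often they reach the correct final answer. The completion model here is a deterministic toy stub, and the "any rollout succeeds" hard-label rule is a common convention assumed for illustration rather than taken from this page.

```python
import itertools
from typing import Callable, Iterable, List


def mc_step_value(
    prefix_steps: List[str],
    complete: Callable[[List[str]], str],
    is_correct: Callable[[str], bool],
    n_rollouts: int = 4,
) -> float:
    """Estimate a step's value as the fraction of rollouts from this prefix that reach a correct answer."""
    hits = sum(is_correct(complete(prefix_steps)) for _ in range(n_rollouts))
    return hits / n_rollouts


def hard_label(value: float) -> bool:
    """Assumed rule: the step is labeled correct if any rollout succeeded."""
    return value > 0.0


def make_toy_completer(answers: Iterable[str]) -> Callable[[List[str]], str]:
    """Hypothetical stand-in for a completion model: cycles through fixed final answers."""
    it = itertools.cycle(answers)
    return lambda prefix: next(it)


complete = make_toy_completer(["42", "41", "42", "42"])
value = mc_step_value(
    prefix_steps=["Let x = 6.", "Then 7x = 42."],
    complete=complete,
    is_correct=lambda answer: answer == "42",
    n_rollouts=4,
)
label = hard_label(value)
```

The label depends entirely on what the completion model does after the step, not on the step itself. A flawed step can still be labeled correct if a rollout recovers, which matches the abstract's point about inaccurate step verification.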
