TL;DR
This paper investigates the development of Process Reward Models for mathematical reasoning in LLMs, identifying challenges in data annotation and evaluation, and proposing improved methods that enhance performance and generalization.
Contribution
The paper introduces a consensus filtering mechanism and a comprehensive evaluation framework, significantly improving PRM performance and data efficiency over traditional methods.
Findings
Monte Carlo (MC) estimation-based data synthesis underperforms LLM-as-a-judge and human annotation in both performance and generalization.
Conventional Best-of-N (BoN) evaluation strategies can be biased, inflating scores and misaligning with PRM objectives.
The proposed methods achieve state-of-the-art PRM performance and provide practical guidelines for future research.
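For readers unfamiliar with BoN evaluation, the following is a minimal sketch of how a PRM is typically used to pick one response out of N candidates. It is an illustration under stated assumptions, not the paper's evaluation code: `best_of_n`, `toy_score_steps`, the score table, and the choice of aggregation (minimum step score versus last step score) are all hypothetical.

```python
from typing import Callable, List, Sequence

def best_of_n(
    candidates: Sequence[List[str]],
    score_steps: Callable[[List[str]], List[float]],
    aggregate: Callable[[List[float]], float] = min,
) -> List[str]:
    """Return the candidate whose aggregated per-step score is highest.

    `score_steps` stands in for a PRM (hypothetical interface): it maps a
    list of reasoning steps to one score per step. `aggregate` collapses
    those step scores into a single response-level score.
    """
    return max(candidates, key=lambda steps: aggregate(score_steps(steps)))

# Toy PRM: a fixed lookup table of step scores (made-up numbers).
TOY_SCORES = {"a": 0.9, "bad": 0.1, "b": 0.8, "weak": 0.5, "c": 0.95}

def toy_score_steps(steps: List[str]) -> List[float]:
    return [TOY_SCORES[s] for s in steps]

candidates = [["a", "bad"], ["a", "b"], ["weak", "c"]]

# Aggregating by the weakest step penalizes any low-scoring intermediate step.
picked_by_min = best_of_n(candidates, toy_score_steps, aggregate=min)

# Aggregating by the last step ignores how the response got there.
picked_by_last = best_of_n(candidates, toy_score_steps, aggregate=lambda s: s[-1])
```

In this toy setup the two aggregations select different responses, which shows that a single BoN number depends on how step scores are collapsed, and that BoN only checks the final selection rather than whether each step was judged correctly.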
Abstract
Process Reward Models (PRMs) emerge as a promising approach for process supervision in mathematical reasoning of Large Language Models (LLMs), which aim to identify and mitigate intermediate errors in the reasoning processes. However, the development of effective PRMs faces significant challenges, particularly in data annotation and evaluation methodologies. In this paper, through extensive experiments, we demonstrate that commonly used Monte Carlo (MC) estimation-based data synthesis for PRMs typically yields inferior performance and generalization compared to LLM-as-a-judge and human annotation methods. MC estimation relies on completion models to evaluate current-step correctness, leading to inaccurate step verification. Furthermore, we identify potential biases in conventional Best-of-N (BoN) evaluation strategies for PRMs: (1) The unreliable policy models generate responses with…
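The MC estimation idea described in the abstract can be sketched roughly as follows: for each step prefix, a completion model rolls the solution out to a final answer several times, and the fraction of correct final answers is taken as that step's score. This is a minimal sketch under assumptions; `mc_step_scores`, the two toy completers, and the `"ERROR"` marker are hypothetical stand-ins, not the authors' implementation.

```python
from typing import Callable, List

def mc_step_scores(
    question: str,
    steps: List[str],
    complete: Callable[[str, List[str]], str],
    is_correct: Callable[[str], bool],
    num_rollouts: int = 8,
) -> List[float]:
    """Score each step by the success rate of rollouts that start from it.

    `complete` stands in for a completion model (hypothetical interface):
    given the question and a prefix of steps, it returns a final answer.
    """
    scores = []
    for i in range(1, len(steps) + 1):
        prefix = steps[:i]
        hits = sum(
            1 for _ in range(num_rollouts)
            if is_correct(complete(question, prefix))
        )
        scores.append(hits / num_rollouts)
    return scores

# Toy completer that always fails once the prefix contains a flawed step.
def strict_completer(question: str, prefix: List[str]) -> str:
    return "0" if any("ERROR" in s for s in prefix) else "42"

# Toy completer that recovers from the flaw and still reaches the answer.
def self_correcting_completer(question: str, prefix: List[str]) -> str:
    return "42"

steps = ["step 1", "step 2 ERROR", "step 3"]
is_correct = lambda answer: answer == "42"

strict_scores = mc_step_scores("q", steps, strict_completer, is_correct)
lenient_scores = mc_step_scores("q", steps, self_correcting_completer, is_correct)
```

With the self-correcting completer, the flawed second step receives the same score as a correct one. That is one way the score can reflect the completion model's ability to reach the answer rather than the correctness of the current step.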
