Unsupervised Process Reward Models
Artyom Gadetsky, Maxim Kodryan, Siba Smarak Panigrahi, Hang Guo, Maria Brbic

TL;DR
This paper introduces uPRM, an unsupervised method for training Process Reward Models (PRMs) that eliminates the need for costly human annotations. It uses a scoring function derived from language model next-token probabilities, and the resulting models improve error detection, answer verification, and reinforcement learning robustness.
Contribution
The authors propose a novel unsupervised training approach for PRMs using LLM-derived scoring functions, enabling scalable and effective reward modeling without human supervision.
Findings
uPRM improves accuracy in identifying the first erroneous step by up to 15% (absolute) over LLM-as-a-Judge.
Used as a verifier, uPRM performs comparably to supervised PRMs and surpasses majority voting by 6.9%.
As a reward signal in reinforcement learning, uPRM yields more robust policy optimization than PRMs trained with ground-truth supervision.
Abstract
Process Reward Models (PRMs) are a powerful mechanism for steering large language model reasoning by providing fine-grained, step-level supervision. However, this effectiveness comes at a significant cost: PRMs require expert annotations for every reasoning step, making them costly and difficult to scale. Here, we propose a method for training unsupervised PRMs (uPRM) that requires no human supervision, neither at the level of step-by-step annotations nor through ground-truth verification of final answers. The key idea behind our approach is to define a scoring function, derived from LLM next-token probabilities, that jointly assesses candidate positions of first erroneous steps across a batch of reasoning trajectories. We demonstrate the effectiveness of uPRM across diverse scenarios: (i) uPRM achieves up to 15% absolute accuracy improvements over the LLM-as-a-Judge in identifying…
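To make the two evaluation settings above concrete, the following is a minimal sketch of how step-level scores from any PRM are commonly used: locating the first erroneous step in a trajectory, and verifying final answers by score-weighted voting instead of plain majority voting. It does not reproduce the paper's scoring function or training procedure. The function names, the 0.5 threshold, and the choice of the minimum step score as the trajectory score are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def first_error_step(step_scores, threshold=0.5):
    """Return the index of the first step scored below `threshold`, or -1.

    `step_scores` is a list of per-step correctness scores in [0, 1].
    The threshold value is an illustrative assumption.
    """
    for index, score in enumerate(step_scores):
        if score < threshold:
            return index
    return -1


def majority_vote(answers):
    """Baseline: pick the most frequent final answer among the samples."""
    return Counter(answers).most_common(1)[0][0]


def prm_weighted_vote(answers, step_scores_per_trajectory):
    """Pick the answer with the highest total trajectory score.

    Each trajectory is scored by its weakest step (min aggregation),
    a common convention that is assumed here, not taken from the paper.
    """
    totals = defaultdict(float)
    for answer, step_scores in zip(answers, step_scores_per_trajectory):
        totals[answer] += min(step_scores)
    return max(totals, key=totals.get)


# Hypothetical example: three sampled solutions to one problem.
answers = ["12", "12", "15"]
scores = [[0.9, 0.2], [0.8, 0.3], [0.9, 0.95]]

print(first_error_step(scores[0]))          # position of first low-scoring step
print(majority_vote(answers))               # frequency-based choice
print(prm_weighted_vote(answers, scores))   # score-weighted choice
```

In this toy example the two selection rules disagree: majority voting follows the two trajectories that share an answer, while the weighted vote favors the single trajectory whose steps all score highly.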
