TL;DR
BetaPRM is a distributional process reward model (PRM) that predicts both a step-level success probability and the reliability of that prediction, giving downstream methods more trustworthy step-level feedback and improving reasoning efficiency.
Contribution
The paper proposes BetaPRM, a distributional PRM that learns a reliability signal alongside each step reward, so that downstream methods can tell when a reward should be trusted and can adapt their computation accordingly.
Findings
BetaPRM improves Best-of-N reasoning accuracy across benchmarks.
ACA reduces token usage by up to 33.57% while increasing accuracy.
BetaPRM maintains effective error detection with reliability signals.
Abstract
Process Reward Models (PRMs) provide step-level feedback for reasoning, but current PRMs usually output only a single reward score for each step. Downstream methods must therefore treat imperfect step-level reward predictions as reliable decision signals, with no indication of when these predictions should be trusted. We propose BetaPRM, a distributional PRM that predicts both a step-level success probability and the reliability of that prediction. Given step-success supervision from Monte Carlo continuations, BetaPRM learns a Beta belief that explains the observed number of successful continuations through a Beta-Binomial likelihood, rather than regressing to the finite-sample success ratio as a point target. This learned reliability signal indicates when a step reward should be trusted, enabling downstream applications to distinguish reliable rewards from uncertain ones. As one…
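The Beta-Binomial objective described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `(alpha, beta)` parameterization of the predicted belief, and the use of the concentration `alpha + beta` as a reliability proxy are all assumptions made for illustration. Given `k` successful continuations out of `n` Monte Carlo rollouts for a step, the loss is the negative log of the Beta-Binomial probability of observing `k` under the predicted Beta belief.

```python
import math


def log_beta(a: float, b: float) -> float:
    """Log of the Beta function B(a, b), computed via log-gamma."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_binomial_nll(alpha: float, beta: float, k: int, n: int) -> float:
    """Negative log-likelihood of k successes in n rollouts under Beta(alpha, beta).

    alpha and beta stand in for the model's predicted belief parameters
    (hypothetical names; the paper's parameterization may differ).
    """
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    log_lik = log_choose + log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta)
    return -log_lik


def belief_summary(alpha: float, beta: float) -> tuple[float, float]:
    """Return (mean success probability, concentration) of a Beta belief.

    Treating the concentration alpha + beta as a reliability proxy is an
    assumption of this sketch, not a claim about the paper's definition.
    """
    return alpha / (alpha + beta), alpha + beta
```

For example, `belief_summary(6.0, 2.0)` returns a mean of 0.75 with concentration 8.0, while `belief_summary(60.0, 20.0)` returns the same mean with concentration 80.0. The two beliefs agree on the success probability but differ in how tightly they commit to it, which is the kind of distinction a point-target regression on the ratio `k / n` cannot express.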
