Unleashing Implicit Rewards: Prefix-Value Learning for Distribution-Level Optimization
Shiping Gao, Hongzhan Chen, Xiaojun Quan, Qifan Wang, Lifu Huang

TL;DR
This paper introduces IPVRM, a reward model that learns prefix-conditioned value functions so that token-level reward signals are reliable enough for distribution-level optimization, which in turn improves reasoning accuracy.
Contribution
The paper proposes IPVRM to address the train-inference mismatch in implicit reward models, and combines it with Distribution-Level RL (DistRL) to improve reasoning performance.
Findings
IPVRM significantly improves step-verification F1 on ProcessBench.
Distribution-Level RL benefits from IPVRM's calibrated prefix values.
Combining IPVRM with DistRL enhances downstream reasoning accuracy.
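For readers unfamiliar with the metric, here is a minimal sketch of how a step-verification F1 of this kind is typically computed. It assumes the ProcessBench convention, in which F1 is the harmonic mean of the accuracy on samples that contain an erroneous step and the accuracy on fully correct samples. The function name and the input values are hypothetical and are not results from the paper.

```python
def processbench_style_f1(acc_erroneous: float, acc_correct: float) -> float:
    """Harmonic mean of two accuracies (assumed ProcessBench convention).

    acc_erroneous: accuracy on samples that contain an erroneous step.
    acc_correct:   accuracy on samples whose steps are all correct.
    """
    total = acc_erroneous + acc_correct
    if total == 0:
        return 0.0  # avoid division by zero when both accuracies are 0
    return 2 * acc_erroneous * acc_correct / total


# Illustrative inputs only, not numbers reported by the authors.
balanced = processbench_style_f1(0.5, 0.5)
lopsided = processbench_style_f1(1.0, 0.0)
```

Because the harmonic mean is dominated by the smaller term, a verifier that labels every solution as erroneous (or every solution as correct) scores zero, so the metric rewards models that are accurate on both kinds of sample.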
Abstract
Process reward models (PRMs) provide fine-grained reward signals along the reasoning process, but training reliable PRMs often requires step annotations or heavy verification pipelines, making them expensive to scale and refresh during online RL. Implicit PRMs mitigate this cost by learning decomposable token- or step-level rewards from trajectory-level outcome labels. However, they suffer from a train-inference mismatch: training only constrains a sequence-level aggregate, whereas inference requires token-level scores to reflect local step quality. As a result, token-level credits are weakly identified and may fail to faithfully reflect which reasoning steps are actually correct. This unreliability undermines a key promise of implicit PRMs: scoring many candidate tokens. In practice, noisy per-token advantages may systematically reinforce incorrect continuations. We address this…
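The train-inference mismatch described in the abstract can be illustrated with a small sketch. It assumes a simplified implicit PRM whose sequence-level logit is the sum of its token-level rewards, trained with binary cross-entropy on the outcome label. The function name and the credit values are hypothetical and are not taken from the paper.

```python
import math


def sequence_loss(token_rewards, label):
    """Binary cross-entropy on the summed token rewards.

    Only the sum enters the loss, which mirrors training that constrains
    a sequence-level aggregate from an outcome label alone.
    """
    logit = sum(token_rewards)
    prob = 1.0 / (1.0 + math.exp(-logit))
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


# Hypothetical 4-token trajectory whose final answer is wrong (label = 0).
# Assignment A blames the third token; assignment B blames the first.
credits_a = [0.5, 0.5, -3.0, 0.0]
credits_b = [-3.0, 0.5, 0.5, 0.0]

loss_a = sequence_loss(credits_a, label=0)
loss_b = sequence_loss(credits_b, label=0)

# Both assignments sum to -2.0, so the training loss cannot tell them apart.
print(loss_a, loss_b)
```

Under this assumption the two assignments receive the same loss even though they disagree about which token is at fault, which is one way to read the abstract's statement that token-level credits are weakly identified.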
