Unleashing Implicit Rewards: Prefix-Value Learning for Distribution-Level Optimization

Shiping Gao; Hongzhan Chen; Xiaojun Quan; Qifan Wang; Lifu Huang

arXiv:2604.13197·cs.CL·April 16, 2026

TL;DR

This paper introduces IPVRM, a model that learns prefix-conditioned value functions to give distribution-level optimization more reliable token-level reward signals, improving reasoning accuracy.

Contribution

The paper proposes IPVRM to address the train-inference mismatch in implicit reward models and combines it with Distribution-Level RL (DistRL) to improve reasoning performance.

Findings

01. IPVRM significantly improves step-verification F1 on ProcessBench.

02. Distribution-Level RL benefits from IPVRM's calibrated prefix values.

03. Combining IPVRM with DistRL enhances downstream reasoning accuracy.

Abstract

Process reward models (PRMs) provide fine-grained reward signals along the reasoning process, but training reliable PRMs often requires step annotations or heavy verification pipelines, making them expensive to scale and refresh during online RL. Implicit PRMs mitigate this cost by learning decomposable token- or step-level rewards from trajectory-level outcome labels. However, they suffer from a train-inference mismatch: training only constrains a sequence-level aggregate, whereas inference requires token-level scores to reflect local step quality. As a result, token-level credits are weakly identified and may fail to faithfully reflect which reasoning steps are actually correct. This unreliability undermines a key promise of implicit PRMs: scoring many candidate tokens. In practice, noisy per-token advantages may systematically reinforce incorrect continuations. We address this…
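
The train-inference mismatch described in the abstract can be illustrated with a toy example. The sketch below is not the paper's method. It assumes, purely for illustration, that the sequence-level score is the sum of per-token rewards and that training applies a binary cross-entropy loss to the sigmoid of that sum. The function name `outcome_loss` and both reward vectors are hypothetical. Under these assumptions, two very different token-level credit assignments yield the same training loss, which is what "weakly identified" means here.

```python
import math


def outcome_loss(token_rewards, label):
    """Binary cross-entropy on the sigmoid of the summed token rewards.

    Assumption: the trajectory-level score is the plain sum of per-token
    rewards. Only this aggregate is compared against the outcome label.
    """
    total = sum(token_rewards)
    p = 1.0 / (1.0 + math.exp(-total))
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


# Two hypothetical per-token reward assignments for the same 4-token
# trajectory. Both sum to 2.0, but they disagree on individual tokens.
credits_a = [0.5, 0.5, 0.5, 0.5]    # credit spread evenly
credits_b = [3.0, -2.0, 1.5, -0.5]  # second and fourth tokens penalized

loss_a = outcome_loss(credits_a, label=1)
loss_b = outcome_loss(credits_b, label=1)

# The outcome-level loss cannot tell the two assignments apart.
print("same loss:", math.isclose(loss_a, loss_b))
# Yet they disagree on the sign of the second token's credit.
print("token 2 credit:", credits_a[1], "vs", credits_b[1])
```

Because the loss depends only on the sum, nothing in this toy objective favors one assignment over the other, so per-token advantages derived from either could reinforce an incorrect step.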
