Reward Models Are Secretly Value Functions: Temporally Coherent Reward Modeling
Alex Nikulkov

TL;DR
This paper introduces Temporally Coherent Reward Modeling (TCRM), a method that aligns reward models with value functions, improving interpretability and efficiency in reinforcement learning from human feedback.
Contribution
TCRM adds regularization to reward models to make token-level outputs represent conditional expectations, connecting reward modeling with RL value functions without changing architecture or data.
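The conditional-expectation property can be written out explicitly. The notation below is an assumption introduced for illustration (it does not appear in the summary): y_{1:t} is the response prefix up to token t, r_t is the reward model's output at token t, and r_T is its output at the final token. The second line is the standard fact that a squared-error regression target is minimized by the conditional expectation. Whether TCRM's regularizers take exactly this squared-error form is not stated here.

```latex
% Assumed notation: y_{1:t} = response prefix, r_t = token-level output, r_T = final reward.
\[
  r_t \;=\; \mathbb{E}\!\left[\, r_T \;\middle|\; y_{1:t} \,\right],
  \qquad t = 1, \dots, T .
\]
% Standard result: the minimizer of a squared-error objective is the conditional expectation.
\[
  \arg\min_{f}\; \mathbb{E}\!\left[ \bigl( f(y_{1:t}) - r_T \bigr)^{2} \right]
  \;=\; \mathbb{E}\!\left[\, r_T \;\middle|\; y_{1:t} \,\right].
\]
```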
Findings
Token-level reward accuracy improved from 50% to 88.9%.
Achieved state-of-the-art process reward model (PRM) performance on ProcessBench.
Reduced GPU memory use by 27% and step time by 19% in PPO.
Abstract
Reward models in RLHF are trained to score only the final token of a response, a choice that discards rich signal from every intermediate position and produces models whose token-level outputs are noise. We argue this is a missed opportunity: a well-trained reward model's output at any token should represent the conditional expectation of the final reward given the response so far. We introduce Temporally Coherent Reward Modeling (TCRM), which induces this property via two regularization terms on top of the standard Bradley-Terry loss, with minimizers provably equal to conditional expectations. The regularizers correspond to Monte Carlo and TD value-learning objectives, establishing a direct connection to RL value functions. TCRM requires zero changes to architecture, data, or inference, yet unlocks three capabilities from one principle: interpretable token-level reward trajectories…
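As a rough illustration of how a Bradley-Terry loss could be combined with Monte Carlo and TD style regularizers, here is a minimal NumPy sketch. It is not the paper's implementation: the squared-error form of both regularizers, the function names, and the weights `lam_mc` and `lam_td` are all assumptions made for this example. In a real training loop the regression targets would be treated as constants (stop-gradient); NumPy has no autograd, so that is only noted in comments.

```python
import numpy as np

def bradley_terry_loss(r_chosen_final, r_rejected_final):
    # -log(sigmoid(d)) == log(1 + exp(-d)); logaddexp keeps this numerically stable.
    d = r_chosen_final - r_rejected_final
    return float(np.logaddexp(0.0, -d))

def mc_regularizer(token_rewards):
    # Monte Carlo style (assumed form): pull every intermediate output
    # toward the final-token output, treated as a fixed target.
    target = token_rewards[-1]
    return float(np.mean((token_rewards[:-1] - target) ** 2))

def td_regularizer(token_rewards):
    # TD style (assumed form): pull each output toward the next token's
    # output, treated as a fixed bootstrap target.
    return float(np.mean((token_rewards[:-1] - token_rewards[1:]) ** 2))

def tcrm_style_loss(chosen, rejected, lam_mc=0.1, lam_td=0.1):
    # chosen / rejected: 1-D arrays of per-token reward outputs (length >= 2).
    # lam_mc and lam_td are hypothetical weights, not values from the paper.
    bt = bradley_terry_loss(chosen[-1], rejected[-1])
    mc = 0.5 * (mc_regularizer(chosen) + mc_regularizer(rejected))
    td = 0.5 * (td_regularizer(chosen) + td_regularizer(rejected))
    return bt + lam_mc * mc + lam_td * td
```

In this sketch only the final-token outputs enter the preference term, so the regularizers are the only source of training signal for intermediate positions.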
