Reward Models Are Secretly Value Functions: Temporally Coherent Reward Modeling

Alex Nikulkov

arXiv:2604.22981 · cs.LG · April 28, 2026

TL;DR

This paper introduces Temporally Coherent Reward Modeling (TCRM), a method that aligns reward models with value functions, improving interpretability and efficiency in reinforcement learning from human feedback.

Contribution

TCRM adds regularization to reward models to make token-level outputs represent conditional expectations, connecting reward modeling with RL value functions without changing architecture or data.

Findings

01 Token-level reward accuracy improved from 50% to 88.9%.

02 Achieved state-of-the-art process reward model (PRM) performance on ProcessBench.

03 Reduced GPU memory usage by 27% and step time by 19% in PPO.

Abstract

Reward models in RLHF are trained to score only the final token of a response, a choice that discards rich signal from every intermediate position and produces models whose token-level outputs are noise. We argue this is a missed opportunity: a well-trained reward model's output at any token should represent the conditional expectation of the final reward given the response so far. We introduce Temporally Coherent Reward Modeling (TCRM), which induces this property via two regularization terms on top of the standard Bradley-Terry loss, with minimizers provably equal to conditional expectations. The regularizers correspond to Monte Carlo and TD value-learning objectives, establishing a direct connection to RL value functions. TCRM requires zero changes to architecture, data, or inference, yet unlocks three capabilities from one principle: interpretable token-level reward trajectories…
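
The abstract names the ingredients (a Bradley-Terry loss plus Monte Carlo and TD regularizers) but not their exact form, so the following is only a sketch under stated assumptions, not the paper's implementation. It assumes the Monte Carlo term regresses each intermediate token output toward the final-token reward and the TD term regresses each output toward the next token's output, both with squared error. The function names and the weights `lambda_mc` and `lambda_td` are hypothetical.

```python
import math


def bradley_terry_loss(chosen_final, rejected_final):
    """Standard pairwise loss: -log(sigmoid(chosen_final - rejected_final))."""
    diff = chosen_final - rejected_final
    # Numerically stable softplus(-diff).
    return math.log1p(math.exp(-abs(diff))) + max(-diff, 0.0)


def mc_regularizer(token_rewards):
    """ASSUMED Monte Carlo form: pull every earlier output toward the final reward."""
    final = token_rewards[-1]
    prefix = token_rewards[:-1]
    if not prefix:
        return 0.0
    return sum((r - final) ** 2 for r in prefix) / len(prefix)


def td_regularizer(token_rewards):
    """ASSUMED TD form: pull each output toward the output at the next token."""
    pairs = list(zip(token_rewards[:-1], token_rewards[1:]))
    if not pairs:
        return 0.0
    return sum((r - nxt) ** 2 for r, nxt in pairs) / len(pairs)


def tcrm_style_loss(chosen, rejected, lambda_mc=0.1, lambda_td=0.1):
    """Bradley-Terry on final tokens plus the two assumed regularizers.

    `chosen` and `rejected` are lists of per-token reward outputs for the
    preferred and dispreferred responses. The weights are hypothetical.
    """
    bt = bradley_terry_loss(chosen[-1], rejected[-1])
    mc = 0.5 * (mc_regularizer(chosen) + mc_regularizer(rejected))
    td = 0.5 * (td_regularizer(chosen) + td_regularizer(rejected))
    return bt + lambda_mc * mc + lambda_td * td


if __name__ == "__main__":
    chosen = [0.1, 0.4, 0.9, 1.2]       # made-up per-token outputs
    rejected = [0.0, -0.3, -0.2, -0.8]  # made-up per-token outputs
    print(tcrm_style_loss(chosen, rejected))
```

Squared error is used here because the minimizer of an expected squared error is the conditional mean, which fits the abstract's claim that the minimizers are conditional expectations. In a gradient-based setup the regression targets would presumably be held fixed (stop-gradient), which this plain-Python sketch does not model.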
