Intra-Trajectory Consistency for Reward Modeling
Chaoyang Zhou, Shunyu Liu, Zengmao Wang, Di Wang, Rong-Cheng Tu, Bo Du, Dacheng Tao

TL;DR
This paper introduces intra-trajectory consistency regularization for reward models in LLMs, enhancing fine-grained reward learning and improving policy alignment and inference verification.
Contribution
It proposes a novel intra-trajectory consistency regularization method that propagates supervision signals across response processes for better reward modeling.
Findings
Improves reward model performance on RewardBench
Leads to better DPO-aligned policies
Enhances inference-time verification results
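Inference-time verification with a reward model is commonly done as best-of-N selection: sample several candidate responses, score each one, and keep the highest-scoring candidate. The sketch below illustrates that general idea only; it is not taken from the paper. The names `best_of_n` and `toy_reward` are hypothetical, and `toy_reward` stands in for a trained reward model.

```python
def best_of_n(candidates, reward_fn):
    """Return the candidate with the highest reward under reward_fn.

    candidates: non-empty list of response strings.
    reward_fn: callable mapping a response to a scalar score
               (hypothetical stand-in for a trained reward model).
    """
    if not candidates:
        raise ValueError("need at least one candidate response")
    return max(candidates, key=reward_fn)


# Toy scores used only for illustration; a real verifier would call the reward model.
TOY_SCORES = {"answer A": 0.2, "answer B": 0.9, "answer C": 0.5}


def toy_reward(response):
    return TOY_SCORES[response]


selected = best_of_n(list(TOY_SCORES), toy_reward)
```

Under this setup, a reward model that generalizes better to unseen responses picks better candidates, which is how a verification gain would show up.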
Abstract
Reward models are critical for improving large language models (LLMs), particularly in reinforcement learning from human feedback (RLHF) or inference-time verification. Current reward modeling typically relies on scores of overall responses to learn the outcome rewards for the responses. However, since the response-level scores are coarse-grained supervision signals, the reward model struggles to identify the specific components within a response trajectory that truly correlate with the scores, leading to poor generalization on unseen responses. In this paper, we propose to leverage generation probabilities to establish reward consistency between processes in the response trajectory, which allows the response-level supervisory signal to propagate across processes, thereby providing additional fine-grained signals for reward learning. Building on analysis under the Bayesian framework, we…
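The abstract is cut off before the method is stated, so the following is only a minimal sketch of the general idea, under assumptions that are not from the paper: each response has a reward for every prefix, adjacent prefixes are tied together by a squared-difference penalty weighted by the generator's next-token probability, and the penalty is added to a Bradley-Terry (BT) outcome loss with a hypothetical weight `lam`. The function names and the squared-difference form are illustrative choices.

```python
import math


def bt_loss(r_chosen, r_rejected):
    """Bradley-Terry pairwise loss, -log(sigmoid(r_chosen - r_rejected)), computed stably."""
    margin = r_chosen - r_rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def consistency_penalty(prefix_rewards, next_token_probs):
    """Probability-weighted mean squared gap between rewards of adjacent prefixes.

    prefix_rewards[t]: reward assigned to the prefix ending at token t.
    next_token_probs[t]: generator probability of token t+1 given that prefix.
    ASSUMPTION: this squared-difference form is illustrative, not the paper's.
    """
    if len(next_token_probs) != len(prefix_rewards) - 1:
        raise ValueError("need exactly one probability per adjacent prefix pair")
    if not next_token_probs:
        return 0.0
    total = 0.0
    for r_prev, r_next, p in zip(prefix_rewards, prefix_rewards[1:], next_token_probs):
        # A likely next token pulls the two prefix rewards closer together.
        total += p * (r_next - r_prev) ** 2
    return total / len(next_token_probs)


def total_loss(chosen, rejected, lam=0.1):
    """Outcome loss on the final rewards plus the weighted consistency term (lam is hypothetical)."""
    outcome = bt_loss(chosen["rewards"][-1], rejected["rewards"][-1])
    reg = (consistency_penalty(chosen["rewards"], chosen["probs"])
           + consistency_penalty(rejected["rewards"], rejected["probs"]))
    return outcome + lam * reg
```

In this sketch the response-level label only enters through `bt_loss` on the final reward; the penalty is what lets that signal reach intermediate prefixes.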
Peer Reviews
Decision·Submitted to ICLR 2026
- It supplements the standard BT outcome loss; no extra labels are needed, just next-token probabilities from a (frozen) generator.
- With only response-level labels, ICRM approaches process-supervised PRMs and even boosts a process-reward model when combined.
- Improvements hold across reward model benchmarks, RLHF policies, and inference-time verification, and extend to code generation.
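The next-token probabilities mentioned in this review can be read off a generator's output logits without any extra labels. The sketch below shows one way to do that in plain Python, assuming `logits[t]` holds the vocabulary scores for predicting token `t+1`; the function name and the toy inputs are hypothetical, and in practice the logits would come from a forward pass of the frozen generator.

```python
import math


def next_token_probs(logits, token_ids):
    """Probability the generator assigned to each realized next token.

    logits[t]: list of vocabulary scores for the token that follows position t.
    token_ids: the token ids of the response.
    Returns one probability per adjacent pair (len(token_ids) - 1 values).
    """
    probs = []
    for t in range(len(token_ids) - 1):
        row = logits[t]
        peak = max(row)  # subtract the max for a numerically stable softmax
        exps = [math.exp(score - peak) for score in row]
        probs.append(exps[token_ids[t + 1]] / sum(exps))
    return probs


# Toy two-token vocabulary, for illustration only.
toy_logits = [[0.0, 0.0], [0.0, math.log(3.0)]]
toy_tokens = [0, 1, 1]
toy_probs = next_token_probs(toy_logits, toy_tokens)
```

Because the generator is frozen, these probabilities can be computed once per response and cached rather than recomputed at every training step.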
[minor weakness]
- Table 2, reasoning section: the bold is misplaced (Classifier + label smooth shows higher performance).
- Figure 2: the two methods are not distinguishable; using different colors would make them easier to compare.
- Figure 3: the heatmap is missing a color bar showing the scale, and a denser color scale would show the differences more clearly.
[weakness]
- You weight consistency by the model’s next-token probability, but the generator is not guaranteed to be
The primary strength of this paper is its novel, intuitive, and highly practical regularization method. By linking reward consistency to generator probabilities, it offers a good approach to inject fine-grained learning signals from coarse-grained data, presenting a significant practical advantage over methods that rely on labor-intensive, step-wise human annotations. This core contribution is supported by rigorous experimental evaluation. The authors convincingly demonstrate that improvements o
Despite its strengths, the paper possesses several weaknesses that should be addressed: 1) The evaluation lacks sufficient comparison to RL-based methods. Discussions of reward modeling usually center on PPO-like or GRPO-like algorithms; thus, the paper would be strengthened by presenting extensive results comparing ICRM-enhanced DPO against these methods on more benchmark datasets. Furthermore, a key motivation for DPO is its simplicity relative to complex RL-based pipelin
1. The theoretical analysis effectively supports the arguments.
2. Introducing token-wise information during the reward model training phase demonstrates innovation.
1. The additional computational overhead introduced during training is non-negligible. Training a 2B reward model along with a 2B generator should be compared with an ablation study involving training a standalone 3B–4B reward model for a more appropriate evaluation.
2. Generator mismatch is a common issue. On one hand, during RLHF, the reward model size may be significantly smaller than the actor model. On the other hand, the distribution of the actor model can shift considerably as training pr
Taxonomy
Topics: Emotion and Mood Recognition · Explainable Artificial Intelligence (XAI) · Reinforcement Learning in Robotics
