Intra-Trajectory Consistency for Reward Modeling

Chaoyang Zhou; Shunyu Liu; Zengmao Wang; Di Wang; Rong-Cheng Tu; Bo Du; Dacheng Tao

arXiv:2506.09096·cs.LG·June 17, 2025


Open Access · 3 Reviews

TL;DR

This paper introduces intra-trajectory consistency regularization for reward models in LLMs, enhancing fine-grained reward learning and improving policy alignment and inference verification.

Contribution

It proposes a novel intra-trajectory consistency regularization method that propagates supervision signals across response processes for better reward modeling.

Findings

01 Improves reward model performance on RewardBench

02 Leads to better DPO-aligned policies

03 Enhances inference-time verification results

Abstract

Reward models are critical for improving large language models (LLMs), particularly in reinforcement learning from human feedback (RLHF) or inference-time verification. Current reward modeling typically relies on scores of overall responses to learn the outcome rewards for the responses. However, since the response-level scores are coarse-grained supervision signals, the reward model struggles to identify the specific components within a response trajectory that truly correlate with the scores, leading to poor generalization on unseen responses. In this paper, we propose to leverage generation probabilities to establish reward consistency between processes in the response trajectory, which allows the response-level supervisory signal to propagate across processes, thereby providing additional fine-grained signals for reward learning. Building on analysis under the Bayesian framework, we…
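The idea in the abstract can be made concrete with a small sketch. The abstract is truncated before it states the actual objective, so everything below is an assumption for illustration, not the authors' formulation: the outcome term is the standard Bradley-Terry pairwise loss, and the hypothetical `consistency_penalty` pulls the rewards of adjacent prefixes together more strongly when the generator assigns a high probability to the token that joins them. The function names, the squared-gap form, and the weight `lam` are all invented here.

```python
import math


def bt_loss(r_chosen: float, r_rejected: float) -> float:
    """Standard Bradley-Terry outcome loss: -log(sigmoid(r_chosen - r_rejected))."""
    margin = r_chosen - r_rejected
    # Numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def consistency_penalty(prefix_rewards, next_token_probs):
    """Hypothetical regularizer (an assumption, not the paper's formula).

    prefix_rewards[t] is the reward model's score for the first t+1 tokens;
    next_token_probs[t] is the frozen generator's probability of the token
    that extends prefix t into prefix t+1. Adjacent prefixes joined by a
    high-probability token are pushed toward similar rewards.
    """
    if len(prefix_rewards) != len(next_token_probs) + 1:
        raise ValueError("need exactly one probability per adjacent prefix pair")
    pairs = zip(prefix_rewards, prefix_rewards[1:], next_token_probs)
    total = sum(p * (nxt - cur) ** 2 for cur, nxt, p in pairs)
    return total / max(len(next_token_probs), 1)


def total_loss(chosen_rewards, rejected_rewards,
               chosen_probs, rejected_probs, lam=0.1):
    """Outcome loss on the full responses plus the weighted consistency term."""
    outcome = bt_loss(chosen_rewards[-1], rejected_rewards[-1])
    reg = (consistency_penalty(chosen_rewards, chosen_probs)
           + consistency_penalty(rejected_rewards, rejected_probs))
    return outcome + lam * reg
```

Only the last prefix reward of each response enters the outcome term, so the response-level label still drives learning; the regularizer is what would let that signal reach earlier prefixes without step-level annotations.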

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 4

Strengths

- It supplements the standard BT outcome loss; no extra labels are needed, just next-token probabilities from a (frozen) generator.
- With only response-level labels, ICRM approaches process-supervised PRMs and even boosts a process-reward model when combined.
- Improvements hold across reward model benchmarks, RLHF policies, and inference-time verification, and extend to code generation.

Weaknesses

[minor weakness]
- Table 2, reasoning section: the bold is misplaced (Classifier + label smooth shows higher performance).
- Figure 2: the two methods are not distinguishable; different colors would make the comparison easier.
- Figure 3: the heatmap is missing a color bar showing the scale, and a denser color scale would show the differences more clearly.
[weakness]
- You weight consistency by the model’s next-token probability, but the generator is not guaranteed to be

Reviewer 02 · Rating 4 · Confidence 4

Strengths

The primary strength of this paper is its novel, intuitive, and highly practical regularization method. By linking reward consistency to generator probabilities, it offers a good approach to inject fine-grained learning signals from coarse-grained data, presenting a significant practical advantage over methods that rely on labor-intensive, step-wise human annotations. This core contribution is supported by rigorous experimental evaluation. The authors convincingly demonstrate that improvements o

Weaknesses

Despite its strengths, the paper has several weaknesses that should be addressed: 1) The evaluation lacks sufficient comparison to RL-based methods. Discussions of reward modeling usually center on PPO-like or GRPO-like algorithms; thus, the paper would be strengthened by presenting extensive results comparing ICRM-enhanced DPO against these methods on more benchmark datasets. Furthermore, a key motivation for DPO is its simplicity relative to complex RL-based pipelin

Reviewer 03 · Rating 2 · Confidence 4

Strengths

1. The theoretical analysis effectively supports the arguments.
2. Introducing token-wise information during the reward model training phase demonstrates innovation.

Weaknesses

1. The additional computational overhead introduced during training is non-negligible. Training a 2B reward model along with a 2B generator should be compared with an ablation study involving training a standalone 3B–4B reward model for a more appropriate evaluation.
2. Generator mismatch is a common issue. On one hand, during RLHF, the reward model size may be significantly smaller than the actor model. On the other hand, the distribution of the actor model can shift considerably as training pr


Taxonomy

Topics: Emotion and Mood Recognition · Explainable Artificial Intelligence (XAI) · Reinforcement Learning in Robotics