Pretrain Value, Not Reward: Decoupled Value Policy Optimization
Chenghua Huang, Lu Wang, Fangkai Yang, Pu Zhao, Zhixu Li, Qingwei Lin, Dongmei Zhang, Saravan Rajmohan, Qi Zhang

TL;DR
This paper proposes a novel reinforcement learning framework called Decoupled Value Policy Optimization (DVPO) that pretrains a global value model offline, simplifying and stabilizing RLHF by eliminating the need for online critic training.
Contribution
The paper introduces DVPO, a method that pretrains a universal value model offline to guide policy learning, reducing complexity and improving stability in RLHF.
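To make "a pretrained value model guides policy learning" concrete, here is a minimal sketch of a policy-only update signal. It is not the authors' implementation. This summary does not say how DVPO turns value estimates into advantages, so the one-step value difference below is an assumption, as are the function names `advantages_from_frozen_values` and `clipped_surrogate`. The clipped objective is the standard PPO surrogate, used here only because PPO is the baseline the paper compares against.

```python
import math
from typing import List


def advantages_from_frozen_values(values: List[float]) -> List[float]:
    """Assumed advantage: change in the frozen value estimate caused by each token.

    values[t] is the (frozen, pretrained) value estimate before token t is emitted,
    and values[-1] is the estimate after the last token, so len(values) == T + 1.
    """
    return [values[t + 1] - values[t] for t in range(len(values) - 1)]


def clipped_surrogate(new_logp: List[float], old_logp: List[float],
                      adv: List[float], eps: float = 0.2) -> float:
    """Standard PPO clipped surrogate, averaged over tokens (to be maximized)."""
    total = 0.0
    for lp_new, lp_old, a in zip(new_logp, old_logp, adv):
        ratio = math.exp(lp_new - lp_old)           # importance ratio pi_new / pi_old
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
        total += min(ratio * a, clipped * a)        # pessimistic of the two terms
    return total / len(adv)


# Hypothetical value estimates for a 2-token continuation; no critic is trained here.
frozen_values = [0.25, 0.5, 1.0]
adv = advantages_from_frozen_values(frozen_values)
objective = clipped_surrogate([0.0, 0.0], [0.0, 0.0], adv)
```

Because the value estimates come from a fixed model, only the policy's log-probabilities change between updates, which is the sense in which online critic training is removed.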
Findings
DVPO matches or surpasses state-of-the-art RLHF methods.
Pretraining a global value model simplifies reinforcement learning.
The approach reduces critic drift and trajectory sampling issues.
Abstract
In this paper, we explore how directly pretraining a value model simplifies and stabilizes reinforcement learning from human feedback (RLHF). In reinforcement learning, value estimation is the key to policy optimization, distinct from reward supervision. The value function predicts the return-to-go of a partial answer, that is, how promising the partial answer is if it were continued to completion. In RLHF, however, the standard pipeline first pretrains a reward model and then learns a value function online, even though no new reward signals are available once preference data is collected. This makes critic learning redundant, as the process of training a reward model and then deriving a value model is informationally equivalent to directly pretraining a value model. Importantly, this requires no additional supervision, and our value model is trained on exactly the same data used…
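The return-to-go mentioned in the abstract can be illustrated with a short sketch. It assumes a sequence-level reward that arrives only at the final token and an optional discount factor `gamma`. Both are illustrative choices, not details taken from the paper, and `return_to_go` is a hypothetical helper name.

```python
from typing import List


def return_to_go(rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Return G_t = sum over k >= t of gamma**(k - t) * rewards[k], for every t."""
    targets = [0.0] * len(rewards)
    running = 0.0
    # Sweep backwards so each position reuses the return of the position after it.
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        targets[t] = running
    return targets


# Hypothetical 4-token answer whose only reward is given at the last token.
sparse_rewards = [0.0, 0.0, 0.0, 1.0]
targets = return_to_go(sparse_rewards)
```

With no discounting, every prefix of the answer receives the same target as the final reward. A value model regressed on such targets scores how promising a partial answer is, which is the quantity the abstract describes.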
Peer Reviews
Decision: ICLR 2026 Poster
Clear, impactful simplification: Recasting RLHF as policy-only optimization with a pretrained value is elegant and practically meaningful. Solid theory–practice bridge: The equivalence lemma and convergence corollary directly justify the algorithmic design. Consistent empirical gains & breadth: Improvements across models/benchmarks in both base and instruction settings; strong compute-efficiency results. Token-level credit & interpretability: Concrete examples demonstrate fine-grained attributio
Fixed-feedback assumption (no new rewards). DVPO’s theory and setup hinge on no additional reward during training; if limited online human feedback arrives, the frozen GVM cannot adapt. Q: If limited online feedback becomes available, could you support periodic GVM refresh (semi-online DVPO) while retaining the stability guarantee, and what parts of Lemma 3.1 / the corollary would need to change? Scope of evaluations. Benchmarks are mainstream chat-style; tougher code/maths/long-horizon tasks m
1. **Clear presentation and theoretical rigor**: The paper is well-structured with clear insights in the introduction, particularly regarding the equivalence between training reward models and critic models based on fixed feedback. The visualization effectively depicts algorithmic differences between PPO and DVPO, and the theoretical analysis is well-organized. 2. **Novel algorithmic contribution**: The proposed alignment algorithm offers a compelling alternative to the conventional two-stage RL
1. **Unclear mechanism of performance gains**: The underlying reasons for DVPO outperforming baselines remain insufficiently explained. The value model trained on pre-collected preference feedback effectively learns $V^{\pi_B}$ (assuming responses are collected from some behavior policy $\pi_B$), which contradicts the paper's claim that this value model is a "global" one (definitely not $V^*$). Additionally, the paper attributes superiority over sentence-level reward-based methods to GVM providi
1. The paper is well-motivated and the equivalence between reward-then-critic pipelines and direct value pretraining is well-argued and convincingly formalized, highlighting under-recognized redundancy in the RLHF paradigm. 2. The framework is particularly interesting to the community as DVPO simplifies RLHF engineering by eliminating online critic training, reducing GPU memory use (40%) and time (35%), thus enabling larger models or faster iteration with fewer resources. 3. Experiments, ablat
1. The framework assumes the offline preference or reward dataset is broad enough for effective generalization. 2. In scenarios where additional human or environmental feedback can be injected mid-training, DVPO might not benefit from it.
Taxonomy
Topics: Complex Systems and Decision Making
Methods: Entropy Regularization · Proximal Policy Optimization
