VRPO: Rethinking Value Modeling for Robust RL Training under Noisy Supervision

Dingwei Zhu; Shihan Dou; Zhiheng Xi; Senjie Jin; Guoqiang Zhang; Jiazheng Zhang; Junjie Ye; Mingxu Chai; Enyu Zhou; Ming Zhang; Caishuang Huang; Yunke Zhang; Yuran Wang; Tao Gui

arXiv:2508.03058·cs.LG·August 6, 2025

TL;DR

This paper introduces VRPO, a value-centric framework that enhances reinforcement learning from human feedback by improving the value model's robustness to noisy supervision, leading to more stable and reliable policy training.

Contribution

VRPO strengthens the value model with two additions: an auxiliary loss guided by entropy and perplexity from a frozen language model, and a variational information bottleneck. Together they address noise in reward signals during policy optimization.
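To make the variational information bottleneck idea concrete, here is a minimal sketch of the generic technique, not the authors' implementation. It assumes a diagonal Gaussian encoder with a standard normal prior, a squared-error value loss, and a weight `beta`; the function names and the value of `beta` are hypothetical and do not come from the paper.

```python
import math
import random


def vib_sample_and_kl(mu, log_var, rng):
    """Draw z ~ N(mu, diag(exp(log_var))) and compute KL to N(0, I).

    mu and log_var are equal-length lists of floats that a (hypothetical)
    encoder would produce from the value model's hidden state.
    """
    if len(mu) != len(log_var):
        raise ValueError("mu and log_var must have the same length")
    # Reparameterization: z = mu + sigma * eps, with eps ~ N(0, 1).
    z = [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
         for m, lv in zip(mu, log_var)]
    # Closed-form KL between a diagonal Gaussian and the standard normal.
    kl = 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                   for m, lv in zip(mu, log_var))
    return z, kl


def value_loss_with_vib(value_pred, value_target, kl, beta=1e-3):
    """Squared-error value loss plus a beta-weighted bottleneck penalty.

    beta is an assumed hyperparameter; the paper's setting is not given here.
    """
    return (value_pred - value_target) ** 2 + beta * kl


if __name__ == "__main__":
    rng = random.Random(0)
    z, kl = vib_sample_and_kl([1.0, 0.0], [0.0, 0.0], rng)
    print("latent sample:", z)
    print("KL term:", kl)
    print("loss:", value_loss_with_vib(1.0, 0.0, kl, beta=0.1))
```

The KL penalty pushes the latent code toward the prior, so the value head can keep only as much input information as the value loss justifies. That is one way a bottleneck can discourage fitting noise.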

Findings

01 VRPO outperforms PPO and GRPO in noisy environments.

02 The value model effectively filters noise and captures key contextual information.

03 Experiments demonstrate improved stability and generalization in RLHF tasks.

Abstract

Reinforcement Learning from Human Feedback (RLHF) often suffers from noisy or imperfect reward supervision in real-world settings, which undermines policy stability and generalization. Such noise may cause models to lose attention on key words during advantage estimation. While prior work focuses on reward denoising or filtering poor data, it often overlooks the critical role of the value model in policy optimization. In this work, we show that a strong value model is essential for mitigating noise by absorbing unstable signals and enabling more reliable advantage estimation. We propose VRPO, a value-centric framework for robust PPO training under noisy supervision. VRPO combines two core designs: (1) an auxiliary loss guided by entropy and perplexity from a frozen language model, and (2) a variational information bottleneck. These mechanisms enhance the value model's ability to filter…
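The abstract's point that the value model shapes advantage estimation can be illustrated with generalized advantage estimation (GAE), the estimator commonly paired with PPO. This is a sketch under assumptions: the abstract does not say which estimator VRPO uses, and the function name, `gamma`, and `lam` defaults below are illustrative rather than taken from the paper.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one trajectory.

    rewards: per-step rewards r_0 .. r_{T-1}.
    values:  value estimates V(s_0) .. V(s_T); the extra last entry is the
             bootstrap value (use 0.0 if the episode ended).
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have one more entry than rewards")
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Work backwards so each step reuses the discounted sum of later residuals.
    for t in reversed(range(len(rewards))):
        # TD residual: how much better the step was than the value model expected.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


if __name__ == "__main__":
    # A sparse terminal reward, as is typical when a reward model scores a full response.
    rewards = [0.0, 0.0, 1.0]
    flat_values = [0.0, 0.0, 0.0, 0.0]
    print(gae_advantages(rewards, flat_values, gamma=1.0, lam=1.0))
```

Every advantage is built from the residuals `r_t + gamma * V(s_{t+1}) - V(s_t)`, so a value model that already predicts part of a noisy reward leaves a smaller residual to propagate into the policy update.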
