DVPO: Distributional Value Modeling-based Policy Optimization for LLM Post-Training
Dingwei Zhu, Zhiheng Xi, Shihan Dou, Yuhui Wang, Sixian Li, Junjie Ye, Honglin Guo, Shichun Liu, Chenhao Huang, Yajie Yang, Junlin Shang, Senjie Jin, Ming Zhang, Jiazheng Zhang, Caishuang Huang, Yunke Zhang, Yuran Wang, Tao Gui

TL;DR
DVPO is a novel reinforcement learning framework that leverages distributional value modeling and risk-aware policy optimization to enhance robustness and generalization in large language model post-training under noisy supervision.
Contribution
The paper introduces an RL method that combines distributional value modeling with asymmetric risk regularization, aiming to improve training stability and performance under noisy, real-world supervision.
Findings
DVPO outperforms PPO, GRPO, and robust Bellman-based PPO across diverse tasks.
Tail shaping lets DVPO balance robustness against exploration.
Experiments demonstrate improved generalization in multi-turn dialogue, math reasoning, and scientific QA.
Abstract
Reinforcement learning (RL) has shown strong performance in LLM post-training, but real-world deployment often involves noisy or incomplete supervision. In such settings, complex and unreliable supervision signals can destabilize training and harm generalization. While existing approaches such as worst-case optimization (e.g., RFQI, CQL) and mean-based methods (e.g., PPO, GRPO) can improve stability, they often overlook generalization and may produce overly conservative policies, leading to uneven performance across diverse real scenarios. To this end, we introduce DVPO (Distributional Value Modeling with Risk-aware Policy Optimization), a new RL framework that combines conditional risk theory with distributional value modeling to better balance robustness and generalization. DVPO learns token-level value distributions to provide fine-grained supervision, and applies an asymmetric risk…
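The abstract is truncated here and does not say how the value distribution is parameterized or which risk measure is used. As a rough illustration of why a distributional value estimate carries more information than a mean, the sketch below assumes a value distribution represented by a handful of equally weighted quantile estimates and summarizes its tails with a CVaR-style average. The function names, the quantile representation, and the sample numbers are all hypothetical; this is not the authors' implementation.

```python
import math


def tail_count(num_quantiles, alpha):
    """Number of quantile estimates that fall in an alpha-fraction tail."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    return max(1, math.ceil(alpha * num_quantiles))


def lower_tail_cvar(quantiles, alpha):
    """Average of the worst alpha-fraction of equally weighted quantiles."""
    ordered = sorted(quantiles)
    k = tail_count(len(ordered), alpha)
    return sum(ordered[:k]) / k


def upper_tail_mean(quantiles, alpha):
    """Average of the best alpha-fraction of equally weighted quantiles."""
    ordered = sorted(quantiles, reverse=True)
    k = tail_count(len(ordered), alpha)
    return sum(ordered[:k]) / k


# Two hypothetical token-level value distributions (assumed numbers).
# Both have the same mean, so a mean-based critic cannot tell them apart.
narrow = [0.40, 0.45, 0.50, 0.55, 0.60]
wide = [-0.50, 0.20, 0.50, 0.80, 1.50]

mean_narrow = sum(narrow) / len(narrow)
mean_wide = sum(wide) / len(wide)

# The tails differ: `wide` has more downside risk and more upside.
low_narrow = lower_tail_cvar(narrow, alpha=0.4)
low_wide = lower_tail_cvar(wide, alpha=0.4)
high_narrow = upper_tail_mean(narrow, alpha=0.4)
high_wide = upper_tail_mean(wide, alpha=0.4)

print(f"means: {mean_narrow:.3f} vs {mean_wide:.3f}")
print(f"lower tails: {low_narrow:.3f} vs {low_wide:.3f}")
print(f"upper tails: {high_narrow:.3f} vs {high_wide:.3f}")
```

Treating the two tails separately is what would allow an asymmetric scheme, for example penalizing a heavy lower tail while leaving the upper tail available for exploration. How DVPO actually weights each tail is not recoverable from the truncated abstract.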
