DVPO: Distributional Value Modeling-based Policy Optimization for LLM Post-Training

Dingwei Zhu; Zhiheng Xi; Shihan Dou; Yuhui Wang; Sixian Li; Junjie Ye; Honglin Guo; Shichun Liu; Chenhao Huang; Yajie Yang; Junlin Shang; Senjie Jin; Ming Zhang; Jiazheng Zhang; Caishuang Huang; Yunke Zhang; Yuran Wang; Tao Gui

arXiv:2512.03847·cs.LG·May 7, 2026

TL;DR

DVPO is a reinforcement learning framework that combines distributional value modeling with risk-aware policy optimization to improve robustness and generalization in large language model post-training under noisy supervision.

Contribution

The paper introduces an RL method that combines distributional value modeling with asymmetric risk regularization, improving training stability and performance in noisy, real-world scenarios.

Findings

01. DVPO outperforms PPO, GRPO, and robust Bellman-based PPO in diverse tasks.

02. It effectively balances robustness and exploration through tail shaping.

03. Experiments demonstrate improved generalization in multi-turn dialogue, math reasoning, and scientific QA.

Abstract

Reinforcement learning (RL) has shown strong performance in LLM post-training, but real-world deployment often involves noisy or incomplete supervision. In such settings, complex and unreliable supervision signals can destabilize training and harm generalization. While existing approaches such as worst-case optimization (e.g., RFQI, CQL) and mean-based methods (e.g., PPO, GRPO) can improve stability, they often overlook generalization and may produce overly conservative policies, leading to uneven performance across diverse real scenarios. To this end, we introduce DVPO (Distributional Value Modeling with Risk-aware Policy Optimization), a new RL framework that combines conditional risk theory with distributional value modeling to better balance robustness and generalization. DVPO learns token-level value distributions to provide fine-grained supervision, and applies an asymmetric risk…
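The abstract is cut off before it states the exact form of the risk term, so the following is only a generic sketch of the two ingredients it names, not the paper's formulation. It assumes the value distribution for one token is represented by a small set of quantile estimates, fits them with the standard quantile-regression (pinball) loss, and summarizes the lower and upper tails with CVaR-style tail means. All function names, the quantile representation, and the `alpha` tail fraction are assumptions made for illustration.

```python
def midpoint_taus(n):
    """Quantile levels at bin midpoints, e.g. n=4 -> 0.125, 0.375, 0.625, 0.875."""
    return [(i + 0.5) / n for i in range(n)]


def pinball_loss(quantiles, target, taus):
    """Average quantile-regression (pinball) loss of the estimates against one
    sampled return. Minimizing it over many samples moves each estimate
    toward the tau-th quantile of the return distribution."""
    total = 0.0
    for q, tau in zip(quantiles, taus):
        diff = target - q
        # Under-estimates are weighted by tau, over-estimates by (1 - tau).
        total += max(tau * diff, (tau - 1.0) * diff)
    return total / len(quantiles)


def lower_tail_mean(quantiles, alpha):
    """CVaR-style summary: mean of the lowest alpha fraction of the estimates."""
    k = max(1, int(round(alpha * len(quantiles))))
    return sum(sorted(quantiles)[:k]) / k


def upper_tail_mean(quantiles, alpha):
    """Mean of the highest alpha fraction of the estimates."""
    k = max(1, int(round(alpha * len(quantiles))))
    return sum(sorted(quantiles)[-k:]) / k


# Hypothetical quantile estimates of the return from one token position.
token_quantiles = [0.0, 1.0, 2.0, 3.0]
taus = midpoint_taus(len(token_quantiles))

loss = pinball_loss(token_quantiles, target=2.0, taus=taus)
low = lower_tail_mean(token_quantiles, alpha=0.5)
high = upper_tail_mean(token_quantiles, alpha=0.5)
```

An asymmetric treatment of the two tails could, under these assumptions, weight `low` and `high` differently in the training objective; how DVPO actually does this is not recoverable from the truncated abstract.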
