DFPO: Scaling Value Modeling via Distributional Flow towards Robust and Generalizable LLM Post-Training

Dingwei Zhu; Zhiheng Xi; Shihan Dou; Jiahan Li; Chenhao Huang; Junjie Ye; Sixian Li; Mingxu Chai; Yuhui Wang; Yajie Yang; Ming Zhang; Jiazheng Zhang; Shichun Liu; Caishuang Huang; Yunke Zhang; Yuran Wang; Tao Gui; Xipeng Qiu; Qi Zhang; Xuanjing Huang

arXiv:2602.05890·cs.LG·May 7, 2026

TL;DR

DFPO introduces a continuous flow-based distributional RL framework that enhances robustness and generalization in LLM post-training by modeling value functions as flows, stabilizing training under noisy supervision.

Contribution

The paper proposes a value flow modeling approach with conditional risk and consistency controls, improving robustness and out-of-domain generalization in distributional RL for LLMs.

Findings

01. DFPO outperforms PPO and FlowRL on dialogue, math reasoning, and scientific tasks.

02. It achieves better training stability under noisy supervision.

03. DFPO enhances out-of-domain generalization in complex tasks.

Abstract

Training reinforcement learning (RL) systems in real-world environments remains challenging due to noisy supervision and poor out-of-domain (OOD) generalization, especially in LLM post-training. Recent distributional RL methods improve robustness by modeling values with multiple quantile points, but they still learn each quantile independently as a scalar. This results in coarse-grained value representations that lack fine-grained conditioning on state information and struggle under complex and OOD conditions. We propose DFPO (Distributional Value Flow Policy Optimization with Conditional Risk and Consistency Control), a robust distributional RL framework that models values as continuous flows across time steps. By scaling value modeling through learning of a value flow field instead of isolated quantile predictions, DFPO captures richer state information for more accurate advantage…
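To make the baseline concrete, the sketch below illustrates the quantile-based value modeling that the abstract contrasts DFPO with: each quantile level is fit as its own scalar using the standard quantile-regression (pinball) loss. This is not DFPO's method and is not taken from the paper. The function names `pinball_loss` and `fit_quantiles`, the learning rate, the epoch count, and the toy return samples are all assumptions chosen for illustration.

```python
def pinball_loss(pred, target, tau):
    """Quantile-regression (pinball) loss for one prediction at level tau.

    Hypothetical helper for illustration; not from the DFPO paper.
    """
    diff = target - pred
    # Under-prediction is weighted by tau, over-prediction by (1 - tau).
    return tau * diff if diff >= 0 else (tau - 1.0) * diff


def fit_quantiles(samples, taus, lr=0.05, epochs=200):
    """Fit one independent scalar per quantile level by subgradient descent.

    Each entry of `preds` is updated separately, mirroring the
    "each quantile as an isolated scalar" setup described in the abstract.
    """
    preds = [0.0 for _ in taus]
    for _ in range(epochs):
        for x in samples:
            for i, tau in enumerate(taus):
                # Subgradient of the pinball loss with respect to preds[i].
                grad = -tau if x >= preds[i] else (1.0 - tau)
                preds[i] -= lr * grad
    return preds


if __name__ == "__main__":
    # Toy "returns" observed from a single state (assumed data).
    returns = [float(v) for v in range(11)]
    levels = [0.1, 0.5, 0.9]
    for tau, q in zip(levels, fit_quantiles(returns, levels)):
        print(f"tau={tau}: estimated quantile {q:.2f}")
```

In this sketch the quantile estimates share no parameters and carry no state conditioning, which is the limitation the abstract attributes to isolated quantile predictions and that a learned value flow field is meant to address.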
