DGPO: Distribution Guided Policy Optimization for Fine Grained Credit Assignment
Hongbo Jin, Rongpeng Zhu, Zhongjing Du, Xu Jiang, Jingqi Tian, Qiaoman Zhang, Jiayu Ding

TL;DR
DGPO introduces a critic-free reinforcement learning framework that enhances fine-grained credit assignment in language models by replacing KL divergence with Hellinger distance and employing entropy gating.
Contribution
It reinterprets distribution deviation as a guiding signal rather than a rigid penalty and adds an entropy gating mechanism, enabling more precise token-level credit assignment without an additional value network.
Findings
DGPO achieves state-of-the-art critic-free alignment performance.
On Qwen2.5-32B, DGPO attains 60.0% Avg@32 accuracy on AIME2024.
DGPO substantially outperforms baselines such as DAPO.
Abstract
Reinforcement learning is crucial for aligning large language models to perform complex reasoning tasks. However, current algorithms such as Group Relative Policy Optimization (GRPO) rely on coarse-grained, sequence-level credit assignment, which struggles to isolate pivotal reasoning steps within long Chain-of-Thought generations. Furthermore, the standard unbounded Kullback-Leibler (KL) divergence penalty induces severe gradient instability and mode-seeking conservatism, ultimately stifling the discovery of novel reasoning trajectories. To overcome these limitations, we introduce Distribution Guided Policy Optimization (DGPO), a novel critic-free reinforcement learning framework that reinterprets distribution deviation as a guiding signal rather than a rigid penalty. DGPO replaces the volatile KL divergence with the bounded Hellinger distance to safely quantify token-level exploration…
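The abstract's contrast between an unbounded KL divergence and a bounded Hellinger distance can be illustrated with a minimal sketch. It uses the standard textbook definitions for discrete distributions, H(p, q) = sqrt(0.5 * sum_i (sqrt(p_i) - sqrt(q_i))^2) and KL(p || q) = sum_i p_i * log(p_i / q_i). The function names and the toy three-token distributions are hypothetical, and the excerpt above does not show how DGPO actually applies the distance at the token level, so this is not the authors' implementation.

```python
import math


def hellinger(p, q):
    """Standard Hellinger distance between two discrete distributions.

    The value always lies in [0, 1]: 0 for identical distributions,
    1 for distributions with disjoint support.
    """
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2.0)


def kl_divergence(p, q):
    """Standard KL(p || q) in nats.

    Unbounded: it grows without limit as q assigns vanishing mass to
    outcomes that p favours, and is infinite if q is exactly zero there.
    """
    total = 0.0
    for a, b in zip(p, q):
        if a == 0.0:
            continue  # 0 * log(0 / b) is taken as 0 by convention
        if b == 0.0:
            return math.inf
        total += a * math.log(a / b)
    return total


# Hypothetical next-token distributions over a toy 3-token vocabulary.
# The policy concentrates on a token the reference considers very unlikely.
policy = [0.98, 0.01, 0.01]
reference = [1e-6, 0.5, 0.499999]

h_value = hellinger(policy, reference)
kl_value = kl_divergence(policy, reference)
print(f"Hellinger distance: {h_value:.3f}")
print(f"KL divergence:      {kl_value:.3f}")
```

In this toy case the policy puts most of its mass on a token the reference treats as nearly impossible. The KL term becomes large and would keep growing as the reference probability shrinks, while the Hellinger distance stays below 1 however far apart the two distributions are.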
