TL;DR
This paper introduces DGPO, a novel reinforcement learning optimization method for large language models that stabilizes training by decoupling probability gradient decay, improving exploration and performance.
Contribution
It proposes a decoupled decay mechanism based on importance sampling ratios, addressing the divergent gradient weights that soft clipping produces in Reinforcement Learning with Verifiable Rewards (RLVR) for LLM training.
Findings
DGPO outperforms strong baselines on mathematical benchmarks.
Extensive experiments on models from 1.5B to 14B parameters show improved stability and exploration.
The code is publicly available at https://github.com/FlyTune/DGPO-RL.
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has catalyzed a leap in Large Language Model (LLM) reasoning, yet its optimization dynamics remain fragile. Standard algorithms like GRPO enforce stability via "hard clipping", which inadvertently stifles exploration by discarding gradients of tokens outside the trust region. While recent "soft clipping" methods attempt to recover these gradients, they suffer from a critical challenge: relying on log-probability gradient ($\nabla_\theta \log \pi_\theta$) yields divergent weights as probabilities vanish, destabilizing LLM training. We rethink this convention by establishing probability gradient ($\nabla_\theta \pi_\theta$) as the superior optimization primitive. Accordingly, we propose Decoupled Gradient Policy Optimization (DGPO), which employs a decoupled decay mechanism based on importance sampling ratios. By applying asymmetric,…
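The two failure modes the abstract describes can be illustrated with a small sketch. This is not DGPO itself, whose decay rule is not given in the text above. It shows only the standard PPO/GRPO-style clipped surrogate, whose gradient with respect to the importance ratio is zero outside the trust region, and the identity grad(log pi) = grad(pi) / pi, whose 1/pi factor grows without bound as the token probability approaches zero. The function names and the clip range `eps=0.2` are illustrative assumptions.

```python
def clipped_surrogate_grad_wrt_ratio(ratio, advantage, eps=0.2):
    """Slope of min(r*A, clip(r, 1-eps, 1+eps)*A) with respect to r.

    Hypothetical helper for illustration; eps=0.2 is an assumed clip range.
    """
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    unclipped_term = ratio * advantage
    clipped_term = clipped * advantage
    # When the unclipped term is selected by the min, the slope is A.
    if unclipped_term <= clipped_term:
        return advantage
    # Otherwise the clipped term is active; it is constant in r, so the
    # token contributes no gradient ("hard clipping").
    return 0.0


def log_prob_grad_weight(prob):
    """Factor multiplying grad(pi) in grad(log pi) = grad(pi) / pi."""
    return 1.0 / prob


if __name__ == "__main__":
    # Ratios outside [1 - eps, 1 + eps] on the clipped side get zero gradient.
    for ratio, adv in [(1.0, 1.0), (1.5, 1.0), (0.5, -1.0)]:
        print(ratio, adv, clipped_surrogate_grad_wrt_ratio(ratio, adv))
    # The 1/pi weight grows as the token probability shrinks.
    for p in (1e-1, 1e-3, 1e-6):
        print(p, log_prob_grad_weight(p))
```

In this sketch a token with ratio 1.5 and positive advantage is dropped from the update entirely, while a method that keeps it and weights the log-probability gradient inherits the 1/pi factor, which is the divergence the abstract attributes to soft clipping.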
