From $\log \pi$ to $\pi$: Taming Divergence in Soft Clipping via Bilateral Decoupled Decay of Probability Gradient Weight

Xiaoliang Fu; Jiaye Lin; Yangyi Fang; Chaowen Hu; Cong Qin; Zekai Shao; Binbin Zheng; Lu Pan; Ke Zeng

arXiv:2603.14389 · cs.LG · April 21, 2026

TL;DR

This paper introduces Decoupled Gradient Policy Optimization (DGPO), a reinforcement learning method for large language models. It stabilizes training by weighting the probability gradient with a decoupled decay instead of relying on the log-probability gradient, and it reports better exploration and performance.

Contribution

It proposes a decoupled decay mechanism based on importance sampling ratios, addressing divergence issues in soft clipping for RLVR in LLM training.

Findings

01 DGPO outperforms strong baselines on mathematical benchmarks.

02 Extensive experiments on models from 1.5B to 14B parameters show improved stability and exploration.

03 The code is publicly available at https://github.com/FlyTune/DGPO-RL.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has catalyzed a leap in Large Language Model (LLM) reasoning, yet its optimization dynamics remain fragile. Standard algorithms like GRPO enforce stability via "hard clipping", which inadvertently stifles exploration by discarding gradients of tokens outside the trust region. While recent "soft clipping" methods attempt to recover these gradients, they suffer from a critical challenge: relying on log-probability gradient ($\nabla_{\theta} \log \pi_{\theta}$) yields divergent weights as probabilities vanish, destabilizing LLM training. We rethink this convention by establishing probability gradient ($\nabla_{\theta} \pi_{\theta}$) as the superior optimization primitive. Accordingly, we propose Decoupled Gradient Policy Optimization (DGPO), which employs a decoupled decay mechanism based on importance sampling ratios. By applying asymmetric,…
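The two issues named in the abstract can be illustrated with a small numerical sketch. It rests on the standard identity $\nabla_{\theta} \log \pi_{\theta} = \nabla_{\theta} \pi_{\theta} / \pi_{\theta}$: read as a weight on the probability gradient, the log-probability gradient carries a factor $1/\pi_{\theta}$ that grows without bound as the token probability vanishes. The sketch also shows how a PPO/GRPO-style hard clip drops the gradient of tokens outside the trust region. This is not the DGPO implementation; the toy softmax policy over logits, the function names, and the clip range `eps=0.2` are assumptions chosen for illustration.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()


def grad_log_prob(logits, a):
    """Gradient of log pi(a) w.r.t. the logits: onehot(a) - pi."""
    pi = softmax(logits)
    grad = -pi
    grad[a] += 1.0
    return grad


def grad_prob(logits, a):
    """Gradient of pi(a) w.r.t. the logits: pi(a) * (onehot(a) - pi)."""
    pi = softmax(logits)
    grad = -pi[a] * pi
    grad[a] += pi[a]
    return grad


def hard_clip_keeps_gradient(ratio, advantage, eps=0.2):
    """Whether min(r*A, clip(r, 1-eps, 1+eps)*A) has a nonzero gradient in r.

    eps=0.2 is an assumed illustrative value, not taken from the paper.
    """
    if advantage > 0:
        return ratio < 1.0 + eps
    if advantage < 0:
        return ratio > 1.0 - eps
    return False


if __name__ == "__main__":
    # As the second logit drops, pi(a) vanishes and the implied weight
    # 1/pi(a) on the probability gradient grows without bound.
    for low in (-2.0, -10.0, -20.0):
        logits = np.array([0.0, low])
        pi_a = softmax(logits)[1]
        print(f"pi(a)={pi_a:.3e}  implied weight 1/pi(a)={1.0 / pi_a:.3e}")

    # Hard clipping: a token with positive advantage whose ratio already
    # exceeds 1 + eps contributes no gradient at all.
    print(hard_clip_keeps_gradient(1.5, advantage=1.0))
```

Soft clipping methods keep a nonzero gradient for such out-of-range tokens; the sketch only shows why a weight expressed through $\nabla_{\theta} \log \pi_{\theta}$ can blow up for low-probability tokens, which is the problem the abstract says DGPO addresses.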

Code & Models

Repositories

FlyTune/DGPO-RL (GitHub)