When to Stop Reusing: Dynamic Gradient Gating for Sample-Efficient RLVR
Yuchun Miao, Sen Zhang, Yuqi Zhang, Yaorui Shi, Qi Gu, Xunliang Cai, Lefei Zhang

TL;DR
This paper introduces Dynamic Gradient Gating (DGG), a method that monitors the lm_head gradient norm to decide when to stop reusing a rollout batch, preventing the performance degradation that batch reuse causes in RLVR. It reports up to 2.93x sample efficiency and a 2.14x speedup.
Contribution
The paper identifies the Disproportionate Weight Divergence (DWD) phenomenon and proposes DGG, a real-time gradient monitoring technique that improves sample efficiency in RLVR.
Findings
The DWD phenomenon emerges consistently across diverse LLMs and tasks.
DGG achieves up to 2.93x sample efficiency and a 2.14x speedup.
The lm_head gradient norm correlates with policy divergence and performance degradation.
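The idea of gating batch reuse on the lm_head gradient norm can be illustrated with a minimal sketch. This summary does not state the paper's actual gating rule, so everything specific here is an assumption made for illustration: the function names `l2_norm` and `reuse_steps_allowed`, the ratio-to-first-update criterion, and the default `ratio_threshold` of 2.0 are all hypothetical.

```python
import math
from typing import Iterable, Sequence


def l2_norm(grads: Iterable[Sequence[float]]) -> float:
    """L2 norm over every entry of one parameter's gradient (e.g. lm_head)."""
    return math.sqrt(sum(g * g for row in grads for g in row))


def reuse_steps_allowed(
    lm_head_grad_norms: Sequence[float],
    ratio_threshold: float = 2.0,  # hypothetical value, not from the paper
) -> int:
    """Count how many updates to take on one rollout batch before stopping.

    Assumed rule (illustrative only): stop at the first update whose
    lm_head gradient norm exceeds `ratio_threshold` times the norm
    observed at the first update on this batch.
    """
    if not lm_head_grad_norms:
        return 0
    baseline = lm_head_grad_norms[0]
    for step, norm in enumerate(lm_head_grad_norms):
        if step > 0 and norm > ratio_threshold * baseline:
            return step  # updates 0 .. step-1 are kept, reuse stops here
    return len(lm_head_grad_norms)


# Example: the norm is stable for three updates, then surges.
norms = [1.0, 1.1, 1.3, 4.0, 5.0]
print(reuse_steps_allowed(norms))
```

In a real training loop the norms would be read after each backward pass from the output projection's gradient rather than supplied as a list; how that tensor is accessed depends on the model implementation and is not specified in this summary.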
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has become the dominant paradigm for advanced reasoning in Large Language Models (LLMs), but rollout samples are expensive to obtain, making sample efficiency a critical bottleneck. A natural remedy is to reuse each rollout batch for multiple gradient updates, a standard practice in classical RL. Yet in RLVR, this amplifies policy shift, leading to severe performance degradation. Detecting the onset of degradation early enough to stop reuse remains an open and challenging problem. We close this gap by identifying the Disproportionate Weight Divergence (DWD) phenomenon: performance degradation is synchronized with a sharp surge in the lm_head weight change, while intermediate layers remain stable. Empirically, we verify that DWD emerges consistently across diverse LLMs and tasks. Theoretically, we prove that (i)…
