On the Effect of Negative Gradient in Group Relative Deep Reinforcement Optimization
Wenlong Deng, Yi Ren, Muchen Li, Danica J. Sutherland, Xiaoxiao Li, Christos Thrampoulidis

TL;DR
This paper investigates the effect of negative gradients in Group Relative Policy Optimization (GRPO), identifies the cause of Lazy Likelihood Displacement (LLD), and proposes NTHR to improve model performance on reasoning tasks.
Contribution
It provides a theoretical analysis of GRPO's learning dynamics, introduces NTHR to mitigate LLD, and demonstrates improved reasoning performance across multiple models.
Findings
NTHR effectively reduces Lazy Likelihood Displacement.
Models with NTHR show consistent performance gains on math reasoning benchmarks.
Theoretical analysis links LLD to naive token penalization in GRPO.
Abstract
Reinforcement learning (RL) has become popular in enhancing the reasoning capabilities of large language models (LLMs), with Group Relative Policy Optimization (GRPO) emerging as a widely used algorithm in recent systems. Despite GRPO's widespread adoption, we identify a previously unrecognized phenomenon we term Lazy Likelihood Displacement (LLD), wherein the likelihood of correct responses marginally increases or even decreases during training. This behavior mirrors a recently discovered misalignment issue in Direct Preference Optimization (DPO), attributed to the influence of negative gradients. We provide a theoretical analysis of GRPO's learning dynamics, identifying the source of LLD as the naive penalization of all tokens in incorrect responses with the same strength. To address this, we develop a method called NTHR, which downweights penalties on tokens contributing to the LLD.…
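To make the abstract's description concrete, the following is a minimal sketch, not the paper's implementation. It shows how group-relative advantages are commonly computed (standardizing rewards within one group of sampled responses) and how an incorrect response ends up with the same negative coefficient on every token. The helper names, the population standard deviation, the `flagged` token indices, and the `down_weight` factor are all assumptions for illustration; the text above does not specify how NTHR selects tokens or how strongly it downweights them.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt.

    Assumption: population standard deviation; implementations may differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def token_coefficients(advantage, num_tokens, flagged=(), down_weight=1.0):
    """Per-token coefficients multiplying the log-likelihood gradient.

    With no flagged tokens, every token of a response shares the same
    coefficient, which is the uniform penalization the abstract describes.
    `flagged` and `down_weight` are hypothetical: they stand in for whatever
    rule selects the tokens whose penalty should be reduced.
    """
    coeffs = [advantage] * num_tokens
    if advantage < 0:  # only penalties on incorrect responses are softened
        for t in flagged:
            coeffs[t] = advantage * down_weight
    return coeffs


# One group of four sampled responses: reward 1.0 = correct, 0.0 = incorrect.
rewards = [1.0, 0.0, 0.0, 1.0]
adv = group_relative_advantages(rewards)

# Uniform penalty on a 5-token incorrect response.
uniform = token_coefficients(adv[1], num_tokens=5)

# Same response with a reduced penalty on two hypothetically flagged tokens.
softened = token_coefficients(adv[1], num_tokens=5, flagged=[1, 3], down_weight=0.5)
```

In this sketch, `uniform` carries the same negative value at every position, while `softened` halves the penalty at positions 1 and 3 and leaves the rest unchanged.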
Peer Reviews
Decision·NeurIPS 2025 poster
Strengths: This paper identifies and analyzes a previously overlooked issue—Lazy Likelihood Displacement (LLD)—in GRPO-based reinforcement learning for LLMs, and proposes a novel solution, NTHR, that selectively reduces penalties on certain tokens to mitigate this effect. The work is well-motivated, theoretically grounded, and supported by comprehensive experiments across multiple model sizes, demonstrating consistent performance gains and strong practical relevance. Weaknesses: 1. The author
Strengths: - The paper is well-structured and provides a clear, step-by-step explanation of the problem. - The paper is sound; the theorem and the lemma are correct. Weaknesses: - The paper lacks theoretical analysis of the impact of using the new loss (convergence to an optimal policy). - The experiments are limited to Qwen2.5 on math reasoning. It has been shown that this model can exhibit unexpected behavior, especially in this setting, compared to others [1]. It would be better to conf
Strengths: The theoretical analysis in this paper is robust, although Assumption 4.3 might be somewhat oversimplified. Weaknesses: - The experimental analysis lacks comprehensiveness, as the paper only tests on mathematical datasets. - Although it seems intuitive that LLD might lead to decreased performance, the paper provides neither theoretical analysis nor experimental examples to support this claim.
Taxonomy
Topics: Advanced Computing and Algorithms
