On the Effect of Negative Gradient in Group Relative Deep Reinforcement Optimization

Wenlong Deng; Yi Ren; Muchen Li; Danica J. Sutherland; Xiaoxiao Li; Christos Thrampoulidis

arXiv:2505.18830·cs.LG·May 27, 2025

TL;DR

This paper investigates the effect of negative gradients in Group Relative Policy Optimization (GRPO), identifies the cause of Lazy Likelihood Displacement (LLD), and proposes NTHR, which downweights penalties on the tokens that contribute to LLD, to improve model performance on reasoning tasks.

Contribution

It provides a theoretical analysis of GRPO's learning dynamics, introduces NTHR to mitigate LLD, and demonstrates improved reasoning performance across multiple models.

Findings

01 NTHR effectively reduces Lazy Likelihood Displacement.

02 Models with NTHR show consistent performance gains on math reasoning benchmarks.

03 Theoretical analysis links LLD to naive token penalization in GRPO.

Abstract

Reinforcement learning (RL) has become popular in enhancing the reasoning capabilities of large language models (LLMs), with Group Relative Policy Optimization (GRPO) emerging as a widely used algorithm in recent systems. Despite GRPO's widespread adoption, we identify a previously unrecognized phenomenon we term Lazy Likelihood Displacement (LLD), wherein the likelihood of correct responses marginally increases or even decreases during training. This behavior mirrors a recently discovered misalignment issue in Direct Preference Optimization (DPO), attributed to the influence of negative gradients. We provide a theoretical analysis of GRPO's learning dynamic, identifying the source of LLD as the naive penalization of all tokens in incorrect responses with the same strength. To address this, we develop a method called NTHR, which downweights penalties on tokens contributing to the LLD.…
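The mechanism the abstract describes can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the use of the population standard deviation, and the hand-picked `negative_weights` are all assumptions made for illustration, and the sketch does not reproduce how NTHR actually scores tokens. It only shows the contrast between penalizing every token of an incorrect response with the same strength and downweighting the penalty on selected tokens.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt.

    Assumption: population standard deviation; implementations differ here.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def token_coefficients(advantage, num_tokens, negative_weights=None):
    """Per-token multipliers applied to the log-likelihood gradient.

    With negative_weights=None every token gets the same coefficient, which
    is the uniform penalization the abstract identifies as the source of LLD.
    negative_weights (hypothetical, values in [0, 1]) only rescales tokens of
    responses with a negative advantage.
    """
    if advantage >= 0 or negative_weights is None:
        return [advantage] * num_tokens
    if len(negative_weights) != num_tokens:
        raise ValueError("need one weight per token")
    return [advantage * w for w in negative_weights]


def likelihood_displacement(logp_before, logp_after):
    """Change in log-likelihood of a correct response after an update.

    LLD describes the case where this is only marginally positive,
    or negative.
    """
    return logp_after - logp_before


# Toy group: two correct (reward 1) and two incorrect (reward 0) responses.
rewards = [1.0, 0.0, 0.0, 1.0]
advs = group_relative_advantages(rewards)

# Uniform penalty on a 3-token incorrect response.
uniform = token_coefficients(advs[1], 3)

# Same response, with the penalty on the middle token downweighted.
weighted = token_coefficients(advs[1], 3, negative_weights=[1.0, 0.2, 1.0])
```

In this toy group, correct responses receive a positive advantage and incorrect ones a negative advantage of the same magnitude. The only difference between `uniform` and `weighted` is the smaller penalty on the middle token; positive-advantage responses are left unchanged by the weights.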

Peer Reviews

Decision·NeurIPS 2025 poster

Reviewer 01·Rating 4·Confidence 5

Strengths: This paper identifies and analyzes a previously overlooked issue, Lazy Likelihood Displacement (LLD), in GRPO-based reinforcement learning for LLMs, and proposes a novel solution, NTHR, that selectively reduces penalties on certain tokens to mitigate this effect. The work is well-motivated, theoretically grounded, and supported by comprehensive experiments across multiple model sizes, demonstrating consistent performance gains and strong practical relevance.

Weaknesses: 1. The author

Reviewer 02·Rating 4·Confidence 4

Strengths:
- The paper is well-structured and provides a clear, step-by-step explanation of the problem.
- The paper is sound; the theorem and the lemma are correct.

Weaknesses:
- The paper lacks theoretical analysis of the impact of using the new loss (convergence to an optimal policy).
- The experiments are limited to Qwen2.5 on math reasoning. It has been shown that this model can exhibit unexpected behavior, especially in this setting, compared to others [1]. It would be better to conf

Reviewer 03·Rating 4·Confidence 3

Strengths: The theoretical analysis in this paper is robust, although Assumption 4.3 might be somewhat oversimplified.

Weaknesses:
- The experimental analysis lacks comprehensiveness, as the paper only tests on mathematical datasets.
- Although it is intuitively plausible that LLD might lead to decreased performance, the paper provides neither theoretical analysis nor experimental examples to support this claim.

Taxonomy

Topics: Advanced Computing and Algorithms