Sharpness-Guided Group Relative Policy Optimization via Probability Shaping
Tue Le, Linh Ngo Van, Trung Le

TL;DR
This paper introduces GRPO-SG, a simple token-weighted modification to Group Relative Policy Optimization, which reduces sharp updates and enhances generalization in reinforcement learning with verifiable rewards.
Contribution
The paper proposes a sharpness-guided variant of GRPO that aims to improve generalization and stability by downweighting tokens likely to cause overly large gradients.
Findings
GRPO-SG outperforms standard GRPO in reasoning tasks.
GRPO-SG yields smoother gradient trajectories during training.
Experiments show consistent improvements across mathematical reasoning, logic puzzles, and tool-augmented question answering.
Abstract
Reinforcement learning with verifiable rewards (RLVR) has become a practical route to improve large language model reasoning, and Group Relative Policy Optimization (GRPO) is a widely used optimizer in this setting. However, RLVR training is typically performed with limited control over generalization. We revisit GRPO through a robustness-based generalization view, where the generalization loss is upper bounded by a combination of the empirical loss and a sharpness surrogate measured by the gradient norm. Building on this perspective, we propose Sharpness-Guided GRPO (GRPO-SG), a simple token-weighted variant of GRPO that downweights tokens likely to cause overly large gradients, reducing sharp updates and stabilizing optimization, thereby improving generalization. Experiments across mathematical reasoning, logic puzzles and tool-augmented question answering show consistent improvements…
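The abstract does not give the exact weighting rule, so the following is only a minimal sketch of what a token-weighted GRPO-style loss could look like, not the authors' method. It relies on one standard fact: for a softmax policy, the gradient of log p(token) with respect to the logits is onehot(token) - p, so its norm is available from the forward pass. Everything else is an assumption made for illustration: the function names, the weight form 1 / (1 + norm / tau), the temperature `tau`, and the omission of GRPO's importance ratio, clipping, and KL term.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize rewards within one prompt's group."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def logit_grad_norm(probs, token_ids):
    """L2 norm of d log p(token) / d logits for a softmax policy.

    The gradient equals onehot(token) - probs, so no backward pass is needed.
    probs: (T, V) per-position distributions; token_ids: (T,) sampled tokens.
    """
    probs = np.asarray(probs, dtype=np.float64)
    token_ids = np.asarray(token_ids)
    onehot = np.zeros_like(probs)
    onehot[np.arange(len(token_ids)), token_ids] = 1.0
    return np.linalg.norm(onehot - probs, axis=-1)

def sharpness_weights(grad_norms, tau=1.0):
    """Hypothetical weight: decreases monotonically as the gradient norm grows."""
    return 1.0 / (1.0 + np.asarray(grad_norms) / tau)

def weighted_token_loss(probs, token_ids, advantage, tau=1.0):
    """Token-weighted policy-gradient surrogate for one sampled response.

    Weights are treated as constants (a stop-gradient in an autodiff framework).
    Importance ratios, clipping, and KL regularization are omitted for brevity.
    """
    probs = np.asarray(probs, dtype=np.float64)
    token_ids = np.asarray(token_ids)
    logp = np.log(probs[np.arange(len(token_ids)), token_ids])
    w = sharpness_weights(logit_grad_norm(probs, token_ids), tau)
    return -(w * advantage * logp).mean()
```

In this sketch, a token sampled with low probability has a large logit-gradient norm and therefore receives a smaller weight, which is one plausible reading of "downweights tokens likely to cause overly large gradients". The actual surrogate and weighting used by GRPO-SG may differ.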
