GTPO: Stabilizing Group Relative Policy Optimization via Gradient and Entropy Control
Marco Simoni, Aleksandar Fontana, Giulio Rossolini, Andrea Saracino, Paolo Mori

TL;DR
This paper introduces GTPO, a policy optimization method for Large Language Model alignment that stabilizes training through gradient and entropy control. It targets two failure modes of GRPO: conflicting gradient updates on tokens shared across responses, and policy collapse.
Contribution
GTPO proposes gradient skipping and entropy filtering techniques to improve training stability and performance without relying on KL-divergence regularization.
Findings
GTPO achieves more stable training compared to GRPO.
GTPO improves performance on multiple benchmark datasets.
GTPO does not require a reference model during training.
Abstract
Group Relative Policy Optimization (GRPO) is a promising policy-based approach for Large Language Model alignment, yet its performance is often limited by training instability and suboptimal convergence. In this paper, we identify and analyze two main GRPO issues: (i) token-level penalization, where valuable tokens shared across different responses receive contradictory feedback signals, leading to conflicting gradient updates that can reduce their likelihood; and (ii) policy collapse, where negatively rewarded completions may penalize confident responses and shift model decisions toward unlikely tokens, destabilizing the training process. To address these issues, we introduce GTPO (Group-relative Trajectory-based Policy Optimization), which prevents conflicting gradients on valuable tokens by skipping negative updates while amplifying positive ones and filters out completions whose…
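To make the first issue concrete, the following is a minimal sketch of how shared tokens can receive contradictory signals under group-relative advantages, and how skipping negative updates on them might look. It is an illustration under stated assumptions, not the authors' implementation: the function names, the `boost` factor, and the rule that a token is "conflicting" when it occurs in both a positively and a negatively rewarded completion are all hypothetical. The entropy-based filtering step is not sketched here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize each reward within its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def conflict_aware_token_weights(completions, rewards, boost=2.0):
    """Return one weight per token for a policy-gradient loss.

    Assumption (hypothetical rule, not taken from the paper): a token is
    'conflicting' if it appears in at least one completion with positive
    advantage and at least one with negative advantage. `boost` is an
    illustrative amplification factor.
    """
    advantages = group_relative_advantages(rewards)

    # Collect the tokens seen in positively and negatively rewarded completions.
    pos_tokens, neg_tokens = set(), set()
    for tokens, adv in zip(completions, advantages):
        if adv > 0:
            pos_tokens.update(tokens)
        elif adv < 0:
            neg_tokens.update(tokens)
    conflicting = pos_tokens & neg_tokens

    weights = []
    for tokens, adv in zip(completions, advantages):
        row = []
        for tok in tokens:
            if tok in conflicting and adv < 0:
                row.append(0.0)          # skip the negative update
            elif tok in conflicting and adv > 0:
                row.append(boost * adv)  # amplify the positive update
            else:
                row.append(adv)          # plain group-relative advantage
        weights.append(row)
    return weights
```

For two completions such as `["the", "answer", "is", "4"]` (rewarded) and `["the", "answer", "is", "5"]` (not rewarded), plain group-relative advantages would push the shared prefix up in one completion and down in the other. Under this sketch, the shared prefix gets zero weight in the unrewarded completion, and only the differing final token is penalized.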
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Reinforcement Learning in Robotics · Topic Modeling
