TL;DR
This paper introduces TIC-GRPO, an efficient and provably convergent reinforcement learning algorithm for fine-tuning large language models from human feedback, improving upon existing critic-free methods.
Contribution
It proposes TIC-GRPO, a novel trajectory-level importance correction method with convergence guarantees, and demonstrates its superior performance over prior critic-free algorithms.
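To make the token-level versus trajectory-level distinction concrete, here is a minimal NumPy sketch. It assumes the trajectory-level ratio is the product of the per-token probability ratios over one response; this summary does not give the paper's exact normalization, so the function names and toy probabilities below are illustrative only.

```python
import numpy as np

def token_level_ratios(logp_new, logp_old):
    """Per-token importance ratios pi_new(token) / pi_old(token)."""
    return np.exp(np.asarray(logp_new) - np.asarray(logp_old))

def trajectory_level_ratio(logp_new, logp_old):
    """One ratio per response: the product of its per-token ratios.

    ASSUMPTION: plain product over tokens (computed in log space);
    the paper may normalize this differently.
    """
    diff = np.asarray(logp_new) - np.asarray(logp_old)
    return np.exp(diff.sum(axis=-1))

# Toy response with two tokens (hypothetical probabilities).
logp_old = np.log(np.array([0.25, 0.5]))
logp_new = np.log(np.array([0.5, 0.5]))

per_token = token_level_ratios(logp_new, logp_old)          # one weight per token
per_trajectory = trajectory_level_ratio(logp_new, logp_old)  # one weight per response
```

With a token-level ratio, each token in a response receives its own weight; with a trajectory-level ratio, every token in the response shares a single weight.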
Findings
TIC-GRPO converges faster than GRPO.
TIC-GRPO achieves comparable or better performance on math reasoning and coding tasks.
The simplified variant, which removes importance sampling entirely, performs comparably to standard GRPO.
Abstract
Group Relative Policy Optimization (GRPO), recently introduced by DeepSeek, is a critic-free reinforcement learning algorithm for fine-tuning large language models. GRPO replaces the value function in Proximal Policy Optimization (PPO) with group-normalized rewards while retaining PPO-style token-level importance sampling based on an old policy. Our theoretical analysis reveals that the GRPO update rule estimates the policy gradient at the old policy rather than the current one; however, since the old policy is refreshed every few steps, the resulting discrepancy remains small and the induced bias is negligible in practice. To empirically validate this insight, we conduct an ablation study that entirely removes importance sampling and performs multiple optimization steps using gradients estimated at a fixed old policy. Remarkably, this simplified variant attains performance comparable…
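The two ingredients named in the abstract, group-normalized rewards and PPO-style token-level importance sampling, can be sketched as follows. This is a hedged illustration rather than the authors' implementation: it assumes every response in a group has the same length (no padding mask), it normalizes by the group mean and standard deviation, and the clipping defaults of 0.2 are placeholder values.

```python
import numpy as np

def group_normalized_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of sampled responses.

    ASSUMPTION: mean/std normalization; eps only guards against a zero std.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def grpo_surrogate(logp_new, logp_old, advantages, eps_low=0.2, eps_high=0.2):
    """PPO-style clipped surrogate with token-level ratios.

    logp_new, logp_old: arrays of shape (G, T) with per-token log-probs
    advantages:         array of shape (G,), one value per response
    """
    ratios = np.exp(logp_new - logp_old)       # (G, T) token-level ratios
    adv = np.asarray(advantages)[:, None]      # broadcast each advantage over its tokens
    unclipped = ratios * adv
    clipped = np.clip(ratios, 1.0 - eps_low, 1.0 + eps_high) * adv
    return np.minimum(unclipped, clipped).mean()

# Toy group of four responses with binary rewards (hypothetical data).
adv = group_normalized_advantages([1.0, 0.0, 0.0, 1.0])
logp = np.log(np.full((4, 3), 0.5))
value_at_old_policy = grpo_surrogate(logp, logp, adv)
```

When the current policy equals the old policy, every ratio is 1 and the surrogate reduces to the mean advantage, which is zero after normalization.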
Peer Reviews
Decision·Submitted to ICLR 2026
- Originality: The work is the first to reveal that GRPO essentially performs gradient estimation at the old policy, and it uses ablation studies to validate this insight, laying an intuitive foundation for further improvements.
- Clarity: Concepts, formulas, and proofs are well presented, and the appendices are comprehensive; however, the meaning of some symbols is not explained.
1. In Eq. (7), the subsequent proofs bound some error terms by problem-dependent constants, whereas other bounds are independent of hyper-parameters. Yet in RL the policy changes little between two consecutive steps. What, then, is the justification for decomposing the expression into so many terms in Eq. (7)?
2. The upper bound in the theorem does not contain the hyper-parameters $\epsilon_{high}$ and $\epsilon_{low}$. Does this mean their values do not affect the bound? If so, can the bound be
- Clear theoretical motivation and correction. The decomposition in Eq. 7 demonstrates that GRPO’s update estimates ∇J at π_old rather than π, and TIC-GRPO’s trajectory-level ratio restores unbiasedness. The analysis bridges empirical intuition with formal theory.
- Provable convergence guarantees. Theorems 5.1–5.2 give the first formal stationary-point convergence bounds for GRPO-style methods, showing improved asymptotic dependence after removing terms M_N and σ²_sT,N.
- Simple yet effective m
- Limited originality relative to concurrent work. The key modification, trajectory-level ratios, is nearly identical to GSPO (Zheng et al., 2025), which the authors acknowledge. While TIC-GRPO adds theoretical analysis and slightly different normalization, the conceptual leap is incremental.
- Experiments are limited in scope. The evaluation focuses on AIME reasoning benchmarks, which are small-scale and synthetic. It’s unclear whether TIC-GRPO generalizes to more diverse RLHF settings (e.g., pre
This work provides the first rigorous convergence analysis for GRPO-style methods, a popular class of critic-free RLHF algorithms. By establishing formal convergence guarantees under standard assumptions, the paper fills a critical theoretical gap in the literature. The convergence analysis is built on a solid foundation of standard and reasonable assumptions. The paper delivers a crucial and insightful finding and elegantly explains why GRPO works in practice despite the bias. This theoretical
**Narrow and Potentially Insufficient Empirical Validation**: Conducting experiments on only one benchmark (AIME) is highly unusual and insufficient to establish generalizability. A review of other GRPO-related papers (e.g., DeepSeekMath, GSPO) shows they typically use multiple benchmarks. The failure to include, for example, AIME-25, significantly weakens the persuasiveness of the empirical claims.
**Lack of Experiments Directly Supporting Theoretical Claims**: A major contribution is the co
