TL;DR
This paper introduces the cumulative token importance sampling ratio for unbiased and lower-variance policy gradient estimation in LLM training, leading to improved performance on reasoning benchmarks.
Contribution
It proposes the theoretically grounded cumulative token IS ratio and the CTPO algorithm, combining unbiasedness with variance reduction and adaptive clipping.
Findings
CTPO achieves the best average performance on reasoning benchmarks.
The cumulative token IS ratio provides unbiased prefix correction.
CTPO outperforms GRPO and GSPO baselines in experiments.
Abstract
Reinforcement learning, including reinforcement learning with verifiable rewards (RLVR), has emerged as a powerful approach for LLM post-training. Central to these approaches is the design of the importance sampling (IS) ratio used in off-policy policy-gradient estimation. Existing methods face a fundamental bias-variance dilemma. Token-level IS ratios, as adopted by PPO (Schulman et al., 2017) and GRPO (Shao et al., 2024), introduce bias by ignoring the mismatch in the prefix state distribution. Full-sequence ratios provide exact trajectory-level correction but suffer from high variance due to the multiplicative accumulation of per-token ratios. GSPO (Zheng et al., 2025) improves numerical stability via length normalization, at the cost of deviating from the exact full-sequence IS correction. In this work, we identify the cumulative token IS ratio, the product of per-token ratios up to…
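The four ratio designs contrasted in the abstract can be sketched side by side for a single sampled response. This is only an illustrative sketch, not the paper's implementation: the function name `is_ratios` and its arguments are hypothetical, the cumulative ratio is assumed to run through the current token inclusive (the abstract is truncated at that point), and the length-normalized variant is assumed to be the geometric mean of the per-token ratios. Clipping and the CTPO objective itself are not shown.

```python
import math


def is_ratios(logp_new, logp_old):
    """Compute four IS-ratio variants for one response.

    logp_new, logp_old: per-token log-probabilities of the sampled tokens
    under the current policy and the behavior (old) policy.
    Hypothetical helper for illustration only.
    """
    # Per-token log-ratios: log pi_new(y_t | prefix) - log pi_old(y_t | prefix)
    log_r = [n - o for n, o in zip(logp_new, logp_old)]

    # Token-level ratio (PPO/GRPO style): each token corrected in isolation.
    token = [math.exp(x) for x in log_r]

    # Cumulative token ratio: product of per-token ratios over the prefix.
    # ASSUMPTION: the product includes position t itself.
    cumulative, running = [], 0.0
    for x in log_r:
        running += x
        cumulative.append(math.exp(running))

    # Full-sequence ratio: product over all tokens (one scalar per response).
    total = sum(log_r)
    sequence = math.exp(total)

    # Length-normalized sequence ratio (GSPO style, assumed geometric mean).
    length_normalized = math.exp(total / len(log_r))

    return token, cumulative, sequence, length_normalized


# Toy example: per-token probabilities under the new and old policies.
new_probs = [0.5, 0.2, 0.9]
old_probs = [0.25, 0.4, 0.3]
token, cumulative, sequence, length_normalized = is_ratios(
    [math.log(p) for p in new_probs],
    [math.log(p) for p in old_probs],
)
```

In this toy example the per-token ratios are approximately 2, 0.5 and 3, so the cumulative ratios are approximately 2, 1 and 3. The last cumulative value coincides with the full-sequence ratio, while earlier positions only carry the correction for their own prefix, which is why the multiplicative accumulation is shorter for early tokens than under a full-sequence ratio.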
