Rethinking Importance Sampling in LLM Policy Optimization: A Cumulative Token Perspective

Yuheng Zhang; Chenlu Ye; Shuowei Jin; Changlong Yu; Wei Xiong; Saurabh Sahu; Nan Jiang

arXiv:2605.07331·cs.LG·May 11, 2026

TL;DR

This paper introduces the cumulative token importance sampling ratio for unbiased and lower-variance policy gradient estimation in LLM training, leading to improved performance on reasoning benchmarks.

Contribution

The paper proposes the theoretically grounded cumulative token IS ratio and the CTPO algorithm, which combines unbiasedness with variance reduction and adaptive clipping.

Findings

1. CTPO achieves the best average performance on reasoning benchmarks.

2. The cumulative token IS ratio provides unbiased prefix correction.

3. CTPO outperforms GRPO and GSPO baselines in experiments.

Abstract

Reinforcement learning, including reinforcement learning with verifiable rewards (RLVR), has emerged as a powerful approach for LLM post-training. Central to these approaches is the design of the importance sampling (IS) ratio used in off-policy policy-gradient estimation. Existing methods face a fundamental bias-variance dilemma. Token-level IS ratios, as adopted by PPO (Schulman et al., 2017) and GRPO (Shao et al., 2024), introduce bias by ignoring prefix state distribution mismatch. Full-sequence ratios provide exact trajectory-level correction but suffer from high variance due to the multiplicative accumulation of per-token ratios. GSPO (Zheng et al., 2025) improves numerical stability via length normalization, at the cost of deviating from the exact full-sequence IS correction. In this work, we identify the cumulative token IS ratio, the product of per-token ratios up to…
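To make the ratios named in the abstract concrete, here is a minimal sketch in plain Python that computes each one from per-token log-probabilities of a single sampled response. It is an illustration under stated assumptions, not the paper's CTPO implementation: all function names are hypothetical, the abstract's definition of the cumulative ratio is truncated so the choice to include the current token in the product is an assumption, and the length-normalized variant is written as the geometric mean of per-token ratios.

```python
import math
from typing import List


def token_ratios(logp_new: List[float], logp_old: List[float]) -> List[float]:
    """Per-token ratios pi_new(y_t | prefix) / pi_old(y_t | prefix)."""
    return [math.exp(n - o) for n, o in zip(logp_new, logp_old)]


def cumulative_ratios(logp_new: List[float], logp_old: List[float]) -> List[float]:
    """Prefix products of the per-token ratios, accumulated in log space.

    ASSUMPTION: the product at position t includes token t itself; the
    truncated abstract does not state the exact upper index.
    """
    out = []
    running = 0.0
    for n, o in zip(logp_new, logp_old):
        running += n - o          # log of the product so far
        out.append(math.exp(running))
    return out


def sequence_ratio(logp_new: List[float], logp_old: List[float]) -> float:
    """Full-sequence ratio: product of all per-token ratios."""
    return math.exp(sum(logp_new) - sum(logp_old))


def length_normalized_ratio(logp_new: List[float], logp_old: List[float]) -> float:
    """Geometric mean of the per-token ratios (length normalization)."""
    return math.exp((sum(logp_new) - sum(logp_old)) / len(logp_new))


# Toy log-probabilities for a three-token response (made-up numbers).
logp_old = [-1.0, -2.0, -0.5]
logp_new = [-0.9, -2.2, -0.4]

per_token = token_ratios(logp_new, logp_old)
prefix = cumulative_ratios(logp_new, logp_old)
full = sequence_ratio(logp_new, logp_old)
normalized = length_normalized_ratio(logp_new, logp_old)
```

Under these assumptions, the last cumulative ratio coincides with the full-sequence ratio, while earlier positions multiply fewer per-token factors. Working in log space avoids overflow or underflow when many ratios are multiplied.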

Code & Models

Repositories

horizon-llm/CTPO (GitHub)
