Not All Tokens Are Needed (NAT): Token-Efficient Reinforcement Learning
Hejian Sang, Yuanda Xu, Zhengze Zhou, Ran He, Zhipeng Wang

TL;DR
NAT lowers the training cost of reinforcement learning for language models by updating the policy on only a selected subset of generated tokens, maintaining full-token performance at lower compute cost.
Contribution
The paper proposes NAT, a novel framework that employs unbiased partial-token policy-gradient estimation to efficiently scale RL with long sequences.
Findings
NAT matches full-token RL performance while updating on only 50% of tokens.
RPC reduces GPU memory by 18% and training time by 29%.
NAT enables more scalable RL for long chain-of-thought trajectories.
Abstract
Reinforcement learning (RL) has become a key driver of progress in large language models, but scaling RL to long chain-of-thought (CoT) trajectories is increasingly constrained by backpropagation over every generated token. Even with optimized rollout engines, full-token updates can consume a large fraction of total training cost, turning token length into a hidden tax on RL. We introduce Not All Tokens Are Needed (NAT), a unified framework that makes the token budget a first-class optimization primitive. NAT updates the policy using only a selected subset of generated tokens while preserving the learning signal of full-sequence RL. The core idea is an unbiased partial-token policy-gradient estimator via Horvitz-Thompson reweighting, which ensures statistically correct gradients despite subsampling. We instantiate NAT with two simple, plug-and-play token selection schemes: Uniform…
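The Horvitz-Thompson idea mentioned in the abstract can be illustrated with a small sketch. It assumes the simplest possible selection scheme, where each token is kept independently with the same probability `keep_prob`; this is an illustrative assumption, not necessarily either of the paper's two schemes. The scalar `per_token_terms` stand in for per-token gradient contributions, and the function names are hypothetical. Dividing each kept term by its inclusion probability makes the subsampled sum equal the full sum in expectation.

```python
import random


def ht_subsampled_sum(per_token_terms, keep_prob, rng):
    """Horvitz-Thompson estimate of sum(per_token_terms).

    Assumption: every token is kept independently with probability
    keep_prob. A kept term is divided by keep_prob, so each token
    contributes its true value in expectation.
    """
    if not 0.0 < keep_prob <= 1.0:
        raise ValueError("keep_prob must be in (0, 1]")
    total = 0.0
    for term in per_token_terms:
        if rng.random() < keep_prob:      # token selected for the update
            total += term / keep_prob     # reweight by 1 / inclusion probability
    return total


def mean_estimate(per_token_terms, keep_prob, trials, seed=0):
    """Average many independent estimates to check unbiasedness empirically."""
    rng = random.Random(seed)
    draws = (ht_subsampled_sum(per_token_terms, keep_prob, rng) for _ in range(trials))
    return sum(draws) / trials


if __name__ == "__main__":
    terms = [0.5, -1.0, 2.0, 0.25]        # toy per-token contributions
    print("full sum:         ", sum(terms))
    print("mean HT estimate: ", mean_estimate(terms, keep_prob=0.5, trials=20000))
```

Any single estimate is noisier than the full sum, which is the price paid for touching fewer tokens; the average over many draws converges to the full-token value.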
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Machine Learning in Materials Science
