The Optimal Token Baseline: Variance Reduction for Long-Horizon LLM-RL

Yingru Li; Jiawei Xu; Ziniu Li; Jiacai Liu; Wei Liu; Yuxuan Tong; Longtao Zheng; Zhenghai Xue; Yaxiang Zhang; Tianle Cai; Ge Zhang; Qian Liu; Baoxiang Wang

arXiv:2602.07078 · cs.LG · February 10, 2026

TL;DR

This paper introduces the Optimal Token Baseline (OTB), a variance reduction method for long-horizon LLM-RL training that improves stability and efficiency by weighting gradient updates inversely to their cumulative gradient norm.

Contribution

The paper derives the OTB from first principles and proposes the Logit-Gradient Proxy, which approximates the gradient norm from forward-pass probabilities alone, making variance reduction that accounts for token heterogeneity practical in LLM-RL.

Findings

01. Achieves training stability comparable to large group baselines with fewer tokens.

02. Reduces token consumption by over 65% in various tasks.

03. Matches performance of larger group sizes with significantly less computational cost.

Abstract

Reinforcement Learning (RL) for Large Language Models (LLMs) often suffers from training collapse in long-horizon tasks due to exploding gradient variance. To mitigate this, a baseline is commonly introduced for advantage computation; however, traditional value models remain difficult to optimize, and standard group-based baselines overlook sequence heterogeneity. Although classic optimal baseline theory can achieve global variance reduction, it neglects token heterogeneity and requires prohibitive gradient-based computation. In this work, we derive the Optimal Token Baseline (OTB) from first principles, proving that gradient updates should be weighted inversely to their cumulative gradient norm. To ensure efficiency, we propose the Logit-Gradient Proxy that approximates the gradient norm using only forward-pass probabilities. Our method achieves training stability and matches the…
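To make the two ideas in the abstract concrete, here is a minimal sketch under stated assumptions; it is not the paper's algorithm, whose token-level derivation is not reproduced in this listing. It uses two things that are standard: for a softmax policy, the gradient of log p(y) with respect to the logits is e_y − p, so its squared norm 1 − 2p_y + Σ_k p_k² can be read off forward-pass probabilities; and the classic optimal baseline has the form b = Σ_i w_i R_i / Σ_i w_i, with w_i a squared gradient norm. Using the sum of per-token logit-space squared norms as w_i is an assumption made here for illustration (it ignores cross-token terms and the mapping from logits to parameters), and all function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token position's logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logit_grad_sq_norm(probs, token):
    """Squared norm of d log p[token] / d logits, i.e. ||e_token - probs||^2.

    Expanding gives 1 - 2*p[token] + sum_k p[k]^2, so only forward-pass
    probabilities are needed (no backward pass).
    """
    return 1.0 - 2.0 * probs[token] + sum(p * p for p in probs)


def cumulative_weight(step_logits, tokens):
    """Assumed per-response weight: sum of per-token squared norms."""
    return sum(
        logit_grad_sq_norm(softmax(logits), tok)
        for logits, tok in zip(step_logits, tokens)
    )


def weighted_baseline(rewards, weights):
    """Classic gradient-norm-weighted baseline: sum(w*R) / sum(w)."""
    total = sum(weights)
    if total == 0.0:
        # Degenerate case: fall back to the plain group mean.
        return sum(rewards) / len(rewards)
    return sum(w * r for w, r in zip(weights, rewards)) / total


# Toy group of two sampled responses over a 3-token vocabulary (made-up data).
group_logits = [
    [[2.0, 0.0, 0.0], [0.0, 0.0, 0.0]],  # response 0: two tokens
    [[5.0, 0.0, 0.0]],                   # response 1: one confident token
]
group_tokens = [[0, 1], [0]]
rewards = [1.0, 0.0]

weights = [cumulative_weight(l, t) for l, t in zip(group_logits, group_tokens)]
baseline = weighted_baseline(rewards, weights)
advantages = [r - baseline for r in rewards]
```

In this toy group the longer, less confident response carries the larger weight, so the baseline is pulled toward its reward rather than sitting at the plain group mean; that is the kind of sequence heterogeneity a uniform group average ignores.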

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications