Trust Region Masking for Long-Horizon LLM Reinforcement Learning
Yingru Li, Jiacai Liu, Jiawei Xu, Yuxuan Tong, Ziniu Li, Qian Liu, Baoxiang Wang

TL;DR
This paper introduces Trust Region Masking, a novel method that provides the first non-vacuous guarantees for long-horizon reinforcement learning with large language models by controlling sequence-level divergences.
Contribution
It derives new divergence bounds that scale better with sequence length and proposes Trust Region Masking to ensure monotonic improvement in long-horizon LLM-RL.
Findings
Derived divergence bounds with improved scaling in sequence length $T$: $O(T^{3/2})$ (Pinsker-Marginal) and $O(T)$ (Mixed)
Proposed Trust Region Masking to control sequence divergence
Achieved the first non-vacuous guarantees for long-horizon LLM-RL
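To make the scaling claims concrete, the sketch below compares how $T^2$, $T^{3/2}$, and $T$ grow for a few sequence lengths. It is illustrative arithmetic only: the quadratic rate is assumed here as the classical baseline, and the constants and divergence terms that multiply these factors in the paper's bounds are not given in this summary, so they are omitted. The function name `growth_factors` is hypothetical.

```python
def growth_factors(T):
    """Return the raw growth factors for sequence length T (constants omitted)."""
    return {
        "quadratic": float(T) ** 2,       # assumed classical baseline rate
        "three_halves": float(T) ** 1.5,  # O(T^{3/2}) rate listed in the findings
        "linear": float(T),               # O(T) rate listed in the findings
    }

for T in (256, 4096, 65536):
    g = growth_factors(T)
    print(
        f"T={T:>6}  T^2={g['quadratic']:.3e}  "
        f"T^1.5={g['three_halves']:.3e}  T={g['linear']:.3e}"
    )
```

At $T = 4096$ the quadratic factor is 4096 times larger than the linear one, which is why a bound that is informative for short sequences can become vacuous at long horizons.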
Abstract
Policy gradient methods for Large Language Models optimize a policy via a surrogate objective computed from samples of a rollout policy. However, modern LLM-RL pipelines suffer from unavoidable implementation divergences -- backend discrepancies, Mixture-of-Experts routing discontinuities, and distributed training staleness -- causing off-policy mismatch and approximation errors between the surrogate and the true objective. We demonstrate that classical trust region bounds on this error scale as $O(T^2)$ with sequence length $T$, rendering them vacuous for long-horizon tasks. To address this, we derive a family of bounds -- both KL-based and TV-based -- including a Pinsker-Marginal bound ($O(T^{3/2})$), a Mixed bound ($O(T)$), and an Adaptive bound that strictly generalizes the Pinsker-Marginal bound via per-position…
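One way to picture sequence-level masking is sketched below. This is a minimal illustration under stated assumptions, not the paper's algorithm: the summary says only that Trust Region Masking controls sequence-level divergence, so the specific criterion used here (drop a whole sequence when its worst per-position KL divergence between the rollout and training policies exceeds a threshold), the threshold `delta`, and the function names `position_kl` and `trust_region_mask` are all hypothetical.

```python
import math

def position_kl(roll_logp, train_logp):
    """KL(rollout || training) at one position, from log-probabilities over the vocabulary."""
    return sum(math.exp(lr) * (lr - lt) for lr, lt in zip(roll_logp, train_logp))

def trust_region_mask(roll, train, delta):
    """Return one keep/drop flag per sequence.

    roll, train: lists of sequences; each sequence is a list of per-position
                 log-probability vectors (same vocabulary order in both).
    delta:       hypothetical per-position KL threshold (an assumption).
    """
    keep = []
    for roll_seq, train_seq in zip(roll, train):
        # Worst-case divergence over all positions in this sequence.
        worst = max(position_kl(r, t) for r, t in zip(roll_seq, train_seq))
        # Keep the sequence only if every position stays inside the trust region.
        keep.append(worst <= delta)
    return keep

# Toy example with a two-token vocabulary.
p = [math.log(0.5), math.log(0.5)]
q = [math.log(0.9), math.log(0.1)]
roll = [[p, p, p], [p, p, p]]
train = [[p, p, p], [p, q, p]]  # second sequence drifts at one position
keep = trust_region_mask(roll, train, delta=0.05)
print(keep)
```

In this toy input the first sequence has identical policies at every position and is kept, while the second has one position with a KL divergence of roughly 0.51 and is dropped entirely. Masking at the sequence level, rather than per token, is the design choice the sketch is meant to highlight.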
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
