Trust Region Masking for Long-Horizon LLM Reinforcement Learning

Yingru Li; Jiacai Liu; Jiawei Xu; Yuxuan Tong; Ziniu Li; Qian Liu; Baoxiang Wang

arXiv:2512.23075 · cs.LG · March 2, 2026

Open Access

TL;DR

This paper introduces Trust Region Masking, a novel method that provides the first non-vacuous guarantees for long-horizon reinforcement learning with large language models by controlling sequence-level divergences.

Contribution

It derives new divergence bounds that scale better with sequence length and proposes Trust Region Masking to ensure monotonic improvement in long-horizon LLM-RL.

Findings

01. Derived divergence bounds with improved scaling ($O(T^{3/2})$ and $O(T)$)

02. Proposed Trust Region Masking to control sequence divergence

03. Achieved the first non-vacuous guarantees for long-horizon LLM-RL
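
As a rough illustration of what masking at the sequence level can look like, the Python sketch below drops a whole sequence from the update when a per-token divergence proxy exceeds a threshold. The function `sequence_mask`, the absolute log-ratio proxy, and the threshold `delta` are hypothetical choices made for this example; they are not taken from the paper and should not be read as its criterion.

```python
from typing import List


def sequence_mask(
    logp_roll: List[List[float]],
    logp_theta: List[List[float]],
    delta: float,
) -> List[float]:
    """Return 1.0 for sequences to keep and 0.0 for sequences to mask.

    logp_roll / logp_theta hold, per sequence, the log-probabilities that the
    rollout policy and the trained policy assign to the sampled tokens.
    ASSUMPTION: the divergence proxy is the largest absolute per-token
    log-ratio in the sequence; the threshold delta is a free parameter.
    """
    mask = []
    for roll_seq, theta_seq in zip(logp_roll, logp_theta):
        # Worst-case token mismatch within this sequence.
        worst = max(abs(r - t) for r, t in zip(roll_seq, theta_seq))
        # The decision applies to the entire sequence, not to single tokens.
        mask.append(1.0 if worst <= delta else 0.0)
    return mask


if __name__ == "__main__":
    roll = [[-1.0, -2.0], [-1.0, -2.0]]
    theta = [[-1.05, -2.0], [-1.0, -3.0]]
    print(sequence_mask(roll, theta, delta=0.1))
```

The returned mask would multiply each sequence's contribution to the surrogate loss, so a masked sequence contributes no gradient at all.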

Abstract

Policy gradient methods for Large Language Models optimize a policy $\pi_{\theta}$ via a surrogate objective computed from samples of a rollout policy $\pi_{\mathrm{roll}}$. However, modern LLM-RL pipelines suffer from unavoidable implementation divergences -- backend discrepancies, Mixture-of-Experts routing discontinuities, and distributed training staleness -- causing off-policy mismatch ($\pi_{\mathrm{roll}} \neq \pi_{\theta}$) and approximation errors between the surrogate and the true objective. We demonstrate that classical trust region bounds on this error scale as $O(T^{2})$ with sequence length $T$, rendering them vacuous for long-horizon tasks. To address this, we derive a family of bounds -- both KL-based and TV-based -- including a Pinsker-Marginal bound ($O(T^{3/2})$), a Mixed bound ($O(T)$), and an Adaptive bound that strictly generalizes the Pinsker-Marginal bound via per-position…
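
To give a feel for the scaling gap described in the abstract, the Python sketch below compares only the three growth rates it names. The function name `growth_factors` and the unit constants are assumptions for illustration; the actual bounds also depend on divergence terms that are not reproduced here.

```python
from typing import Dict


def growth_factors(T: int) -> Dict[str, float]:
    """Growth of each bound family in sequence length T, with constants set to 1.

    ASSUMPTION: only the T-dependence stated in the abstract is modeled;
    divergence-dependent factors are deliberately left out.
    """
    return {
        "classical": float(T) ** 2,          # O(T^2)
        "pinsker_marginal": float(T) ** 1.5,  # O(T^{3/2})
        "mixed": float(T),                    # O(T)
    }


if __name__ == "__main__":
    for T in (128, 4096, 32768):
        g = growth_factors(T)
        # Ratio of the classical growth rate to each improved rate.
        print(
            f"T={T:>6}  classical/pinsker={g['classical'] / g['pinsker_marginal']:.1f}"
            f"  classical/mixed={g['classical'] / g['mixed']:.1f}"
        )
```

With unit constants, the classical rate exceeds the Mixed rate by a factor of $T$ and the Pinsker-Marginal rate by a factor of $\sqrt{T}$, which is why the gap widens as sequences get longer.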

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning