Stabilizing Off-Policy Training for Long-Horizon LLM Agent via Turn-Level Importance Sampling and Clipping-Triggered Normalization

Chenliang Li; Adel Elmahdy; Alex Boyd; Zhongruo Wang; Siliang Zeng; Alfredo Garcia; Parminder Bhatia; Taha Kass-Hout; Cao Xiao; Mingyi Hong

arXiv:2511.20718 · cs.LG · February 26, 2026

TL;DR

This paper introduces SORL, a novel framework that stabilizes off-policy reinforcement learning for long-horizon multi-turn language model agents by addressing instability sources like token-turn mismatch and high-variance updates.

Contribution

The paper proposes SORL, a new method with mechanisms to align policy optimization with multi-turn interactions and suppress unreliable off-policy updates, improving training stability for LLM agents.

Findings

01. SORL prevents training instabilities and performance collapse.

02. It maintains lower clipping ratios and more stable optimization trajectories.

03. SORL achieves superior or comparable task performance across benchmarks.

Abstract

Reinforcement learning (RL) algorithms such as PPO and GRPO are widely used to train large language models (LLMs) for multi-turn agentic tasks. However, in off-policy training pipelines, these methods often exhibit unstable optimization dynamics and are prone to performance collapse. Through empirical analysis, we identify two fundamental sources of instability in this setting: (1) a granularity mismatch between token-level policy optimization and turn-structured interactions, and (2) high-variance and unreliable gradient updates induced by off-policy importance sampling and inaccurate advantage estimation. To address these challenges, we propose SORL, Stabilizing Off-Policy Reinforcement Learning for Long-Horizon Agent Training. SORL introduces principled mechanisms that align policy optimization with the structure of multi-turn…
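The granularity mismatch named in point (1) can be made concrete with a small sketch. The abstract is truncated here and does not give SORL's formulas, so the code below is not the paper's method: it assumes, purely for illustration, that a turn-level importance ratio is the geometric mean of the token-level ratios within a turn, and that clipping uses a PPO-style interval [1 - eps, 1 + eps]. The function names, the toy log-probabilities, and eps = 0.2 are all hypothetical. The clipping-triggered normalization from the title is not sketched, because this page does not describe it.

```python
import math

def token_ratios(logp_new, logp_old):
    """Per-token importance ratios pi_new / pi_old, computed from log-probabilities."""
    return [math.exp(n - o) for n, o in zip(logp_new, logp_old)]

def turn_ratios(logp_new, logp_old, turn_ids):
    """One ratio per turn.

    Assumption for this sketch: the turn-level ratio is the geometric mean
    of the token ratios in that turn (mean of log-ratios, then exp).
    """
    sums, counts = {}, {}
    for n, o, t in zip(logp_new, logp_old, turn_ids):
        sums[t] = sums.get(t, 0.0) + (n - o)
        counts[t] = counts.get(t, 0) + 1
    return {t: math.exp(sums[t] / counts[t]) for t in sums}

def clip_fraction(ratios, eps=0.2):
    """Fraction of ratios outside the PPO-style interval [1 - eps, 1 + eps]."""
    ratios = list(ratios)
    return sum(1 for r in ratios if abs(r - 1.0) > eps) / len(ratios)

# Hypothetical trajectory: four tokens spread over two turns.
logp_old = [-1.0, -1.0, -1.0, -1.0]
logp_new = [-0.5, -1.5, -1.0, -1.0]
turn_ids = [0, 0, 1, 1]

tok = token_ratios(logp_new, logp_old)
turn = turn_ratios(logp_new, logp_old, turn_ids)

print("token-level clip fraction:", clip_fraction(tok))
print("turn-level clip fraction:", clip_fraction(turn.values()))
```

In this toy case the two tokens of turn 0 move in opposite directions, so both are clipped at the token level while their turn-level ratio stays at 1. Under the stated assumptions, that is one way per-token clipping can disagree with what happened at the level of a whole turn.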

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques