SeeUPO: Sequence-Level Agentic-RL with Convergence Guarantees
Tianyi Hu, Qingxu Fu, Yanxi Chen, Zhaoyang Liu, Bolin Ding

TL;DR
This paper introduces SeeUPO, a new RL algorithm for large language model agents that guarantees convergence in multi-turn interactions by modeling them as sequential multi-agent bandit problems, improving stability and performance.
Contribution
We propose SeeUPO, a critic-free, sequence-level RL method with proven convergence guarantees for multi-turn LLM agent training, addressing limitations of existing algorithms.
Findings
SeeUPO achieves 43.3%-54.6% relative gains on Qwen3-14B.
SeeUPO demonstrates superior training stability over existing algorithms.
It guarantees convergence to the global optimum in multi-turn scenarios.
Abstract
Reinforcement learning (RL) has emerged as the predominant paradigm for training large language model (LLM)-based AI agents. However, existing backbone RL algorithms lack verified convergence guarantees in agentic scenarios, especially in multi-turn settings, which can lead to training instability and failure to converge to optimal policies. In this paper, we systematically analyze how different combinations of policy update mechanisms and advantage estimation methods affect convergence properties in single-turn and multi-turn scenarios. We find that REINFORCE with Group Relative Advantage Estimation (GRAE) can converge to the global optimum under undiscounted conditions, but the combination of PPO and GRAE breaks PPO's original monotonic improvement property. Furthermore, we demonstrate that mainstream backbone RL algorithms cannot simultaneously achieve both critic-free and convergence…
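The abstract names Group Relative Advantage Estimation (GRAE) but does not define it here. As an illustration only, the sketch below assumes the common group-relative form, in which several responses are sampled for the same prompt and each response's advantage is its reward minus the group mean, so no learned critic is needed. The function name, the `normalize` flag, and the example rewards are hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, normalize=False, eps=1e-8):
    """Return one advantage per sampled response in a group.

    Assumption (not from the paper): the advantage is the reward minus the
    group mean, optionally divided by the group standard deviation.
    """
    if not rewards:
        raise ValueError("rewards must contain at least one value")
    baseline = mean(rewards)  # group mean stands in for a critic's value estimate
    centered = [r - baseline for r in rewards]
    if not normalize:
        return centered
    scale = pstdev(rewards) + eps  # eps avoids division by zero for constant groups
    return [c / scale for c in centered]


# Hypothetical sequence-level rewards for four responses to one prompt.
group_rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(group_rewards)
normalized = group_relative_advantages(group_rewards, normalize=True)
```

In this sketch the centered advantages always sum to zero within a group, which is why the group mean can play the role of a baseline in a REINFORCE-style update.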
Taxonomy
Topics: Reinforcement Learning in Robotics · Topic Modeling · Multimodal Machine Learning Applications
