SeeUPO: Sequence-Level Agentic-RL with Convergence Guarantees

Tianyi Hu; Qingxu Fu; Yanxi Chen; Zhaoyang Liu; Bolin Ding

arXiv:2602.06554·cs.AI·February 9, 2026

Open Access

TL;DR

This paper introduces SeeUPO, a new RL algorithm for large language model agents that guarantees convergence in multi-turn interactions by modeling them as sequential multi-agent bandit problems, improving stability and performance.

Contribution

We propose SeeUPO, a critic-free, sequence-level RL method with proven convergence guarantees for multi-turn LLM agent training, addressing limitations of existing algorithms.

Findings

01 SeeUPO achieves 43.3%-54.6% relative gains on Qwen3-14B.

02 SeeUPO demonstrates superior training stability over existing algorithms.

03 It guarantees convergence to the global optimum in multi-turn scenarios.

Abstract

Reinforcement learning (RL) has emerged as the predominant paradigm for training large language model (LLM)-based AI agents. However, existing backbone RL algorithms lack verified convergence guarantees in agentic scenarios, especially in multi-turn settings, which can lead to training instability and failure to converge to optimal policies. In this paper, we systematically analyze how different combinations of policy update mechanisms and advantage estimation methods affect convergence properties in single-turn and multi-turn scenarios. We find that REINFORCE with Group Relative Advantage Estimation (GRAE) can converge to the global optimum under undiscounted conditions, but that combining PPO with GRAE breaks PPO's original monotonic improvement property. Furthermore, we demonstrate that mainstream backbone RL algorithms cannot simultaneously achieve both critic-free and convergence…
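To make the "REINFORCE with GRAE" combination named in the abstract more concrete, here is a minimal Python sketch. It is not taken from the paper: the excerpt does not give the exact GRAE definition, so the sketch assumes the common group-relative form in which each sampled sequence's advantage is its reward minus the mean reward of its group, with an optional division by the group standard deviation. The function names `group_relative_advantages` and `reinforce_surrogate_loss` are hypothetical, and plain floats stand in for the sequence log-probabilities a real model would produce.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, normalize_std=False, eps=1e-8):
    """Critic-free advantages for one group of sampled sequences.

    Assumption (not confirmed by the excerpt): the advantage of each
    sequence is its reward minus the group mean, optionally divided by
    the group standard deviation.
    """
    if not rewards:
        raise ValueError("rewards must contain at least one value")
    baseline = mean(rewards)  # the group mean replaces a learned critic
    advantages = [r - baseline for r in rewards]
    if normalize_std:
        scale = pstdev(rewards) + eps  # eps avoids division by zero
        advantages = [a / scale for a in advantages]
    return advantages


def reinforce_surrogate_loss(seq_logprobs, advantages):
    """Negative advantage-weighted mean of sequence-level log-probabilities.

    Minimizing this value corresponds to a REINFORCE-style update in which
    each whole sequence is weighted by its group-relative advantage.
    """
    if len(seq_logprobs) != len(advantages):
        raise ValueError("seq_logprobs and advantages must have equal length")
    weighted = sum(lp * a for lp, a in zip(seq_logprobs, advantages))
    return -weighted / len(advantages)


# Four sampled responses to the same prompt, with binary task rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
loss = reinforce_surrogate_loss([-1.0, -2.0, -2.0, -1.0], advantages)
```

Because the baseline is computed from the sampled group itself, no value network is required, which is the sense of "critic-free" in this sketch. Whether SeeUPO uses this exact estimator is not stated in the excerpt.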

Taxonomy

Topics: Reinforcement Learning in Robotics · Topic Modeling · Multimodal Machine Learning Applications