Turn-PPO: Turn-Level Advantage Estimation with PPO for Improved Multi-Turn RL in Agentic LLMs

Junbo Li; Peng Zhou; Rui Meng; Meet P. Vadera; Lihong Li; Yang Li

arXiv:2512.17008·cs.LG·January 27, 2026

TL;DR

This paper introduces turn-PPO, a PPO variant that estimates advantages on a turn-level MDP rather than the usual token-level MDP. On the multi-turn WebShop and Sokoban tasks, it trains agentic language models more stably and effectively than GRPO and token-level PPO.

Contribution

The paper proposes turn-PPO, a turn-level advantage estimation strategy that outperforms GRPO and token-level PPO in multi-turn RL tasks for language models.

Findings

01. turn-PPO shows improved stability and performance in multi-turn RL tasks.
02. turn-PPO outperforms GRPO and token-level PPO on the WebShop and Sokoban datasets.
03. turn-PPO effectively handles long-horizon reasoning in multi-turn environments.

Abstract

Reinforcement learning (RL) has re-emerged as a natural approach for training interactive LLM agents in real-world environments. However, directly applying the widely used Group Relative Policy Optimization (GRPO) algorithm to multi-turn tasks exposes notable limitations, particularly in scenarios requiring long-horizon reasoning. To address these challenges, we investigate more stable and effective advantage estimation strategies, especially for multi-turn settings. We first explore Proximal Policy Optimization (PPO) as an alternative and find it to be more robust than GRPO. To further enhance PPO in multi-turn scenarios, we introduce turn-PPO, a variant that operates on a turn-level MDP formulation, as opposed to the commonly used token-level MDP. Our results on the WebShop and Sokoban datasets demonstrate the effectiveness of turn-PPO, both with and without long reasoning components.
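To make the turn-level versus token-level distinction concrete, the following is a minimal sketch of one plausible reading of turn-level advantage estimation. It is not the authors' implementation. It assumes that each turn (one full agent response) is treated as a single action, that a critic supplies one value estimate per turn, and that Generalized Advantage Estimation (GAE) is run over turns, with every token in a turn sharing that turn's advantage. The function names `gae` and `turn_level_advantages`, the per-token broadcast, and the default `gamma` and `lam` values are illustrative assumptions that do not come from this summary.

```python
from typing import List, Sequence, Tuple


def gae(rewards: Sequence[float], values: Sequence[float],
        gamma: float = 1.0, lam: float = 1.0) -> List[float]:
    """Generalized Advantage Estimation over one terminated episode.

    rewards[t] and values[t] are indexed by decision step t. The value
    after the final step is taken to be 0 because the episode has ended.
    """
    assert len(rewards) == len(values)
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        # One-step TD error at decision step t.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def turn_level_advantages(
    turn_rewards: Sequence[float],
    turn_values: Sequence[float],
    tokens_per_turn: Sequence[int],
    gamma: float = 1.0,
    lam: float = 1.0,
) -> Tuple[List[float], List[float]]:
    """Hypothetical turn-level estimator: a decision step is a whole turn.

    Returns one advantage per turn, plus the same advantages broadcast to
    every token of the corresponding turn (an assumption of this sketch).
    """
    turn_adv = gae(turn_rewards, turn_values, gamma, lam)
    per_token: List[float] = []
    for adv, n_tokens in zip(turn_adv, tokens_per_turn):
        per_token.extend([adv] * n_tokens)
    return turn_adv, per_token


# Three-turn episode with a single terminal reward of 1.
turn_adv, token_adv = turn_level_advantages(
    turn_rewards=[0.0, 0.0, 1.0],
    turn_values=[0.25, 0.5, 0.75],
    tokens_per_turn=[2, 1, 3],
)
```

In a token-level formulation the same `gae` routine would instead be called with one reward and one value per generated token, so the critic has to assign credit across hundreds of intermediate token states. Under the assumptions of this sketch, the turn-level version has only as many decision steps as there are environment interactions.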

Taxonomy

Topics: Reinforcement Learning in Robotics · Multimodal Machine Learning Applications · Artificial Intelligence in Games