Reinforcing Multi-Turn Reasoning in LLM Agents via Turn-Level Reward Design

Quan Wei; Siliang Zeng; Chenliang Li; William Brown; Oana Frunza; Wei Deng; Anderson Schneider; Yuriy Nevmyvaka; Yang Katie Zhao; Alfredo Garcia; Mingyi Hong

arXiv:2505.11821 · cs.LG · October 24, 2025

Open Access · 3 Reviews

TL;DR

This paper introduces turn-level reward design for multi-turn RL in LLM agents, significantly improving reasoning performance, stability, and convergence in complex multi-turn tasks.

Contribution

It systematically studies turn-level reward design and extends RL algorithms like GRPO and PPO for multi-turn reasoning, enabling finer credit assignment and better performance.

Findings

01 Enhanced accuracy in multi-turn reasoning tasks

02 Faster convergence and greater stability in training

03 Achieved 100% format correctness across datasets

Abstract

This paper investigates Reinforcement Learning (RL) approaches to enhance the reasoning capabilities of Large Language Model (LLM) agents in long-horizon, multi-turn scenarios. Although RL algorithms such as Group Relative Policy Optimization (GRPO) and Proximal Policy Optimization (PPO) have been widely applied to train multi-turn LLM agents, they typically rely only on sparse outcome rewards and lack dense intermediate signals across multiple decision steps, limiting their performance on complex reasoning tasks. To bridge this gap, we present the first systematic study of turn-level reward design for multi-turn RL algorithms and agent applications. By integrating turn-level rewards, we extend GRPO and PPO to their respective multi-turn variants, enabling fine-grained credit assignment. We conduct case studies on multi-turn reasoning-augmented search agents, where we carefully…
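
The fine-grained credit assignment mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names (`group_normalize`, `turn_level_advantages`), the `weight` parameter, and the rule for combining turn-level and outcome signals are all assumptions made for illustration. The sketch assumes every rollout in a group has the same number of intermediate turns followed by one final answer turn, normalizes each reward against the group (in the spirit of GRPO), and gives each intermediate turn its own turn-level advantage plus a weighted share of the outcome advantage.

```python
from statistics import mean, pstdev


def group_normalize(values, eps=1e-8):
    """Center and scale a group of rewards (group-relative baseline)."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def turn_level_advantages(turn_rewards, outcome_rewards, weight=1.0):
    """Hypothetical per-turn advantages for a group of rollouts.

    turn_rewards[i][t] is the intermediate reward of rollout i at turn t.
    outcome_rewards[i] is the final (sparse) reward of rollout i.
    Returns, per rollout, one advantage per intermediate turn followed by
    the advantage of the final answer turn.
    """
    num_turns = len(turn_rewards[0])
    if any(len(r) != num_turns for r in turn_rewards):
        # Assumption: the group shares a fixed number of turns.
        raise ValueError("all rollouts must have the same number of turns")

    outcome_adv = group_normalize(outcome_rewards)
    per_turn_adv = [
        group_normalize([r[t] for r in turn_rewards]) for t in range(num_turns)
    ]

    advantages = []
    for i in range(len(turn_rewards)):
        # Intermediate turns: own turn-level signal plus weighted outcome signal.
        adv_i = [per_turn_adv[t][i] + weight * outcome_adv[i]
                 for t in range(num_turns)]
        # Final turn: outcome signal only.
        adv_i.append(outcome_adv[i])
        advantages.append(adv_i)
    return advantages


if __name__ == "__main__":
    # Four rollouts, one intermediate (search) turn each, then a final answer.
    turn_rewards = [[1.0], [0.0], [1.0], [0.0]]
    outcome_rewards = [1.0, 0.0, 0.0, 0.0]
    for row in turn_level_advantages(turn_rewards, outcome_rewards):
        print([round(a, 3) for a in row])
```

In this toy group, the second and third rollouts both end with a wrong answer, so an outcome-only scheme would score them identically. With the turn-level term, the third rollout's useful intermediate turn receives a higher advantage than the second rollout's unhelpful one, which is the kind of distinction sparse outcome rewards cannot make.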

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. The work identifies and systematically tackles a fundamental flaw in applying RL to multi-turn LLM agents: the credit assignment problem. By shifting from sparse, end-of-task rewards to dense, turn-level rewards, the method provides the agent with much richer and more immediate feedback, which is crucial for learning complex sequences of actions.
2. The paper offers a detailed and practical framework for designing turn-level rewards, which is a significant contribution. It introduces two d

Weaknesses

1. **High Computational Complexity**: The proposed MT-GRPO method requires exponential trajectory samples, making it infeasible for long-horizon tasks. While MT-PPO reduces this cost via a critic model, it still introduces additional training overhead.
2. **Fixed-Turn Constraint Limits Flexibility**: MT-GRPO mandates all rollout groups to have the same number of turns, enforced through system prompts. This rigid structure hinders adaptability to dynamic scenarios where tasks may require var

Reviewer 02 · Rating 2 · Confidence 4

Strengths

1. The paper tackles an important problem: improving multi-turn reasoning in LLM agents through better reward shaping.
2. The distinction between single-turn and multi-turn MDP formulations is well presented and conceptually sound.
3. The paper is clearly written and easy to follow, with consistent notation and illustrative examples.

Weaknesses

1. The main contribution, introducing turn-level rewards into PPO/GRPO, is conceptually straightforward and closely related to prior work on process reward models (PRM) and segment-level credit assignment. The paper overstates its originality by claiming to be the “first systematic study” without adequately discussing or comparing against these prior methods.
2. The experiments are limited to search-based QA tasks, leaving it unclear whether the proposed framework generalizes to other multi-turn or op

Reviewer 03 · Rating 4 · Confidence 3

Strengths

- The algorithm is clearly explained through a specific case study.
- The paper studies a fundamental problem in multi-turn RL: the use of turn-level rewards.
- The multi-turn (MT) versions of PPO and GRPO outperform their standard counterparts.

Weaknesses

- Lack of theoretical support.
- Limited Baselines for Comparison: To provide a more comprehensive evaluation, additional baselines should be included, such as GRPO or PPO augmented with intrinsic rewards. The current comparisons are restricted to open-source LLMs and ablated variants of the algorithm, which may not fully benchmark the approach against state-of-the-art reinforcement learning methods in similar domains.
- Omission of Concurrent Works: The discussion should address relevant concur

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems