MUA-RL: Multi-turn User-interacting Agent Reinforcement Learning for agentic tool use

Weikang Zhao; Xili Wang; Chengdi Ma; Lingbin Kong; Zhaohua Yang; Mingxiang Tuo; Xiaowei Shi; Yitao Zhai; Xunliang Cai

arXiv:2508.18669·cs.AI·August 27, 2025

TL;DR

This paper introduces MUA-RL, a reinforcement learning framework that integrates simulated users into multi-turn interactions, enabling agents to better understand user needs and utilize tools effectively in dynamic scenarios.

Contribution

MUA-RL is the first RL approach to incorporate simulated users during training for agentic tool use in multi-turn interactions.
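To make the idea of a user inside the rollout concrete, here is a minimal sketch of a multi-turn loop in which a simulated user, an agent, and a tool take turns. It is not the authors' implementation. Everything in it is an assumption for illustration: the scripted user stands in for an LLM-simulated user, and the names `scripted_user`, `toy_agent`, `cancel_order`, the `###STOP###` marker, and the binary outcome reward are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Turn:
    role: str      # "user", "agent", or "tool"
    content: str


def scripted_user(history: List[Turn]) -> str:
    """Hypothetical stand-in for an LLM-simulated user that reveals needs gradually."""
    agent_msgs = [t for t in history if t.role == "agent"]
    if not agent_msgs:
        return "I want to cancel an order."
    if "order id" in agent_msgs[-1].content.lower():
        return "The order id is A17."
    return "###STOP###"  # assumed end-of-conversation marker


def toy_agent(history: List[Turn]) -> dict:
    """Hypothetical policy: ask for missing information, then call a tool."""
    last = history[-1]
    if last.role == "tool":
        return {"type": "message", "content": "Done: " + last.content}
    user_text = " ".join(t.content for t in history if t.role == "user")
    if "A17" in user_text:
        return {"type": "tool_call", "name": "cancel_order",
                "args": {"order_id": "A17"}}
    return {"type": "message", "content": "Could you share the order id?"}


def cancel_order(db: Dict[str, str], order_id: str) -> str:
    """Toy tool that mutates an in-memory database."""
    db[order_id] = "cancelled"
    return f"order {order_id} cancelled"


def rollout(agent: Callable, user: Callable, tools: Dict[str, Callable],
            max_steps: int = 10) -> Tuple[List[Turn], float]:
    db = {"A17": "active"}
    history = [Turn("user", user([]))]
    for _ in range(max_steps):
        action = agent(history)
        if action["type"] == "tool_call":
            # Tool results go back to the agent, not to the user.
            result = tools[action["name"]](db, **action["args"])
            history.append(Turn("tool", result))
            continue
        history.append(Turn("agent", action["content"]))
        reply = user(history)
        if reply == "###STOP###":
            break
        history.append(Turn("user", reply))
    # Assumed outcome-based reward: 1 if the final state matches the goal.
    reward = 1.0 if db["A17"] == "cancelled" else 0.0
    return history, reward


history, reward = rollout(toy_agent, scripted_user, {"cancel_order": cancel_order})
```

In an actual training setup the scripted user would be replaced by an LLM call and the reward would feed a policy-gradient update; the sketch only shows how user turns and tool calls interleave within one trajectory.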

Findings

01

MUA-RL-32B outperforms larger models on multiple benchmarks.

02

The framework effectively improves multi-turn tool use capabilities.

03

Agents trained with MUA-RL demonstrate enhanced communication and problem-solving skills.

Abstract

With the recent rapid advancement of Agentic Intelligence, agentic tool use in LLMs has become increasingly important. During multi-turn interactions between agents and users, the dynamic, uncertain, and stochastic nature of user demands poses significant challenges to the agent's tool invocation capabilities. Agents are no longer expected to simply call tools to deliver a result; rather, they must iteratively refine their understanding of user needs through communication while simultaneously invoking tools to resolve user queries. Existing reinforcement learning (RL) approaches for tool use lack the integration of genuinely dynamic users during the RL training process. To bridge this gap, we introduce MUA-RL (Multi-turn User-interacting Agent Reinforcement Learning for agentic tool use), a novel reinforcement learning framework that, for the first time in the field of agentic tool use,…

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 4 · Confidence 3

Strengths

- The motivation is very compelling. Specifically, the authors note that users often adjust their questions and expectations based on the model’s responses.
- The paper moves beyond static datasets or scripted dialogues and frames this problem of agent and user co-evolution.
- The method MUA-RL combines synthetic cold-start data generation with interactive RL training, producing a more realistic environment for the agent.
- The evaluation spans four major multi-turn benchmarks—TAU1-Bench, TAU2-Be

Weaknesses

- The paper uses simulated users as a proxy for human behavior, which is reasonable, but lacks empirical validation with actual human users. This could cause overfitting to synthetic conversational patterns. I suggest the authors do a user study with real human users to benchmark their method, as the title of the paper is also specifically addressing users. There needs to be validation of the LLM-simulated users as well.
- There is no discussion of the safety risks of the method and how it affe

Reviewer 02 · Rating 6 · Confidence 2

Strengths

1. This work proposes a framework to integrate LLM-simulated users into RL rollouts for agentic tool use, addressing the critical gap of dynamic user interactions in existing RL methods.
2. MUA-RL-32B matches or outperforms much larger models (DeepSeek-V3-0324, Qwen3-235B-A22B) across multiple benchmarks, demonstrating remarkable efficiency gains.
3. The paper provides a detailed analysis of the training dynamics. This analysis is insightful and better demonstrates the MUA-RL process.

Weaknesses

1. This work only uses retail and airline datasets from TAU1-Bench for RL training, which follow a distribution similar to that of the testing environment, potentially limiting generalization to other domains.
2. The proposed method relies on GPT-4o as the user simulator during training. This is costly and may not be scalable.

Reviewer 03 · Rating 2 · Confidence 3

Strengths

1. Integration with MCP tool server with reinforcement learning and user interaction
2. Creating an RL loop on three interesting interactive tasks with a user: Tao bench, Berkeley Function-Calling Leaderboard, ACEBench Agent
3. Performs an ablation comparing non-thinking and cold start
4. The paper ablates different user LLMs that have a different LLM backend

Weaknesses

1. The paper states that it introduces “a novel multi-turn user-interacting reinforcement learning framework that incorporates LLM-simulated users into the reinforcement learning rollouts.” (line 74-75) However, similar ideas have been explored in prior work. For example:
   a. LMRL-Gym [1]: Presents a framework for developing and evaluating human simulators, and integrates them into the LLM and RL training loop
   b. Offline RL with Simulated Users [2,3]: Previous studies have incorporated us

Reviewer 04 · Rating 2 · Confidence 4

Strengths

1. The focus on multi-turn and user-interactive settings addresses essential aspects for developing more autonomous agents, introducing greater complexity to the learning process.
2. The paper includes evaluations on multiple benchmarks, such as $\tau^2$-Bench, BFCL-V3 Multi-Turn, and ACEBench.

Weaknesses

1. Although the paper claims that MUA-RL represents a novel reinforcement learning framework, its novelty is not convincingly demonstrated. The task formulation, reward design, and training methodology appear relatively conventional, and the distinctions from existing approaches (such as [1,2,3]) are not clearly presented.
2. $\tau$-bench is used as both the training and test dataset.
3. The paper introduces two agentic data synthesis pipelines, but their effectiveness is not validated

Code & Models

Models

Datasets

zzwkk/MUA-RL-Dataset
dataset · 31 dl
