ToolACE-MT: Non-Autoregressive Generation for Agentic Multi-Turn Interaction
Xingshan Zeng, Weiwen Liu, Lingzhi Wang, Liangyou Li, Fei Mi, Yasheng Wang, Lifeng Shang, Xin Jiang, Qun Liu

TL;DR
ToolACE-MT introduces a non-autoregressive, multi-stage framework for efficient and high-quality multi-turn agentic dialogue generation, reducing reliance on costly autoregressive methods and enhancing data construction for tool-augmented LLMs.
Contribution
It presents a novel non-autoregressive iterative generation approach with three stages, improving efficiency and quality in agentic multi-turn dialogue data creation.
Findings
Enables efficient multi-turn dialogue generation
Produces high-quality, coherent dialogues
Reduces computational costs compared to autoregressive methods
Abstract
Agentic task-solving with Large Language Models (LLMs) requires multi-turn, multi-step interactions, often involving complex function calls and dynamic user-agent exchanges. Existing simulation-based data generation methods for such scenarios rely heavily on costly autoregressive interactions between multiple LLM agents, thereby compromising the practical efficiency of agentic data generation. In this paper, we propose ToolACE-MT, a novel Non-Autoregressive Iterative Generation framework for constructing high-quality multi-turn agentic dialogues. ToolACE-MT generates full conversational trajectories through three stages: coarse-grained initialization, iterative refinement, and offline verification. The initialization phase builds a structurally complete yet semantically coarse dialogue skeleton; the iterative refinement phase introduces realistic complexities and continued refinement…
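The three stages named in the abstract can be illustrated with a toy pipeline. This is a minimal sketch under loud assumptions, not the authors' implementation: in ToolACE-MT each stage is driven by an LLM, whereas here the stages are plain Python functions. All names (`Turn`, `initialize_skeleton`, `add_clarification`, `fill_placeholders`, `refine`, `verify`, `generate`) are hypothetical and do not come from the paper. The sketch only shows the control flow: draft a whole dialogue at once, edit it as a whole over several rounds, then filter it with offline checks.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Turn:
    role: str      # "user", "assistant", or "tool"
    content: str


Dialogue = List[Turn]


def initialize_skeleton(task: str) -> Dialogue:
    """Stage 1 (assumed form): structurally complete, semantically coarse."""
    return [
        Turn("user", f"<request about {task}>"),
        Turn("assistant", "<function call>"),
        Turn("tool", "<tool result>"),
        Turn("assistant", "<final answer>"),
    ]


def add_clarification(dialogue: Dialogue) -> Dialogue:
    """Hypothetical complexity edit: insert one clarification exchange."""
    if any(t.content.startswith("Could you clarify") for t in dialogue):
        return dialogue  # already applied; keep the edit idempotent
    extra = [
        Turn("assistant", "Could you clarify the details?"),
        Turn("user", "Sure, here they are."),
    ]
    return dialogue[:1] + extra + dialogue[1:]


def fill_placeholders(dialogue: Dialogue) -> Dialogue:
    """Hypothetical refinement edit: turn placeholders into concrete text."""
    return [Turn(t.role, t.content.strip("<>")) for t in dialogue]


def refine(dialogue: Dialogue,
           edits: List[Callable[[Dialogue], Dialogue]],
           rounds: int) -> Dialogue:
    """Stage 2 (assumed form): apply whole-dialogue edits for several rounds."""
    for _ in range(rounds):
        for edit in edits:
            dialogue = edit(dialogue)
    return dialogue


def verify(dialogue: Dialogue) -> bool:
    """Stage 3 (assumed form): offline structural and content checks."""
    roles = [t.role for t in dialogue]
    if not roles or roles[0] != "user" or roles[-1] != "assistant":
        return False
    # A tool result must directly follow an assistant turn.
    for prev, cur in zip(roles, roles[1:]):
        if cur == "tool" and prev != "assistant":
            return False
    # No unfilled placeholders may remain.
    return all("<" not in t.content for t in dialogue)


def generate(task: str, rounds: int = 2) -> Optional[Dialogue]:
    """Run all three stages; return None if verification rejects the sample."""
    draft = initialize_skeleton(task)
    refined = refine(draft, [add_clarification, fill_placeholders], rounds)
    return refined if verify(refined) else None
```

Calling `generate("weather lookup")` returns a six-turn dialogue here, while the raw skeleton alone fails `verify` because its placeholders are still unfilled. The design point the sketch tries to capture is that no turn is produced conditioned on a live exchange between simulated agents; the full trajectory exists from the first stage and is only revised afterwards.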
Peer Reviews
Decision: ICLR 2026 Poster
1. NAIG produces more coherent, context-aware, and complex multi-turn dialogues by combining structured initialization with iterative refinement. 2. Their method reduces reliance on expensive multi-agent autoregressive interactions, lowering API costs. 3. Offline verification enhances data reliability by catching structural and semantic inconsistencies that are hard to detect during generation. 4. Models trained on NAIG-generated data significantly outperform both MAS-based on multi-turn benc
1. NAIG generation quality significantly drops when using smaller models (e.g., GPT-4o-mini), limiting its applicability in low-resource settings. 2. Models trained on NAIG-generated data may over-optimize for multi-turn planning, potentially at the expense of performance on single-turn or one-shot queries. 3. The experiments only compare NAIG against a single model despite the diversity of baselines available for tool-calling dialogue generation, raising concerns about the sufficiency and gener
1. This work identifies three important challenges in simulation-based data generation for agentic multi-turn interactions. 2. The dataset has the potential to significantly contribute to research on agentic LLMs, provided it can be publicly released. 3. The paper is clearly written, and the methodology is described in sufficient detail.
1. The multi-agent simulation process does not incorporate offline verification, and such verification is not an inherent requirement for non-autoregressive methods. This raises concerns about the fairness of comparing NAIG with methods that include offline verification. 2. The improvement of NAIG over Multi-Agent Simulation appears to be marginal. Considering Weakness 1, the core component of the non-autoregressive framework (i.e., NAIG without offline verification) does not seem to provide a m
- The paper addresses an important problem: generating realistic, tool-augmented agentic dialogues for training LLMs with function-calling and multi-turn behaviour. - The framework is clearly presented. The three-stage design offers a practical pipeline that balances structure, refinement, and quality control. - The experimental evaluation is reasonably broad. Multiple benchmarks, different model backbones, and the cost vs. quality trade-off are all considered. - The authors attempt to quantify cost savin
1. Lack of human evaluation and direct quantitative assessment of the generated data itself. The paper focuses on downstream model performance improvements, but fails to report metrics such as diversity (Distinct-n), structural repetition (how many dialogue skeleton templates were reused), naturalness/human preference, or other data-intrinsic quality measures. 2. Cost evaluation metric is limited. While the paper reports that API calls drop from ≈ 275k (MAS) to ≈ 188k (their method), there is no
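For readers unfamiliar with the Distinct-n metric this review asks for, the following is a minimal sketch of its usual definition: the number of unique n-grams divided by the total number of n-grams in a corpus. The whitespace tokenizer, lowercasing, and the function names `distinct_n` and `relative_reduction` are assumptions made for illustration; the paper does not report this metric, so no values for its dataset are implied. The second helper just works out the relative reduction implied by the API-call counts quoted in the review.

```python
from typing import List


def distinct_n(texts: List[str], n: int) -> float:
    """Unique n-grams divided by total n-grams over all texts.

    Assumes simple lowercased whitespace tokenization.
    """
    unique = set()
    total = 0
    for text in texts:
        tokens = text.lower().split()
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0


def relative_reduction(before: float, after: float) -> float:
    """Fraction by which `after` is smaller than `before`."""
    return (before - after) / before


# Toy input, not data from the paper.
sample = ["book a flight to paris", "book a hotel in paris"]
unigram_diversity = distinct_n(sample, 1)
bigram_diversity = distinct_n(sample, 2)

# API-call counts as quoted in the review (approximate figures).
api_call_saving = relative_reduction(275_000, 188_000)
```

With the quoted counts, `api_call_saving` comes out at roughly 0.32, i.e. about a 32% drop in API calls. Higher Distinct-n values indicate less repeated phrasing, which is why reviewers treat it as a data-intrinsic diversity signal alongside downstream benchmark scores.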
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multi-Agent Systems and Negotiation
