ToolACE-MT: Non-Autoregressive Generation for Agentic Multi-Turn Interaction

Xingshan Zeng; Weiwen Liu; Lingzhi Wang; Liangyou Li; Fei Mi; Yasheng Wang; Lifeng Shang; Xin Jiang; Qun Liu

arXiv:2508.12685 · cs.CL · February 16, 2026

TL;DR

ToolACE-MT introduces a non-autoregressive, multi-stage framework for efficient and high-quality multi-turn agentic dialogue generation, reducing reliance on costly autoregressive methods and enhancing data construction for tool-augmented LLMs.

Contribution

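The paper presents a non-autoregressive iterative generation approach with three stages (coarse-grained initialization, iterative refinement, and offline verification), aimed at improving both the efficiency and the quality of agentic multi-turn dialogue data creation.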

Findings

1. Enables efficient multi-turn dialogue generation
2. Produces high-quality, coherent dialogues
3. Reduces computational costs compared to autoregressive methods

Abstract

Agentic task-solving with Large Language Models (LLMs) requires multi-turn, multi-step interactions, often involving complex function calls and dynamic user-agent exchanges. Existing simulation-based data generation methods for such scenarios rely heavily on costly autoregressive interactions between multiple LLM agents, thereby compromising the practical efficiency of agentic data generation. In this paper, we propose ToolACE-MT, a novel Non-Autoregressive Iterative Generation framework for constructing high-quality multi-turn agentic dialogues. ToolACE-MT generates full conversational trajectories through three stages: coarse-grained initialization, iterative refinement, and offline verification. The initialization phase builds a structurally complete yet semantically coarse dialogue skeleton; the iterative refinement phase introduces realistic complexities and continued refinement…
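The three stages named in the abstract can be pictured with a toy sketch. This is not the authors' implementation: every name below (`Turn`, `initialize_skeleton`, `refine`, `verify`, `toy_rewrite`) is hypothetical, the rewriter is a plain function standing in for an LLM call, and the verification rules are illustrative assumptions. The only point is the control flow: the whole trajectory exists from the start, and each refinement pass sees all turns, including later ones.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Turn:
    role: str      # "user", "assistant", or "tool"
    content: str   # placeholder text in the skeleton, rewritten during refinement


def initialize_skeleton(task: str, num_exchanges: int) -> list[Turn]:
    """Stage 1 (assumed shape): lay out every turn at once with coarse placeholders."""
    turns: list[Turn] = []
    for i in range(num_exchanges):
        turns += [
            Turn("user", f"<user request {i} about {task}>"),
            Turn("assistant", f"<function call {i}>"),
            Turn("tool", f"<function result {i}>"),
            Turn("assistant", f"<final answer {i}>"),
        ]
    return turns


# A rewriter receives the index of the turn to rewrite plus the full trajectory.
Rewriter = Callable[[int, list[Turn]], str]


def refine(turns: list[Turn], rewrite: Rewriter, rounds: int) -> list[Turn]:
    """Stage 2 (assumed shape): rewrite every turn with the whole dialogue visible."""
    for _ in range(rounds):
        turns = [Turn(t.role, rewrite(i, turns)) for i, t in enumerate(turns)]
    return turns


def verify(turns: list[Turn]) -> list[str]:
    """Stage 3 (assumed rules): offline checks that need no further generation."""
    problems: list[str] = []
    if not turns or turns[0].role != "user":
        problems.append("dialogue must start with a user turn")
    for i, t in enumerate(turns):
        if t.role == "tool" and (i == 0 or turns[i - 1].role != "assistant"):
            problems.append(f"turn {i}: tool result without a preceding assistant call")
        if "<" in t.content or ">" in t.content:
            problems.append(f"turn {i}: unresolved placeholder")
    return problems


def toy_rewrite(index: int, turns: list[Turn]) -> str:
    # Stand-in for an LLM call: it only strips the placeholder brackets.
    return turns[index].content.strip("<>")


skeleton = initialize_skeleton("flight booking", num_exchanges=2)
dialogue = refine(skeleton, toy_rewrite, rounds=1)
```

In this sketch the unrefined skeleton fails verification on every turn because of its placeholders, while the refined dialogue passes. An autoregressive multi-agent simulation would instead produce each turn only after all earlier turns had been generated.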

Peer Reviews

Decision: ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 3

Strengths

1. NAIG produces more coherent, context-aware, and complex multi-turn dialogues by combining structured initialization with iterative refinement.
2. The method reduces reliance on expensive multi-agent autoregressive interactions, lowering API costs.
3. Offline verification enhances data reliability by catching structural and semantic inconsistencies that are hard to detect during generation.
4. Models trained on NAIG-generated data significantly outperform both MAS-based on multi-turn benc…

Weaknesses

1. NAIG generation quality significantly drops when using smaller models (e.g., GPT-4o-mini), limiting its applicability in low-resource settings.
2. Models trained on NAIG-generated data may over-optimize for multi-turn planning, potentially at the expense of performance on single-turn or one-shot queries.
3. The experiments only compare NAIG against a single model despite the diversity of baselines available for tool-calling dialogue generation, raising concerns about the sufficiency and gener…

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. This work identifies three important challenges in simulation-based data generation for agentic multi-turn interactions.
2. The dataset has the potential to significantly contribute to research on agentic LLMs, provided it can be publicly released.
3. The paper is clearly written, and the methodology is described in sufficient detail.

Weaknesses

1. The multi-agent simulation process does not incorporate offline verification, and such verification is not an inherent requirement for non-autoregressive methods. This raises concerns about the fairness of comparing NAIG with methods that include offline verification.
2. The improvement of NAIG over Multi-Agent Simulation appears to be marginal. Considering Weakness 1, the core component of the non-autoregressive framework (i.e., NAIG without offline verification) does not seem to provide a m…

Reviewer 03 · Rating 6 · Confidence 4

Strengths

- The paper addresses an important problem: generating realistic, tool-augmented agentic dialogues for training LLMs with function-calling and multi-turn behaviour.
- The framework is clearly presented. The three-stage design offers a practical pipeline that balances structure, refinement and quality control.
- The experimental evaluation is reasonably broad. Multiple benchmarks, different model backbones, cost vs quality trade-off are all considered.
- The authors attempt to quantify cost savin…

Weaknesses

1. Lack of human evaluation and direct quantitative assessment of the generated data itself. The paper focuses on downstream model performance improvements, but fails to report metrics such as diversity (Distinct-n), structural repetition (how many dialogue skeleton templates were reused), naturalness/human preference, or other data-intrinsic quality measures.
2. Cost evaluation metric is limited. While the paper reports that API calls drop from ≈ 275k (MAS) to ≈ 188k (their method), there is no…

Code & Models

Models

Team-ACE/ToolACE-2.5-Llama-3.1-8B (Hugging Face model · 54 downloads · 4 likes)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Multi-Agent Systems and Negotiation