ToolACE: Winning the Points of LLM Function Calling

Weiwen Liu; Xu Huang; Xingshan Zeng; Xinlong Hao; Shuai Yu; Dexun Li; Shuai Wang; Weinan Gan; Zhengying Liu; Yuanqing Yu; Zezhong Wang; Yuxian Wang; Wu Ning; Yutai Hou; Bin Wang; Chuhan Wu; Xinzhi Wang; Yong Liu; Yasheng Wang; Duyu Tang; Dandan Tu; Lifeng Shang; Xin Jiang; Ruiming Tang; Defu Lian; Qun Liu; and Enhong Chen

arXiv:2409.00920·cs.LG·July 28, 2025·3 cites

TL;DR

ToolACE introduces an automated pipeline for generating high-quality, diverse function-calling data, enabling smaller models to achieve state-of-the-art performance comparable to GPT-4 on function calling tasks.

Contribution

The paper presents a novel self-evolution synthesis process and dual-layer verification system for creating comprehensive function-calling datasets, improving over existing synthetic data methods.

Findings

1. Models trained on ToolACE data outperform previous benchmarks.

2. 8B parameter models achieve state-of-the-art results on the Berkeley Function-Calling Leaderboard.

3. The approach rivals GPT-4 performance with significantly fewer parameters.

Abstract

Function calling significantly extends the application boundary of large language models, where high-quality and diverse training data is critical for unlocking this capability. However, real function-calling data is quite challenging to collect and annotate, while synthetic data generated by existing pipelines tends to lack coverage and accuracy. In this paper, we present ToolACE, an automatic agentic pipeline designed to generate accurate, complex, and diverse tool-learning data. ToolACE leverages a novel self-evolution synthesis process to curate a comprehensive API pool of 26,507 diverse APIs. Dialogs are further generated through the interplay among multiple agents, guided by a formalized thinking process. To ensure data accuracy, we implement a dual-layer verification system combining rule-based and model-based checks. We demonstrate that models trained on our synthesized data,…
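The dual-layer verification described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the `WEATHER_API` schema, the function names `rule_check`, `verify`, and `always_accept`, and the choice to run the rule layer before the model layer. The rule layer checks a generated call against its API definition; the model layer is only a placeholder where an LLM-based judge would go.

```python
from typing import Any, Callable, Dict, List

# Hypothetical API definition (illustrative only, not from the ToolACE API pool).
WEATHER_API: Dict[str, Any] = {
    "name": "get_weather",
    "parameters": {
        "city": {"type": "string", "required": True},
        "days": {"type": "integer", "required": False},
    },
}

# Map of schema type names to Python types used by the rule layer.
PY_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}


def rule_check(call: Dict[str, Any], api: Dict[str, Any]) -> List[str]:
    """Rule layer: return a list of violations for one function call."""
    if call.get("name") != api["name"]:
        return [f"unknown function: {call.get('name')}"]
    errors: List[str] = []
    args = call.get("arguments", {})
    params = api["parameters"]
    # Every required parameter must be present.
    for pname, spec in params.items():
        if spec.get("required") and pname not in args:
            errors.append(f"missing required argument: {pname}")
    # Every supplied argument must be declared and have the declared type.
    for aname, value in args.items():
        if aname not in params:
            errors.append(f"unexpected argument: {aname}")
            continue
        type_name = params[aname]["type"]
        # bool is a subclass of int in Python, so reject it for non-boolean types.
        is_stray_bool = isinstance(value, bool) and type_name != "boolean"
        if is_stray_bool or not isinstance(value, PY_TYPES[type_name]):
            errors.append(f"wrong type for {aname}: expected {type_name}")
    return errors


def verify(
    call: Dict[str, Any],
    api: Dict[str, Any],
    model_judge: Callable[[Dict[str, Any], Dict[str, Any]], bool],
) -> bool:
    """Run the cheap rule layer first; consult the model layer only if it passes."""
    if rule_check(call, api):
        return False
    return model_judge(call, api)


def always_accept(call: Dict[str, Any], api: Dict[str, Any]) -> bool:
    """Placeholder for a model-based check, e.g. an LLM judging hallucinated values."""
    return True


good = {"name": "get_weather", "arguments": {"city": "Paris", "days": 3}}
bad = {"name": "get_weather", "arguments": {"days": "3"}}
print(verify(good, WEATHER_API, always_accept))
print(rule_check(bad, WEATHER_API))
```

Running the rule layer first is a cost-driven design choice in this sketch: structural errors such as a missing required argument are caught without calling a model, so the more expensive model-based judgment is spent only on samples that are already well-formed.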

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

- Collecting new data in a scalable way is important
- The performance looks interesting as well by finetuning a small model

Weaknesses

The experiment section is not quite convincing yet. Since the authors want to show the effectiveness of their newly collected API data (which, according to Table 1, is much more comprehensive), they should compare the performance obtained by finetuning on the datasets in Table 1, e.g. ToolLLM, with that obtained by finetuning on their newly collected API data.

Reviewer 02 · Rating 6 · Confidence 4

Strengths

ToolACE introduces a unique self-evolution synthesis method, which is a systematic approach to generating diverse and complex data for function calling, addressing a key limitation in existing tool-augmented LLMs. The paper provides extensive experiments and ablation studies, comparing ToolACE-trained models with existing benchmarks on widely used datasets like BFCL and APIBank, and demonstrating superior performance.

Weaknesses

1. Please include a complete example of a prompt and LLM response in the appendix so that readers can intuitively understand how the process works in practice.
2. The paper lacks clarity and involves overly complex technical concepts. Although constructing a simulated dataset and fine-tuning the model are effective approaches to enhancing the LLM's function call capabilities, the additional concepts introduced, such as Self-Evolution, Self-Guided, Dual-Layer, and Multi-Agent, make the main idea

Reviewer 03 · Rating 6 · Confidence 2

Strengths

- The introduction of ToolACE's multi-step data generation, including evolutionary diversity and self-guided complexity, provides an innovative solution for generating complex and diverse function-calling data.
- The DLV system, combining rule-based and model-based checks, enhances the reliability of the generated data. This is a strong point, as it helps maintain data quality, which is critical for training LLMs effectively.
- The paper provides an extensive set of experiments, including comp

Weaknesses

- The evaluation scenarios are limited to synthetic function-calling tasks and benchmarks like BFCL and APIBank. The paper would benefit from more realistic evaluations or applications in real-world tool usage scenarios. This would better demonstrate ToolACE’s utility beyond controlled benchmark settings.
- The self-guided dialog generation process heavily relies on the LLM being trained to evaluate the complexity of generated data. This creates a circular dependency where the model is used bot

Taxonomy

Topics: Semantic Web and Ontologies

Methods: Attention Is All You Need · Byte Pair Encoding · Absolute Position Encodings · Softmax · Label Smoothing · Dropout · Layer Normalization · Position-Wise Feed-Forward Layer · Linear Layer · Adam