SynthTools: A Framework for Scaling Synthetic Tools for Agent Development
Tommaso Castellani, Naimeng Ye, Daksh Mittal, Thomson Yen, Hongseok Namkoong

TL;DR
SynthTools is a scalable framework that automatically generates diverse and realistic synthetic tools for training and evaluating AI agents in complex, long-horizon tasks, overcoming limitations of real-world APIs.
Contribution
It introduces a novel framework with tool generation, simulation, and auditing components, enabling large-scale, diverse, and reliable synthetic tool ecosystems for AI agent development.
Findings
SynthTools can produce toolsets covering twice as many domains as prior work.
Tool simulation and audit achieve 94% and 99% accuracy respectively.
Constructed downstream tasks challenge state-of-the-art models.
Abstract
AI agents increasingly rely on external tools to solve complex, long-horizon tasks. Advancing such agents requires reproducible evaluation and large-scale training in controllable, diverse, and realistic tool-use environments. However, real-world APIs are limited in availability, domain coverage, and stability, often requiring access keys and imposing rate limits, which render them impractical for stable evaluation or scalable training. To address these challenges, we introduce SynthTools, a flexible and scalable framework for generating synthetic tool ecosystems. Our framework consists of three core components: Tool Generation for automatic and scalable creation of diverse tools, Tool Simulation to emulate realistic tool behaviors, and Tool Audit to ensure correctness and consistency of tool simulation. To illustrate its scalability, we show that SynthTools can readily produce toolsets…
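As a rough illustration of how the three components named in the abstract could fit together, the sketch below wires a generation step, a simulation step, and an audit step into one loop. It is a toy under stated assumptions, not the authors' implementation: `ToolSpec`, `generate_tools`, `simulate_call`, and `audit_response` are hypothetical names, and simple rule-based stand-ins replace the LLM-driven components the paper describes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolSpec:
    """Hypothetical tool description: a name, a domain, and required parameters."""
    name: str
    domain: str
    required_params: tuple


def generate_tools(domains):
    """Stand-in for Tool Generation: one hard-coded lookup tool per domain.

    Assumption: in the framework itself the tools are proposed automatically;
    here they are fixed so the example stays deterministic.
    """
    return [ToolSpec(f"{d}_lookup", d, ("query",)) for d in domains]


def simulate_call(tool, args):
    """Stand-in for Tool Simulation: a rule-based response instead of an LLM."""
    missing = [p for p in tool.required_params if p not in args]
    if missing:
        return {"status": "error", "message": f"missing parameters: {missing}"}
    return {"status": "ok", "result": f"{tool.name} handled {args['query']!r}"}


def audit_response(tool, args, response):
    """Stand-in for Tool Audit: the status must match whether the call was valid."""
    call_is_valid = all(p in args for p in tool.required_params)
    expected = "ok" if call_is_valid else "error"
    return response.get("status") == expected


tools = generate_tools(["weather", "finance"])
# One valid call and one call that omits the required "query" parameter.
calls = [(tools[0], {"query": "Paris"}), (tools[1], {})]
audit_results = [audit_response(t, a, simulate_call(t, a)) for t, a in calls]
print(audit_results)  # [True, True]
```

Both calls pass the audit because the simulated status agrees with the call's validity. A simulator that returned an error for a valid call would be flagged, which is the kind of inconsistency an audit step is meant to catch.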
Peer Reviews
Decision · ICLR 2026 Conference Withdrawn Submission
The three-component pipeline (generation, simulation, audit) is well-motivated and addresses key challenges systematically. The validation against ACEBench (94% accuracy) and manual stress testing (99% judge accuracy) provides concrete evidence of reliability.
The paper's central claim is that agents trained on synthetic tools can transfer learned capabilities to real-world interfaces. While cited references (Li et al., 2023; Kimi, 2025; Sullivan et al., 2025) provide "supporting evidence," the paper itself conducts no experiments demonstrating this transfer. This is a critical gap—the framework's utility fundamentally depends on whether synthetic tool training actually improves real-world agent performance. I would suggest the authors include experim
They tackle a key component of modern LLM agent development: the creation of sufficiently diverse tools. The approach is sound, with many internal checks and validators. The final analysis also shows that the generated set of tools is diverse (a key aspect).
It is not totally clear that scaling the number of tools is the right approach for LLM agents. Valid alternatives are minimizing the tools (such as only computer use, bash, or web search) or creating tools on the fly as needed for one domain. This is just on the motivation side and does not affect the quality of the work itself. The biggest question I have is "so what?". This paper shows how to create a large and diverse set of tools. What is not clear is how this is valuable in the end. One obvio
## Well-motivated, simple design
SynthTools follows a reasonable method of using LLMs to generate ideas at scale, validate that those ideas align with some values of diversity, and simulate the tools. Using LLMs to simulate tools rather than relying on real environments or even engineered simulations is well-motivated.
## Evaluation of each SynthTools component
The authors designed appropriate evaluation for their system, and the results provide assurance in the quality of SynthTools. In the simula
## Missing demonstrated value
While the motivation for SynthTools is clear, as a method for making environments with various tools and constraints to train AI systems in long-horizon tool-using tasks, these results are missing from the paper itself. Evaluation is on the correctness of subcomponents of the system, but there is no evaluation on training an actual AI system in the ecosystems generated by SynthTools. Even within the existing evaluation of SynthTools components, there seems to be no
1. Scalability and diversity: it outperforms existing benchmarks in both the number of domains and tools per domain, enabling training on synthetic data at a scale previously unachievable.
2. Reliability and stability: a robust, multi-stage quality control process, featuring a tool simulator with 93.6% accuracy and an LLM judge validated at 99% accuracy with a 0% false-positive rate over 300 stress test cases, which is critical for trustworthy evaluation.
3. Hierarchical generation method:
1. There is a trade-off: the framework achieves scalability by abstracting away implementation-level fidelity, potentially leaving LLMs unprepared for real-world execution challenges such as network instability, API updates, or other errors. I would like to see more discussion of the gap between synthetic data and real-world APIs.
2. As mentioned in the paper, the systemic risk of its LLM-as-judge cannot be ignored. The stress test cases do not eliminate my concern about the final quality of the synthetic data. Also
Taxonomy
Topics: Artificial Intelligence in Games · Multi-Agent Systems and Negotiation · AI-based Problem Solving and Planning
