Tool-Genesis: A Task-Driven Tool Creation Benchmark for Self-Evolving Language Agent
Bowei Xia, Mengkang Hu, Shijian Wang, Jiarui Jin, Wenxiang Jiao, Yuan Lu, Kexin Li, Ping Luo

TL;DR
Tool-Genesis is a diagnostic benchmark that evaluates self-evolving language agents' ability to create and utilize tools from abstract requirements, highlighting current limitations and guiding future improvements.
Contribution
Tool-Genesis introduces a benchmark that assesses agents' tool-creation capabilities without predefined specifications, emphasizing diagnostic evaluation rather than downstream performance alone.
Findings
State-of-the-art models struggle to generate precise tool interfaces.
Minor initial flaws in tool synthesis lead to significant downstream performance drops.
Tool-Genesis provides a multi-dimensional assessment of agent tool creation abilities.
Abstract
Research on self-evolving language agents has accelerated, drawing increasing attention to their ability to create, adapt, and maintain tools from task requirements. However, existing benchmarks predominantly rely on predefined specifications, which limits scalability and hinders truly autonomous evolution. While recent studies attempt to dynamically generate tools, they primarily emphasize downstream performance, resulting in a "black-box" evaluation that makes it difficult to attribute failures to specific causes. To address this, we propose Tool-Genesis, a diagnostic benchmark designed to quantify agent capabilities across multiple dimensions, including interface compliance, functional correctness, and downstream utility. Tool-Genesis evaluates whether agents can construct task-relevant tools solely from abstract requirements (without preset specifications) and use them to solve…
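The abstract names three evaluation dimensions: interface compliance, functional correctness, and downstream utility. The Python below is a minimal sketch of how a layered, diagnostic check could attribute a failure to one specific stage instead of reporting a single black-box score. It is not the benchmark's implementation. `ToolSpec`, `DiagnosticReport`, `evaluate`, the exact-match interface rule, the unit-case format, and the choice to gate each stage on the previous one are all hypothetical assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional, Tuple

# Hypothetical sketch; not Tool-Genesis's actual evaluator.

@dataclass
class ToolSpec:
    """Hypothetical stand-in for a tool an agent synthesized from a requirement."""
    name: str
    params: Dict[str, type]   # declared parameter names -> expected types
    fn: Callable[..., Any]    # the tool's implementation

@dataclass
class DiagnosticReport:
    interface_ok: bool
    functional_pass_rate: float
    downstream_success: Optional[bool]   # None when that stage was never reached
    failure_stage: Optional[str] = None  # earliest stage that failed, if any

TestCase = Tuple[Dict[str, Any], Any]  # (keyword arguments, expected result)

def check_interface(tool: ToolSpec, required: Dict[str, type]) -> bool:
    # Interface compliance (assumed rule): names and types must match exactly.
    return tool.params == required

def check_functional(tool: ToolSpec, cases: List[TestCase]) -> float:
    # Functional correctness: fraction of unit cases whose output matches.
    if not cases:
        return 0.0
    passed = 0
    for kwargs, expected in cases:
        try:
            if tool.fn(**kwargs) == expected:
                passed += 1
        except Exception:
            pass  # a crashing tool counts as a failed case
    return passed / len(cases)

def evaluate(tool: ToolSpec, required: Dict[str, type], cases: List[TestCase],
             task: Callable[[ToolSpec], bool]) -> DiagnosticReport:
    # Stages are gated (an assumption) so a failure is attributed
    # to the earliest broken layer.
    if not check_interface(tool, required):
        return DiagnosticReport(False, 0.0, None, "interface")
    rate = check_functional(tool, cases)
    if rate < 1.0:
        return DiagnosticReport(True, rate, None, "functional")
    solved = task(tool)  # downstream utility: does the tool solve the task?
    return DiagnosticReport(True, rate, solved, None if solved else "downstream")

# Usage with a toy "add two integers" requirement.
REQUIRED = {"a": int, "b": int}
CASES = [({"a": 1, "b": 2}, 3), ({"a": 0, "b": 0}, 0)]

def sum_task(tool: ToolSpec) -> bool:
    return tool.fn(a=2, b=2) == 4

good = evaluate(ToolSpec("add", REQUIRED, lambda a, b: a + b), REQUIRED, CASES, sum_task)
renamed = evaluate(ToolSpec("add", {"x": int, "y": int}, lambda x, y: x + y),
                   REQUIRED, CASES, sum_task)
buggy = evaluate(ToolSpec("add", REQUIRED, lambda a, b: a - b), REQUIRED, CASES, sum_task)
print(good.failure_stage, renamed.failure_stage, buggy.failure_stage)
# prints: None interface functional
```

The renamed tool computes correct sums but is stopped at the interface stage, a toy illustration of why interface precision matters when a tool must be called by name and signature. It does not reproduce any result from the paper.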
