EvolveTool-Bench: Evaluating the Quality of LLM-Generated Tool Libraries as Software Artifacts
Alibek T. Kaliyev, Artem Maryanskyy

TL;DR
EvolveTool-Bench is a diagnostic benchmark for evaluating the quality of LLM-generated tool libraries, emphasizing software engineering metrics beyond task success.
Contribution
This work presents a new diagnostic benchmark with quality metrics for LLM-created tool libraries, addressing a gap in evaluating software artifacts beyond task completion.
Findings
Systems with similar task success differ significantly in library health.
Evaluation based solely on task completion misses critical software quality issues.
Tool library quality metrics reveal risks invisible to traditional benchmarks.
Abstract
Modern LLM agents increasingly create their own tools at runtime -- from Python functions to API clients -- yet existing benchmarks evaluate them almost exclusively by downstream task completion. This is analogous to judging a software engineer only by whether their code runs, ignoring redundancy, regression, and safety. We introduce EvolveTool-Bench, a diagnostic benchmark for LLM-generated tool libraries in software engineering workflows. Across three domains requiring actual tool execution (proprietary data formats, API orchestration, and numerical computation), we define library-level software quality metrics -- reuse, redundancy, composition success, regression stability, and safety -- alongside a per-tool Tool Quality Score measuring correctness, robustness, generality, and code quality. In the first head-to-head comparison of code-level and strategy-level tool evolution (ARISE…
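To make the idea of library-level metrics concrete, here is a minimal sketch of how two of the named metrics, reuse and redundancy, might be computed from a log of solved tasks. The abstract does not give the benchmark's formulas, so everything here is an assumption made for illustration: the `TaskRecord` schema, the function names, and both definitions (reuse as the share of calls that hit a tool from an earlier task, redundancy as the share of tools whose whitespace-normalized source repeats an earlier tool).

```python
from dataclasses import dataclass, field


@dataclass
class TaskRecord:
    """One solved task (hypothetical schema, not taken from the paper)."""
    created: dict[str, str] = field(default_factory=dict)  # tool name -> source code
    called: list[str] = field(default_factory=list)        # tool names invoked


def reuse_rate(log: list[TaskRecord]) -> float:
    """Share of tool calls that hit a tool created in an earlier task (assumed definition)."""
    known: set[str] = set()
    reused = total = 0
    for task in log:
        for name in task.called:
            total += 1
            reused += name in known
        # Tools created in this task only count as reusable for later tasks.
        known.update(task.created)
    return reused / total if total else 0.0


def redundancy_rate(log: list[TaskRecord]) -> float:
    """Share of created tools whose whitespace-normalized source repeats an earlier tool."""
    seen: set[str] = set()
    duplicates = total = 0
    for task in log:
        for source in task.created.values():
            key = " ".join(source.split())  # collapse all whitespace runs
            total += 1
            duplicates += key in seen
            seen.add(key)
    return duplicates / total if total else 0.0
```

Two systems could solve the same tasks yet score very differently on measures like these, which is the kind of gap the findings describe. Exact-match deduplication is a deliberately crude stand-in; a real redundancy measure would likely need to compare behavior or semantics.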
