ToolSandbox: A Stateful, Conversational, Interactive Evaluation Benchmark for LLM Tool Use Capabilities
Jiarui Lu, Thomas Holleis, Yizhe Zhang, Bernhard Aumayer, Feng Nan, Felix Bai, Shuang Ma, Shen Ma, Mengyu Li, Guoli Yin, Zirui Wang, Ruoming Pang

TL;DR
ToolSandbox provides a comprehensive, stateful, conversational benchmark for evaluating large language models' tool use capabilities, revealing significant performance gaps and challenging tasks even for state-of-the-art models.
Contribution
It introduces a novel, dynamic evaluation framework with stateful tool execution and on-policy dialogue support, advancing beyond previous stateless or off-policy assessments.
Findings
There is a significant performance gap between open-source and proprietary models.
Complex tasks remain challenging for SOTA LLMs.
ToolSandbox offers new insights into LLM tool-use capabilities.
Abstract
Recent advances in large language models (LLMs) have sparked growing research interest in tool-assisted LLMs that solve real-world challenges, which calls for comprehensive evaluation of tool-use capabilities. While previous works focused on evaluating over stateless web services (RESTful APIs), based on either a single-turn user prompt or an off-policy dialog trajectory, ToolSandbox includes stateful tool execution, implicit state dependencies between tools, a built-in user simulator supporting on-policy conversational evaluation, and a dynamic evaluation strategy for intermediate and final milestones over an arbitrary trajectory. We show that open-source and proprietary models have a significant performance gap, and that complex tasks like State Dependency, Canonicalization and Insufficient Information defined in ToolSandbox are challenging even for the most capable SOTA LLMs, providing brand-new…
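To make "stateful tool execution", "implicit state dependencies" and "milestones over an arbitrary trajectory" concrete, here is a minimal Python sketch. It is an illustration only, not ToolSandbox's actual API or scoring method: the names `WorldState`, `set_cellular`, `send_message`, `run` and `milestones_reached` are all hypothetical, and the milestone check is a deliberately simplified in-order match over state snapshots.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class WorldState:
    """Hypothetical shared state that tools read and mutate."""
    cellular_on: bool = False
    sent_messages: list = field(default_factory=list)


class ToolError(Exception):
    """Raised when a tool's implicit precondition on the state is not met."""


def set_cellular(state: WorldState, on: bool) -> None:
    state.cellular_on = on


def send_message(state: WorldState, to: str, text: str) -> None:
    # Implicit state dependency: sending only works while cellular is on.
    if not state.cellular_on:
        raise ToolError("cellular service is off")
    state.sent_messages.append((to, text))


def run(calls):
    """Execute tool calls in order and snapshot the state after each call."""
    state = WorldState()
    trajectory = []
    for tool, kwargs in calls:
        try:
            tool(state, **kwargs)
        except ToolError:
            pass  # a failed call leaves the state unchanged
        trajectory.append(copy.deepcopy(state))
    return trajectory


def milestones_reached(trajectory, milestones) -> bool:
    """True if every milestone predicate is satisfied, in order, by some
    snapshot; other snapshots in between are ignored."""
    idx = 0
    for snapshot in trajectory:
        while idx < len(milestones) and milestones[idx](snapshot):
            idx += 1
    return idx == len(milestones)


# Assumed example task: "text hi to +1555", starting with cellular off.
MILESTONES = [
    lambda s: s.cellular_on,              # intermediate milestone
    lambda s: len(s.sent_messages) == 1,  # final milestone
]

good = run([
    (set_cellular, {"on": True}),
    (send_message, {"to": "+1555", "text": "hi"}),
])
bad = run([(send_message, {"to": "+1555", "text": "hi"})])
```

In this sketch, `milestones_reached(good, MILESTONES)` holds because the agent first resolves the state dependency, while `bad` fails: the message tool is called with cellular service still off, so the state never changes.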
Taxonomy
Topics: Library Science and Information Systems · Semantic Web and Ontologies · Mathematics, Computing, and Information Processing
