ToolSandbox: A Stateful, Conversational, Interactive Evaluation Benchmark for LLM Tool Use Capabilities

Jiarui Lu; Thomas Holleis; Yizhe Zhang; Bernhard Aumayer; Feng Nan; Felix Bai; Shuang Ma; Shen Ma; Mengyu Li; Guoli Yin; Zirui Wang; Ruoming Pang

arXiv:2408.04682 · cs.CL · April 18, 2025 · 2 cites

Open Access 1 Repo 1 Video

TL;DR

ToolSandbox is a stateful, conversational benchmark for evaluating the tool-use capabilities of large language models. It reveals a significant performance gap between open source and proprietary models and includes tasks that remain challenging even for state-of-the-art models.

Contribution

The paper introduces an evaluation framework with stateful tool execution, a built-in user simulator for on-policy conversational evaluation, and dynamic evaluation of intermediate and final milestones, going beyond previous stateless or off-policy assessments.

Findings

01

There is a significant performance gap between open source and proprietary models.

02

Complex tasks such as State Dependency, Canonicalization and Insufficient Information remain challenging even for the most capable SOTA LLMs.

03

ToolSandbox offers new insights into LLM tool-use capabilities.

Abstract

Recent advances in large language models (LLMs) have sparked growing research interest in tool-assisted LLMs for solving real-world challenges, which calls for comprehensive evaluation of tool-use capabilities. Previous works focused on evaluating over stateless web services (RESTful APIs), based on either a single-turn user prompt or an off-policy dialog trajectory. In contrast, ToolSandbox includes stateful tool execution, implicit state dependencies between tools, a built-in user simulator supporting on-policy conversational evaluation, and a dynamic evaluation strategy for intermediate and final milestones over an arbitrary trajectory. We show that open source and proprietary models have a significant performance gap, and that complex tasks such as State Dependency, Canonicalization and Insufficient Information, as defined in ToolSandbox, are challenging even for the most capable SOTA LLMs, providing brand-new…
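
To make "stateful tool execution", "implicit state dependencies" and "milestones over an arbitrary trajectory" more concrete, here is a minimal Python sketch. It is an illustration only and does not reflect the API or scoring of the apple/toolsandbox repository: the names `WorldState`, `set_cellular`, `send_message`, `run_episode` and `milestone_score` are all hypothetical, and the scoring is a simplified in-order predicate match assumed for this example.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class WorldState:
    """Hypothetical shared state that every tool call reads or mutates."""
    cellular_enabled: bool = False
    sent_messages: list = field(default_factory=list)


def set_cellular(state, enabled):
    """Tool that changes the world state."""
    state.cellular_enabled = enabled
    return f"cellular_enabled={enabled}"


def send_message(state, to, body):
    """Tool with an implicit state dependency on set_cellular."""
    if not state.cellular_enabled:
        raise RuntimeError("cellular service is off")
    state.sent_messages.append((to, body))
    return "sent"


def run_episode(calls):
    """Execute tool calls in order and record a state snapshot after each one."""
    state = WorldState()
    trajectory = []
    for tool, kwargs in calls:
        try:
            tool(state, **kwargs)
        except RuntimeError:
            pass  # in a real harness the error would be shown to the agent
        trajectory.append(copy.deepcopy(state))
    return trajectory


def milestone_score(trajectory, milestones):
    """Fraction of milestone predicates satisfied, in order, along the trajectory."""
    if not milestones:
        return 1.0
    idx = 0
    for snapshot in trajectory:
        # One snapshot may satisfy several consecutive milestones.
        while idx < len(milestones) and milestones[idx](snapshot):
            idx += 1
    return idx / len(milestones)


# Assumed milestones for the task "text Bob": service on, then message sent.
MILESTONES = [
    lambda s: s.cellular_enabled,
    lambda s: len(s.sent_messages) == 1,
]

naive = run_episode([(send_message, {"to": "Bob", "body": "hi"})])
aware = run_episode([
    (set_cellular, {"enabled": True}),
    (send_message, {"to": "Bob", "body": "hi"}),
])
```

In this sketch, an agent that calls `send_message` directly reaches no milestone, while one that first resolves the dependency by enabling cellular service reaches both. Because scoring looks at state snapshots rather than a fixed call sequence, any trajectory that passes through the milestone states in order receives credit.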

Code & Models

Repositories

apple/toolsandbox
Official

Videos

ToolSandbox: A Stateful, Conversational, Interactive Evaluation Benchmark for LLM Tool Use Capabilities

Taxonomy

Topics: Library Science and Information Systems · Semantic Web and Ontologies · Mathematics, Computing, and Information Processing