GTA: A Benchmark for General Tool Agents
Jize Wang, Zerun Ma, Yining Li, Songyang Zhang, Cailian Chen, Kai Chen, Xinyi Le

TL;DR
GTA is a comprehensive benchmark designed to evaluate the real-world tool-use capabilities of large language models using authentic user queries, deployed tools, and multimodal inputs, revealing current limitations and guiding future improvements.
Contribution
The paper introduces GTA, a novel benchmark with real user queries, deployed tools, and multimodal inputs to assess LLMs' practical tool-use abilities in realistic scenarios.
Findings
GPT-4 completes less than 50% of tasks
Most LLMs achieve below 25% success rate
Reveals bottlenecks in current LLM tool-use capabilities
Abstract
Significant focus has been placed on integrating large language models (LLMs) with various tools in developing general-purpose agents. This poses a challenge to LLMs' tool-use capabilities. However, there are evident gaps between existing tool-use evaluations and real-world scenarios. Current evaluations often use AI-generated queries, single-step tasks, dummy tools, and text-only interactions, failing to reveal the agents' real-world problem-solving abilities effectively. To address this, we propose GTA, a benchmark for General Tool Agents, featuring three main aspects: (i) Real user queries: human-written queries with simple real-world objectives but implicit tool-use, requiring the LLM to reason the suitable tools and plan the solution steps. (ii) Real deployed tools: an evaluation platform equipped with tools across perception, operation, logic, and creativity categories to evaluate…
Taxonomy
Topics: Multi-Agent Systems and Negotiation · Business Process Modeling and Analysis
Methods: Attention Is All You Need · Byte Pair Encoding · Layer Normalization · Linear Layer · Label Smoothing · Adam · Dropout · Multi-Head Attention · Dense Connections · Softmax
