GTA: A Benchmark for General Tool Agents

Jize Wang; Zerun Ma; Yining Li; Songyang Zhang; Cailian Chen; Kai Chen; Xinyi Le

arXiv:2407.08713 · cs.CL · November 25, 2024 · 2 cites

TL;DR

GTA is a comprehensive benchmark designed to evaluate the real-world tool-use capabilities of large language models using authentic user queries, deployed tools, and multimodal inputs, revealing current limitations and guiding future improvements.

Contribution

The paper introduces GTA, a novel benchmark with real user queries, deployed tools, and multimodal inputs to assess LLMs' practical tool-use abilities in realistic scenarios.

Findings

01 GPT-4 completes less than 50% of the tasks.

02 Most LLMs achieve a success rate below 25%.

03 The evaluation reveals bottlenecks in the tool-use capabilities of current LLMs.

Abstract

Significant focus has been placed on integrating large language models (LLMs) with various tools in developing general-purpose agents. This poses a challenge to LLMs' tool-use capabilities. However, there are evident gaps between existing tool-use evaluations and real-world scenarios. Current evaluations often use AI-generated queries, single-step tasks, dummy tools, and text-only interactions, failing to reveal the agents' real-world problem-solving abilities effectively. To address this, we propose GTA, a benchmark for General Tool Agents, featuring three main aspects: (i) Real user queries: human-written queries with simple real-world objectives but implicit tool use, requiring the LLM to reason about which tools are suitable and plan the solution steps. (ii) Real deployed tools: an evaluation platform equipped with tools across perception, operation, logic, and creativity categories to evaluate…

Code & Models

Repositories

open-compass/GTA · PyTorch · Official

Datasets

Jize1/GTA · dataset · 57 downloads

Videos

GTA: A Benchmark for General Tool Agents · SlidesLive

Taxonomy

Topics: Multi-Agent Systems and Negotiation · Business Process Modeling and Analysis

Methods: Attention Is All You Need · Byte Pair Encoding · Layer Normalization · Linear Layer · Label Smoothing · Adam · Dropout · Multi-Head Attention · Dense Connections · Softmax