GTA-2: Benchmarking General Tool Agents from Atomic Tool-Use to Open-Ended Workflows

Jize Wang; Xuanxuan Liu; Yining Li; Songyang Zhang; Yijun Wang; Zifei Shan; Xinyi Le; Cailian Chen; Xinping Guan; Dacheng Tao

arXiv:2604.15715·cs.CL·April 20, 2026

TL;DR

GTA-2 is a hierarchical benchmark for general tool agents. It evaluates both atomic tool use and open-ended workflows using real user queries, deployed tools, and multimodal contexts, and it exposes large performance gaps in current models.

Contribution

The paper presents a benchmark built on real-world tasks together with a recursive checkpoint-based evaluation method, and uses them to show where current general-purpose agents fall short and which changes improve their performance.

Findings

01. Frontier models struggle on atomic tasks (below 50%).

02. Models perform poorly on workflows (14.39%).

03. Checkpoint-guided feedback and advanced frameworks improve performance.

Abstract

The development of general-purpose agents requires a shift from executing simple instructions to completing complex, real-world productivity workflows. However, current tool-use benchmarks remain misaligned with real-world requirements, relying on AI-generated queries, dummy tools, and limited system-level coordination. To address this, we propose GTA-2, a hierarchical benchmark for General Tool Agents (GTA) spanning atomic tool use and open-ended workflows. Built on real-world authenticity, it leverages real user queries, deployed tools, and multimodal contexts. (i) GTA-Atomic, inherited from our prior GTA benchmark, evaluates short-horizon, closed-ended tool-use precision. (ii) GTA-Workflow introduces long-horizon, open-ended tasks for realistic end-to-end completion. To evaluate open-ended deliverables, we propose a recursive checkpoint-based evaluation mechanism that decomposes…
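The abstract names a recursive checkpoint-based evaluation mechanism but is cut off before describing it. As a rough illustration only, the sketch below assumes one plausible shape for such a scheme: a task is broken into a tree of checkpoints, leaves are marked pass or fail, and each parent's score is the weighted mean of its children's scores. The `Checkpoint` class, the `weight` field, the weighted-mean aggregation, and the example checkpoints are all assumptions made for illustration; none of them is taken from the paper.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Checkpoint:
    """Hypothetical node in a checkpoint tree (not the paper's schema)."""
    name: str
    weight: float = 1.0              # assumed relative importance among siblings
    passed: Optional[bool] = None    # only meaningful on leaf checkpoints
    children: List["Checkpoint"] = field(default_factory=list)


def score(node: Checkpoint) -> float:
    """Return a score in [0, 1] for a checkpoint.

    Leaf: 1.0 if it passed, otherwise 0.0 (an unset leaf counts as failed).
    Internal node: weight-normalized mean of its children's scores,
    computed recursively. This aggregation rule is an assumption.
    """
    if not node.children:
        return 1.0 if node.passed else 0.0
    total_weight = sum(child.weight for child in node.children)
    if total_weight == 0:
        return 0.0
    weighted = sum(child.weight * score(child) for child in node.children)
    return weighted / total_weight


# Made-up example task; the checkpoint names are purely illustrative.
report = Checkpoint(
    name="produce a summary report",
    children=[
        Checkpoint(name="output file exists", passed=True),
        Checkpoint(
            name="content is correct",
            weight=2.0,
            children=[
                Checkpoint(name="covers all input documents", passed=True),
                Checkpoint(name="figures match source data", passed=False),
            ],
        ),
    ],
)

print(round(score(report), 3))  # prints 0.667
```

In this toy tree the nested checkpoint scores 0.5, so the root scores (1 × 1.0 + 2 × 0.5) / 3 ≈ 0.667. A graded score of this kind gives partial credit for an open-ended deliverable instead of a single pass/fail verdict; whether GTA-2 aggregates checkpoints this way is not stated in the text above.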

Code & Models

Repositories

open-compass/GTA (GitHub)
