The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution

Junlong Li; Wenshuo Zhao; Jian Zhao; Weihao Zeng; Haoze Wu; Xiaochen Wang; Rui Ge; Yuxuan Cao; Yuzhen Huang; Wei Liu; Junteng Liu; Zhaochen Su; Yiyang Guo; Fan Zhou; Lueyang Zhang; Juan Michelini; Xingyao Wang; Xiang Yue; Shuyan Zhou; Graham Neubig; Junxian He

arXiv:2510.25726·cs.CL·February 27, 2026

TL;DR

The paper introduces Toolathlon, a comprehensive benchmark with diverse, realistic tasks and environments to evaluate and improve language agents' ability to perform complex, multi-step real-world tasks across various applications.

Contribution

It presents Toolathlon, a new benchmark with diverse applications, realistic initial states, and long-horizon tasks, addressing limitations of prior narrow-domain benchmarks.

Findings

01

Current SOTA models perform poorly on Toolathlon, with success rates below 40%.

02

Most models need around 20 tool calls per task, which reflects the long-horizon nature of the tasks.

03

Benchmark reveals significant gaps in current language agent capabilities.

Abstract

Real-world language agents must handle complex, multi-step workflows across diverse Apps. For instance, an agent may manage emails by coordinating with calendars and file systems, or monitor a production database to detect anomalies and generate reports following an operating manual. However, existing language agent benchmarks often focus on narrow domains or simplified tasks that lack the diversity, realism, and long-horizon complexity required to evaluate agents' real-world performance. To address this gap, we introduce the Tool Decathlon (dubbed as Toolathlon), a benchmark for language agents offering diverse Apps and tools, realistic environment setup, and reliable execution-based evaluation. Toolathlon spans 32 software applications and 604 tools, ranging from everyday platforms such as Google Calendar and Notion to professional ones like WooCommerce, Kubernetes, and BigQuery. Most…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 2 · Confidence 5

Strengths

+ TOOLATHLON spans 32 real-world applications, demonstrating tool and environment diversity.
+ Execution-Based Evaluation: TOOLATHLON uses deterministic evaluation scripts that compare environment states, ensuring objectivity and reproducibility.
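
To make the idea of state-comparison evaluation concrete, here is a minimal sketch. It is not Toolathlon's actual evaluation code: the function name `evaluate_state` and the flat dictionary representation of an environment state are assumptions chosen purely for illustration.

```python
from typing import Any, Dict

# Sentinel so that a missing key is never confused with an expected None value.
_MISSING = object()


def evaluate_state(final_state: Dict[str, Any], expected_state: Dict[str, Any]) -> bool:
    """Hypothetical checker: pass only if every expected key holds the expected value.

    Keys present in final_state but absent from expected_state are ignored,
    so the check constrains only what the task cares about.
    """
    for key, expected in expected_state.items():
        if final_state.get(key, _MISSING) != expected:
            return False
    return True
```

Because the check depends only on the final state and not on the agent's transcript, the same trajectory always receives the same verdict, which is the property the reviewer highlights.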

Weaknesses

Benchmark Labeling and Definitions: The categorization of existing benchmarks in Table 1 as “Real Tools” or “Not-Real Tools” appears inconsistent and potentially misleading. For example, τ-Bench is flagged as not-real despite subsets (“airline,” “retail”) interacting with actual databases. Similarly, BFCL is marked not-real, although it supports real execution in the “Execute” category, with "Crowd Sourced" being community-contributed tools. LiveMCPBench is labeled not-real despite its claim to

Reviewer 02 · Rating 6 · Confidence 4

Strengths

- A strong benchmark backed by heavy engineering; it will be of good use to the community.

Weaknesses

- Not really a weakness but the paper uses the average number of turns as a proxy for task difficulty. While reasonable, this is an outcome-based metric that can be influenced by the agent's (in)efficiency. A more intrinsic, task-defined complexity metric (e.g., based on the number of required applications, minimum number of steps in a ground-truth trajectory) could provide a slightly more objective measure of difficulty when analyzing performance across Easy/Medium/Hard tasks.
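
One way to read the reviewer's suggestion is as a score computed from the task definition alone. The sketch below is illustrative only: `TaskSpec`, `intrinsic_difficulty`, and the linear weighting are assumptions, not something the paper or the review specifies.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class TaskSpec:
    """Hypothetical task description, independent of any agent's behaviour."""
    required_apps: FrozenSet[str]   # applications the task must touch
    min_reference_steps: int        # steps in the shortest ground-truth trajectory


def intrinsic_difficulty(task: TaskSpec, app_weight: float = 1.0, step_weight: float = 1.0) -> float:
    """Weighted sum of required applications and minimum reference steps (assumed form)."""
    return app_weight * len(task.required_apps) + step_weight * task.min_reference_steps
```

Unlike the average number of turns, a score of this kind does not change when an agent wanders or retries, so it separates task complexity from agent inefficiency.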

Reviewer 03 · Rating 4 · Confidence 3

Strengths

1. The benchmark contains a substantially larger number of tasks than prior MCP benchmarks and evaluates models across realistic settings, including real-world state initialization and fuzzy instructions.
2. Rigorous task validation and design make the benchmark reliable and valuable for MCP-agent research.
3. The authors benchmark a wide range of state-of-the-art LLMs and provide detailed analysis, including case studies of failure modes and in-depth discussion of experimental results.

Weaknesses

The current evaluation setup for LLM agents with tools is relatively simple. The authors specify all tools for a task and provide the full tool list to the agent, which can significantly inflate the input context. More sophisticated tool-selection strategies could be explored—for example, using retrieval methods to surface relevant tools dynamically rather than supplying the entire tool inventory upfront.
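
The retrieval idea the reviewer describes could be sketched as follows. This is a deliberately simple token-overlap ranker, standing in for whatever retrieval method one might actually use; `retrieve_tools`, `tokenize`, and the example tool names are hypothetical and not taken from the benchmark.

```python
import re
from typing import Dict, List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve_tools(task: str, tools: Dict[str, str], k: int = 2) -> List[str]:
    """Return up to k tool names whose name and description share the most tokens with the task.

    `tools` maps a tool name to its description. Tools with no overlap are dropped.
    Ties are broken alphabetically so the result is deterministic.
    """
    task_tokens = tokenize(task)
    scored = []
    for name, description in tools.items():
        overlap = len(task_tokens & tokenize(name + " " + description))
        scored.append((overlap, name))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for overlap, name in scored[:k] if overlap > 0]


# Hypothetical tool inventory, for illustration only.
example_tools = {
    "calendar_create_event": "Create a calendar event",
    "email_send": "Send an email message",
    "k8s_list_pods": "List Kubernetes pods",
}
selected = retrieve_tools("Send an email to the team about the calendar event", example_tools)
```

Only the selected tool schemas would then be placed in the agent's context, trading some risk of missing a needed tool for a shorter prompt.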

Code & Models

Datasets

hkust-nlp/Toolathlon-Trajectories
dataset· 2.3k dl
