ETOM: A Five-Level Benchmark for Evaluating Tool Orchestration within the MCP Ecosystem
Jia-Kai Dong, I-Wei Huang, Chun-Tin Wu, Yi-Tien Tsai

TL;DR
ETOM is a five-level benchmark for evaluating multi-hop, end-to-end tool orchestration by LLM agents within a hierarchical MCP ecosystem. It targets gaps left by benchmarks that assess tools in isolation.
Contribution
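ETOM introduces a hierarchical benchmark whose ground truth is built from "equal function sets", which supports objective metrics such as F1 score. Its five levels systematically test agent capabilities and robustness, from single-tool orchestration to complex cross-server planning.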
Findings
Rigid hierarchies can hinder performance without co-designed strategies
State-of-the-art agents show systemic weaknesses in robustness
ETOM exposes limitations and guides development of better agents
Abstract
We introduce ETOM, a five-level benchmark for evaluating multi-hop, end-to-end tool orchestration by LLM agents within a hierarchical Model Context Protocol (MCP) ecosystem. Existing benchmarks often assess tools in isolation, overlooking challenges such as functional overlap and cross-server orchestration, which can lead to overly optimistic evaluations. ETOM addresses these gaps by constructing ground truth through "equal function sets", enabling objective metrics such as F1 score and reducing reliance on LLM-as-a-judge evaluation. Its five-level curriculum systematically tests agent capabilities, from single-tool orchestration to complex cross-server planning, as well as robustness to out-of-scope requests. Experiments reveal that rigid hierarchies can hinder performance without co-designed strategies, and even state-of-the-art agents exhibit systemic weaknesses in robustness. ETOM…
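To make the idea of scoring against "equal function sets" concrete, here is a minimal sketch of how an F1 score could be computed when several tools are interchangeable. It is an illustration under stated assumptions, not the paper's scoring code: the function name, the tool names, and the matching rule (each required set is satisfied by at most one predicted tool, and the sets are disjoint) are all hypothetical.

```python
from typing import Iterable, List, Set, Tuple


def f1_against_equal_function_sets(
    predicted: Iterable[str],
    required: List[Set[str]],
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for a single task.

    Assumption (not from the paper): `required` holds one set per needed
    function, each set listing tools that are functionally equivalent.
    The sets are assumed disjoint, so greedy matching is sufficient.
    """
    # Deduplicate predictions while preserving their order.
    tools = list(dict.fromkeys(predicted))
    satisfied = [False] * len(required)
    true_positives = 0

    for tool in tools:
        for i, group in enumerate(required):
            # A prediction is correct if it fills a still-open required set.
            if not satisfied[i] and tool in group:
                satisfied[i] = True
                true_positives += 1
                break

    precision = true_positives / len(tools) if tools else 0.0
    recall = true_positives / len(required) if required else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical task: geocode a place, then fetch a forecast.
required_sets = [{"maps.geocode", "geo.lookup"}, {"weather.forecast"}]
agent_calls = ["geo.lookup", "calendar.create"]
precision, recall, f1 = f1_against_equal_function_sets(agent_calls, required_sets)
```

In this example one of two predicted tools fills a required set and one of two required sets is filled, so precision, recall, and F1 are all 0.5. Because either `maps.geocode` or `geo.lookup` would count, an agent is not penalized for choosing among functionally overlapping tools, which is the property that lets such a metric avoid an LLM judge.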
Taxonomy
Topics: Multi-Agent Systems and Negotiation · Mobile Agent-Based Network Management · Software System Performance and Reliability
