ETOM: A Five-Level Benchmark for Evaluating Tool Orchestration within the MCP Ecosystem

Jia-Kai Dong; I-Wei Huang; Chun-Tin Wu; Yi-Tien Tsai

arXiv:2510.19423·cs.AI·January 21, 2026

TL;DR

ETOM is a comprehensive five-level benchmark designed to evaluate multi-hop, end-to-end tool orchestration by LLM agents within a hierarchical MCP ecosystem, addressing existing evaluation gaps.

Contribution

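The paper introduces a hierarchical benchmark whose ground truth is built from "equal function sets", which enables objective metrics such as F1 score, and whose five levels systematically test agent capabilities and robustness, from single-tool orchestration to complex cross-server planning.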

Findings

1. Rigid hierarchies can hinder performance without co-designed strategies.
2. State-of-the-art agents show systemic weaknesses in robustness.
3. ETOM exposes limitations and guides development of better agents.

Abstract

We introduce ETOM, a five-level benchmark for evaluating multi-hop, end-to-end tool orchestration by LLM agents within a hierarchical Model-Context Protocol (MCP) ecosystem. Existing benchmarks often assess tools in isolation, overlooking challenges such as functional overlap and cross-server orchestration, which can lead to overly optimistic evaluations. ETOM addresses these gaps by constructing ground truth through "equal function sets", enabling objective metrics such as F1 score and reducing reliance on LLM-as-a-judge evaluation. Its five-level curriculum systematically tests agent capabilities, from single-tool orchestration to complex cross-server planning, as well as robustness to out-of-scope requests. Experiments reveal that rigid hierarchies can hinder performance without co-designed strategies, and even state-of-the-art agents exhibit systemic weaknesses in robustness. ETOM…
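The abstract does not spell out how F1 is computed against "equal function sets", so the following is only a minimal sketch of one plausible reading: each required step is represented by a set of interchangeable tools, and a predicted tool counts as correct if it falls in a not-yet-matched set. The function name, the tool names, the greedy matching, the assumption that the sets are disjoint, and the convention that an empty prediction for an out-of-scope request scores 1.0 are all illustrative assumptions, not the authors' implementation.

```python
from typing import List, Set, Tuple


def f1_against_equal_function_sets(
    predicted: List[str],
    required_sets: List[Set[str]],
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for one task.

    Hypothetical scoring rule: every entry of `required_sets` is a group of
    functionally equivalent tools, and any one member satisfies that step.
    Sets are assumed to be disjoint, so greedy matching is sufficient.
    """
    # Assumed convention: an out-of-scope request has no required sets,
    # and calling no tools is the correct behavior.
    if not predicted and not required_sets:
        return 1.0, 1.0, 1.0

    unmatched = list(range(len(required_sets)))
    true_positives = 0
    for tool in predicted:
        for idx in unmatched:
            if tool in required_sets[idx]:
                true_positives += 1
                unmatched.remove(idx)  # each required step is matched once
                break

    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(required_sets) if required_sets else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical tool names, for illustration only.
    required = [
        {"maps.search", "geo.lookup"},
        {"weather.get", "forecast.daily"},
    ]
    predicted = ["geo.lookup", "weather.get", "calendar.add"]
    print(f1_against_equal_function_sets(predicted, required))
```

In this example two of the three predicted tools satisfy a required step, so precision is 2/3, recall is 1, and F1 is 0.8. Because either member of an equal function set is accepted, the score does not depend on which equivalent tool the agent picked, which is what lets a set-based metric stand in for an LLM-as-a-judge comparison.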

Taxonomy

Topics: Multi-Agent Systems and Negotiation · Mobile Agent-Based Network Management · Software System Performance and Reliability