ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox

Yuanyang Li; Xue Yang; Longyue Wang; Weihua Luo; Hongyang Chen

arXiv:2605.10787·cs.AI·May 21, 2026

TL;DR

ComplexMCP is a benchmark that evaluates how well LLM agents operate in complex, interdependent tool environments with environmental noise and API failures. Even top-tier models stay below a 60% success rate.

Contribution

The paper introduces ComplexMCP, a benchmark of over 300 tools drawn from 7 stateful sandboxes, with seed-driven simulation of dynamic environment states and API failures, to assess LLM agent performance in realistic scenarios.

Findings

01

Even top-tier models fail to exceed a 60% success rate.

02

The performance gap between models and humans is significant.

03

Identified bottlenecks include tool retrieval saturation, over-confidence, and strategic defeatism.

Abstract

Current LLM agents are proficient at calling isolated APIs but struggle with the "last mile" of commercial software automation. In real-world scenarios, tools are not independent; they are atomic, interdependent, and prone to environmental noise. We introduce ComplexMCP, a benchmark designed to evaluate agents in these rigorous conditions. Built on the Model Context Protocol (MCP), ComplexMCP provides over 300 meticulously tested tools derived from 7 stateful sandboxes, ranging from office suites to financial systems. Unlike existing datasets, our benchmark utilizes a seed-driven architecture to simulate dynamic environment states and unpredictable API failures, ensuring a deterministic yet diverse evaluation. We evaluate various LLMs across full-context and RAG paradigms, revealing a stark performance gap: even top-tier models fail to exceed a 60% success rate,…
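The abstract's "deterministic yet diverse" property can be illustrated with a small sketch. This is not the paper's implementation: the `SeededSandbox` class, its `withdraw` tool, the `failure_rate` parameter, and the `run_episode` helper are all hypothetical names chosen for illustration. The only idea taken from the abstract is that a single seed drives both the initial environment state and the injected API failures, so the same seed replays the same episode while different seeds produce different ones.

```python
import random
from dataclasses import dataclass, field


class ToolFailure(Exception):
    """Raised when the sandbox injects a simulated API failure."""


@dataclass
class SeededSandbox:
    """Hypothetical stateful sandbox whose state and failures derive from one seed."""
    seed: int
    failure_rate: float = 0.2  # assumed value, not taken from the paper
    balance: int = field(init=False)
    rng: random.Random = field(init=False, repr=False)

    def __post_init__(self):
        # A private RNG keeps the sandbox independent of global random state.
        self.rng = random.Random(self.seed)
        # The initial environment state is drawn from the seed.
        self.balance = self.rng.randint(100, 1000)

    def withdraw(self, amount):
        # Failure injection is drawn from the same seeded stream.
        if self.rng.random() < self.failure_rate:
            raise ToolFailure("simulated transient API error")
        if amount > self.balance:
            raise ValueError("insufficient funds")
        self.balance -= amount
        return self.balance


def run_episode(seed, calls=5):
    """Replay a fixed call sequence and record what the agent would observe."""
    sandbox = SeededSandbox(seed)
    trace = [("init", sandbox.balance)]
    for _ in range(calls):
        try:
            trace.append(("ok", sandbox.withdraw(10)))
        except ToolFailure:
            trace.append(("fail", sandbox.balance))
    return trace
```

Calling `run_episode(7)` twice yields identical traces, which is what makes a run reproducible, while sweeping the seed varies both the starting balance and the points at which calls fail.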
