LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries

Ming Yin; Dinghan Shen; Silei Xu; Jianbing Han; Sixun Dong; Mian Zhang; Yebowen Hu; Shujian Liu; Simin Ma; Song Wang; Sathish Reddy Indurthi; Xun Wang; Yiran Chen; Kaiqiang Song

arXiv:2508.15760·cs.CL·August 22, 2025

Open Access · 3 Reviews

TL;DR

LiveMCP-101 is a benchmark for evaluating AI agents' ability to solve complex, multi-step tasks using MCP tools in realistic scenarios, revealing significant challenges and guiding future improvements.

Contribution

The paper introduces LiveMCP-101, a comprehensive benchmark with a novel evaluation method based on ground-truth plans, to assess and improve MCP-enabled AI agent performance.

Findings

01

Frontier LLMs achieve a success rate below 60% on the benchmark.

02

Major challenges identified in tool orchestration and token efficiency.

03

Error analysis highlights specific failure modes and areas for model improvement.

Abstract

Tool calling has emerged as a critical capability for AI agents to interact with the real world and solve complex tasks. While the Model Context Protocol (MCP) provides a powerful standardized framework for tool integration, there is a significant gap in benchmarking how well AI agents can effectively solve multi-step tasks using diverse MCP tools in realistic, dynamic scenarios. In this work, we present LiveMCP-101, a benchmark of 101 carefully curated real-world queries, refined through iterative LLM rewriting and manual review, that require coordinated use of multiple MCP tools including web search, file operations, mathematical reasoning, and data analysis. Moreover, we introduce a novel evaluation approach that leverages ground-truth execution plans rather than raw API outputs, better reflecting the evolving nature of real-world environments. Experiments show that even frontier…
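As a rough illustration of the plan-based evaluation idea described in the abstract, the sketch below runs a reference agent that follows a fixed ground-truth plan at the same time as the agent under test, so both see the same live tool state, and then compares their final answers. Everything here is an assumption made for illustration: the names `Tool`, `Plan`, `run_reference`, and `evaluate` are hypothetical, and the exact-match comparison is a stand-in for whatever scoring the paper actually uses.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Tuple

# Hypothetical types: a tool is an async callable taking one string argument;
# a plan is an ordered list of (tool name, argument) steps.
Tool = Callable[[str], Awaitable[str]]
Plan = List[Tuple[str, str]]
Agent = Callable[[Dict[str, Tool]], Awaitable[str]]


async def run_reference(plan: Plan, tools: Dict[str, Tool]) -> str:
    """Execute the validated plan step by step and return the last tool output."""
    result = ""
    for tool_name, arg in plan:
        result = await tools[tool_name](arg)
    return result


async def evaluate(agent: Agent, plan: Plan, tools: Dict[str, Tool]) -> bool:
    """Run the reference plan and the agent concurrently, then compare answers.

    Exact string match is an illustrative placeholder for the real scoring.
    """
    reference_answer, agent_answer = await asyncio.gather(
        run_reference(plan, tools),
        agent(tools),
    )
    return reference_answer.strip() == agent_answer.strip()
```

Because the reference answer is produced at evaluation time rather than stored, a tool whose output changes from day to day does not by itself invalidate the comparison; only the plan has to remain valid.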

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 4

Strengths

1. A large number of MCP servers and tools are considered, making the benchmark comprehensive. 2. A new evaluation framework is proposed to handle dynamic tool outputs. 3. Extensive experiments are conducted to benchmark LLMs.

Weaknesses

1. **Weak Motivation:** From the perspective of LLMs, there is no apparent difference between using MCP and function calling, so I am not convinced by the motivation of this paper. Why do we need MCP to benchmark LLMs when many function-calling benchmarks already exist? For example, the latest BFCL V4 benchmark already covers multi-turn and complex tool usage scenarios. The authors should clarify the motivation of this paper. Note that the cited works

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. The new benchmark offers more challenging test tasks, as evidenced by the larger average number of tool-calling steps required and the lower success rates of mainstream LLMs. Given the rapid development of LLMs and the quick saturation of evaluations, this challenging new benchmark is valuable for advancing research on agentic LLMs. 2. The parallel evaluation framework provides a practical assessment for time-sensitive tasks. 3. The paper is overall well written. The core ideas and evaluation details are wel

Weaknesses

1. Task distribution and quality are crucial for agent evaluation benchmarks. As described in 3.1, LiveMCP-101 uses queries generated by the OpenAI o3 model, but the details of the generation process remain unclear (e.g., workflows and key prompts). Using synthetic task queries may also raise concerns, as these test cases may deviate from real user needs or be biased towards the LLM used for synthesis. Given that existing agent benchmarks like GAIA and SWE-bench provide test tasks from real people, ca

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. This paper addresses the flaws of existing MCP benchmarks (static, single-step) by proposing LiveMCP-101, a benchmark for dynamic real-world scenarios. It covers multi-step, cross-domain tasks and aligns with the practical deployment needs of agents. 2. The parallel real-time evaluation (synchronized execution of dual agents) avoids the timeliness bias of dynamic data. The validated execution plans provide reliable evaluation anchors, making the design innovative. 3. The experimental sect

Weaknesses

1. The paper lacks verification of differences across multiple LLM judges. 2. Over time, dynamic APIs may change their call logic. This raises the question of whether the current dual-agent verification framework (synchronized execution of reference and evaluated agents) remains feasible, since pre-validated execution plans for reference agents could become obsolete after API changes, undermining the accuracy of real-time result alignment.

Taxonomy

Topics: Semantic Web and Ontologies · Advanced Database Systems and Queries · AI-based Problem Solving and Planning