MCPAgentBench: A Real-world Task Benchmark for Evaluating LLM Agent MCP Tool Use
Wenrui Liu, Zixiang Liu, Elsie Dai, Wenhan Yu, Lei Yu, Tong Yang, Jinjun Han, Hong Gao

TL;DR
MCPAgentBench is a benchmark for evaluating the tool-use capabilities of LLM-based agents in real-world scenarios. It addresses limitations of existing MCP evaluation sets, such as reliance on external MCP services and a lack of difficulty awareness, by combining authentic tasks, simulated MCP tools, and a dynamic sandbox environment.
Contribution
The paper introduces MCPAgentBench, a benchmark of real-world MCP tasks and simulated tools, with metrics that assess LLM agents' tool selection and execution efficiency.
Findings
LLMs show significant performance differences on complex tasks
The benchmark effectively evaluates agents' tool discrimination abilities
The benchmark is intended to support future LLM agent development
Abstract
Large Language Models (LLMs) are increasingly serving as autonomous agents, and their utilization of external tools via the Model Context Protocol (MCP) is considered a future trend. Current MCP evaluation sets suffer from issues such as reliance on external MCP services and a lack of difficulty awareness. To address these limitations, we propose MCPAgentBench, a benchmark based on real-world MCP definitions designed to evaluate the tool-use capabilities of agents. We construct a dataset containing authentic tasks and simulated MCP tools. The evaluation employs a dynamic sandbox environment that presents agents with candidate tool lists containing distractors, thereby testing their tool selection and discrimination abilities. Furthermore, we introduce comprehensive metrics to measure both task completion rates and execution efficiency. Experiments conducted on various latest mainstream…
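The candidate-list idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the tool names, the sampling strategy, and the exact-match check are all assumptions made for illustration, and the check is a simplification rather than one of the paper's metrics.

```python
import random


def build_candidate_tools(required, tool_pool, n_distractors, seed=0):
    """Mix the tools a task needs with randomly drawn distractors.

    Hypothetical helper: the paper's sandbox may select distractors differently.
    """
    rng = random.Random(seed)
    # Any pool tool that the task does not require can serve as a distractor.
    distractor_pool = [t for t in tool_pool if t not in required]
    k = min(n_distractors, len(distractor_pool))
    distractors = rng.sample(distractor_pool, k)
    candidates = list(required) + distractors
    rng.shuffle(candidates)  # avoid leaking the position of the required tools
    return candidates


def selected_correctly(called, required):
    """Simplified check: the agent called exactly the required tools (order ignored)."""
    return set(called) == set(required)


# Illustrative tool names, not taken from the benchmark.
pool = ["get_weather", "search_flights", "send_email", "get_stock_price"]
candidates = build_candidate_tools(["get_weather"], pool, n_distractors=2)
print(candidates)
print(selected_correctly(["get_weather"], ["get_weather"]))
```

A fixed seed keeps the candidate list reproducible across runs, which matters when comparing models on the same task.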
Taxonomy
Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Multimodal Machine Learning Applications
