MCPAgentBench: A Real-world Task Benchmark for Evaluating LLM Agent MCP Tool Use

Wenrui Liu; Zixiang Liu; Elsie Dai; Wenhan Yu; Lei Yu; Tong Yang; Jinjun Han; Hong Gao

arXiv:2512.24565·cs.AI·January 22, 2026

Open Access

TL;DR

MCPAgentBench is a new benchmark designed to evaluate the tool-use capabilities of LLM-based agents in real-world scenarios, addressing current evaluation limitations with authentic tasks, simulated MCP tools, and dynamic testing environments.

Contribution

It introduces MCPAgentBench, a comprehensive benchmark with real-world MCP tasks, simulated tools, and metrics to assess LLM agents' tool selection and execution efficiency.

Findings

01. Significant performance differences among LLMs in complex tasks

02. Effective evaluation of tool discrimination abilities

03. Benchmark facilitates future LLM agent development

Abstract

Large Language Models (LLMs) are increasingly serving as autonomous agents, and their utilization of external tools via the Model Context Protocol (MCP) is considered a future trend. Current MCP evaluation sets suffer from issues such as reliance on external MCP services and a lack of difficulty awareness. To address these limitations, we propose MCPAgentBench, a benchmark based on real-world MCP definitions designed to evaluate the tool-use capabilities of agents. We construct a dataset containing authentic tasks and simulated MCP tools. The evaluation employs a dynamic sandbox environment that presents agents with candidate tool lists containing distractors, thereby testing their tool selection and discrimination abilities. Furthermore, we introduce comprehensive metrics to measure both task completion rates and execution efficiency. Experiments conducted on various latest mainstream…
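
The distractor-based selection test described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Task` record, the tool names, the strict exact-match check, and the way distractors are sampled are all assumptions made for illustration, and the paper's efficiency metrics are not modeled here.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    # Hypothetical task record; the field names are assumptions,
    # not the schema used by MCPAgentBench.
    query: str
    required_tools: frozenset


def build_candidate_list(task, tool_pool, n_distractors, seed=0):
    """Mix the tools a task needs with randomly drawn distractor tools."""
    rng = random.Random(seed)
    # Distractors are any pool tools the task does not require.
    distractor_pool = sorted(set(tool_pool) - task.required_tools)
    k = min(n_distractors, len(distractor_pool))
    candidates = sorted(task.required_tools) + rng.sample(distractor_pool, k)
    rng.shuffle(candidates)  # hide the position of the correct tools
    return candidates


def selection_is_correct(task, called_tools):
    """Assumed strict check: exactly the required tools, no distractors."""
    return set(called_tools) == set(task.required_tools)


def completion_rate(results):
    """Fraction of tasks judged correct; 0.0 for an empty result list."""
    return sum(results) / len(results) if results else 0.0


# Usage with made-up tool names.
task = Task("What is the weather in Paris?", frozenset({"get_weather"}))
pool = ["get_weather", "get_forecast_history", "search_flights", "convert_currency"]
candidates = build_candidate_list(task, pool, n_distractors=2)
results = [
    selection_is_correct(task, ["get_weather"]),                    # correct call
    selection_is_correct(task, ["get_weather", "search_flights"]),  # used a distractor
]
print(candidates, completion_rate(results))
```

Fixing the random seed keeps the candidate list reproducible across runs, which matters when comparing agents on the same task; a real harness would presumably also record the number of tool calls or steps to capture efficiency.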

Taxonomy

Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Multimodal Machine Learning Applications