MCPToolBench++: A Large Scale AI Agent Model Context Protocol MCP Tool Use Benchmark

Shiqing Fan; Xichen Ding; Liang Zhang; Linjian Mo

arXiv:2508.07575 · cs.AI · August 12, 2025

TL;DR

MCPToolBench++ is a comprehensive benchmark designed to evaluate AI agents' ability to effectively use a large variety of MCP tools across multiple domains, addressing current evaluation gaps.

Contribution

The paper introduces MCPToolBench++, a large-scale, multi-domain benchmark with over 4,000 MCP servers, to evaluate LLMs' MCP tool use capabilities comprehensively.

Findings

01. State-of-the-art LLMs show varied success rates across MCP tools.

02. The benchmark covers more than 40 categories and includes both single-step and multi-step tool calls.

03. The benchmark provides insights into LLMs' real-world MCP tool integration performance.

Abstract

LLMs' capabilities are enhanced by using function calls to integrate various data sources or API results into the context window. Typical tools include search, web crawlers, maps, financial data, file systems, and browser usage. Integrating these data sources or functions requires a standardized method. The Model Context Protocol (MCP) provides a standardized way to supply context to LLMs. However, the evaluation of LLMs' and AI agents' MCP tool use abilities suffers from several issues. First, there is a lack of comprehensive datasets or benchmarks to evaluate various MCP tools. Second, the diverse response formats of MCP tool call executions further increase the difficulty of evaluation. Additionally, unlike existing tool-use benchmarks with high success rates in functions like programming and math, the success rate of real-world MCP tools is not guaranteed and varies…

Taxonomy

Topics: Mobile Agent-Based Network Management · Software System Performance and Reliability · Scientific Computing and Data Management