MCPToolBench++: A Large Scale AI Agent Model Context Protocol MCP Tool Use Benchmark
Shiqing Fan, Xichen Ding, Liang Zhang, Linjian Mo

TL;DR
MCPToolBench++ is a comprehensive benchmark designed to evaluate AI agents' ability to effectively use a large variety of MCP tools across multiple domains, addressing current evaluation gaps.
Contribution
The paper introduces MCPToolBench++, a large-scale, multi-domain benchmark with over 4,000 MCP servers, to evaluate LLMs' MCP tool use capabilities comprehensively.
Findings
State-of-the-art LLMs show varied success rates across MCP tools.
Benchmark covers over 40 categories and includes single-step and multi-step calls.
Provides insights into LLMs' real-world MCP tool integration performance.
Abstract
LLMs' capabilities are enhanced by using function calls to integrate various data sources or API results into the context window. Typical tools include search, web crawlers, maps, financial data, file systems, and browser usage. Integrating these data sources or functions requires a standardized method. The Model Context Protocol (MCP) provides a standardized way to supply context to LLMs. However, the evaluation of LLMs' and AI agents' MCP tool use abilities suffers from several issues. First, there is a lack of comprehensive datasets or benchmarks for evaluating various MCP tools. Second, the diverse response formats returned by MCP tool call execution further increase the difficulty of evaluation. Additionally, unlike existing tool-use benchmarks with high success rates in areas such as programming and math functions, the success rate of real-world MCP tools is not guaranteed and varies…
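To make the tool-call flow described in the abstract concrete, here is a minimal Python sketch, not the benchmark's own code. It assumes the MCP convention of a JSON-RPC 2.0 `tools/call` request and a result object that may carry an `isError` flag. The tool name `maps_geocode`, the helper names, and the mock responses are hypothetical and exist only for illustration.

```python
import json


def build_tool_call(request_id, tool_name, arguments):
    """Build a JSON-RPC 2.0 request for the MCP `tools/call` method."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


def call_succeeded(response):
    """Count a call as successful only if it returned a result not flagged as an error."""
    if "error" in response:  # protocol-level JSON-RPC error
        return False
    result = response.get("result")
    return result is not None and not result.get("isError", False)


def success_rate(responses):
    """Fraction of responses that count as successful (0.0 for an empty list)."""
    if not responses:
        return 0.0
    return sum(call_succeeded(r) for r in responses) / len(responses)


# Hypothetical tool name and arguments, for illustration only.
request = build_tool_call(1, "maps_geocode", {"address": "1 Main St"})
print(json.dumps(request, indent=2))

# Mock responses standing in for real server output (assumed shapes).
mock_responses = [
    {"jsonrpc": "2.0", "id": 1,
     "result": {"content": [{"type": "text", "text": "ok"}], "isError": False}},
    {"jsonrpc": "2.0", "id": 2,
     "error": {"code": -32602, "message": "Invalid params"}},
    {"jsonrpc": "2.0", "id": 3,
     "result": {"content": [{"type": "text", "text": "quota exceeded"}], "isError": True}},
]
rate = success_rate(mock_responses)
```

The sketch separates two failure modes, a protocol-level error and a tool-level result flagged with `isError`, which is one simple way to see why the success rate of real-world tools can vary independently of whether the model chose the right tool.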
Taxonomy
Topics: Mobile Agent-Based Network Management · Software System Performance and Reliability · Scientific Computing and Data Management
