TL;DR
MCPVerse is a real-world benchmark that integrates more than 550 executable tools, whose definitions span an action space of over 140k tokens. It is designed to evaluate and improve large language models' ability to use external tools effectively in complex, time-sensitive tasks.
Contribution
This paper introduces MCPVerse, the first expansive, real-world benchmark with a large tool set and outcome-based evaluation for assessing agentic tool use in LLMs.
Findings
Most models' performance drops with larger tool sets.
Claude-4-Sonnet effectively leverages expanded exploration.
MCPVerse sets a new standard for real-world agentic tool use evaluation.
Abstract
Large Language Models (LLMs) are evolving from text generators into reasoning agents. This transition makes their ability to use external tools a critical capability. However, evaluating this skill presents a significant challenge. Existing benchmarks are often limited by their reliance on synthetic tools and severely constrained action spaces. To address these limitations, we introduce MCPVerse, an expansive, real-world benchmark for evaluating agentic tool use. MCPVerse integrates more than 550 real-world, executable tools to create an unprecedented action space exceeding 140k tokens, and employs outcome-based evaluation with real-time ground truth for time-sensitive tasks. We benchmarked the state-of-the-art LLMs across three modes (Oracle, Standard, and Max-Scale), revealing that while most models suffer performance degradation when confronted with larger tool sets, the agentic…
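To give a feel for why an action space of this size matters, the following minimal sketch estimates how many prompt tokens a set of tool definitions consumes as the tool count grows. It rests on assumptions that do not come from the paper: the tool schema produced by `make_tool` is hypothetical, and the four-characters-per-token ratio in `estimate_tokens` is only a rough heuristic, since real tokenizers vary by model.

```python
import json


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate (assumed heuristic; real tokenizers differ)."""
    return int(len(text) / chars_per_token)


def make_tool(i: int) -> dict:
    """Build a hypothetical tool definition in a JSON-schema style."""
    return {
        "name": f"tool_{i}",
        "description": "Hypothetical tool used only to illustrate prompt growth.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Free-text input."}
            },
            "required": ["query"],
        },
    }


def tool_prompt_tokens(n_tools: int) -> int:
    """Estimate the tokens needed to serialize n_tools definitions."""
    serialized = json.dumps([make_tool(i) for i in range(n_tools)])
    return estimate_tokens(serialized)


if __name__ == "__main__":
    # Compare a small tool set with one at the scale the abstract describes.
    for n in (10, 100, 550):
        print(f"{n} tools -> about {tool_prompt_tokens(n)} tokens")
```

The estimate grows roughly linearly with the number of tools, so a model given several hundred real tool definitions spends a large share of its context window on the tool list before any task text is read.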
Peer Reviews
Decision: Submitted to ICLR 2026
Relevance: MCPs are gaining popularity, so measuring agentic tool use in the context of MCPs makes a lot of sense and is timely. Unsaturated: The maximum score on the benchmark is still below 70%, so the benchmark is far from saturated. Model evaluation: A diverse set of models is evaluated, which is great to see. The specific evaluation of large context is also useful. Judging: Model scores seem relatively stable across judge models (e.g., GPT-4o vs. QwQ).
Sharing tasks: This benchmark would benefit substantially from a more thorough discussion of example tasks so that the quality of the evaluation can be better understood. The task types, complexity, and sensitivity are included in the paper, but it is not possible to vet, for example, whether the geographical information retrieval tasks are high-quality without more detail. Moreover, details on the review process to ensure quality are not discussed (beyond the one sentence of "After initial cons
The main strengths of this paper can be summarized as follows:
- The article is written in plain, easy-to-understand language.
- The proposed dataset size is acceptable.
The main weaknesses and questions of the paper are listed below:
- Issues in the Introduction section:
  - Over-generalization of “artificial tools.” Claiming “many benchmarks rely on artificial tools” ignores widely used real-execution suites (e.g., SWE-bench runs tests on real repos; WebArena/BrowserGym execute live web actions; OpenDevin/AgentBench run OS/terminal tools). The statement needs qualifiers or counter-examples. Meanwhile, the abundance of prior similar work significantly dimi
1. The benchmark integrates 65 MCPs with 552 tools, with tool definitions exceeding 147k tokens. This is a leap from prior work that relied on simulated or mock tools, bringing evaluations much closer to real-world deployment scenarios.
2. The 250 tasks span realistic scenarios across information retrieval and system operations, with a three-level complexity taxonomy. The inclusion of time-sensitive tasks with real-time ground truth validation is particularly noteworthy.
3. The three evaluatio
Refer to questions section.
