MCP-Atlas: A Large-Scale Benchmark for Tool-Use Competency with Real MCP Servers
Chaithanya Bandi, Razvan-Gabriel Dumitru, Ben Hertzberg, Divyansh Agarwal, Geobio Boo, Tejas Polakam, Sami Hassaan, Jeff Da, HiJae Kim, Vipul Gupta, Manasi Sharma, Andrew Park, Martin Dimakis, Ernesto Gabriel Hernandez Montoya, Dan Rambado, Ivan Salazar, Rafael Cruz

TL;DR
MCP-Atlas is a comprehensive benchmark with 1,000 real-world tasks across 36 MCP servers designed to evaluate large language models' tool-use capabilities in realistic, multi-step, cross-server workflows with detailed scoring and diagnostics.
Contribution
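It introduces a large-scale benchmark built on real MCP servers, with claim-level scoring and diagnostics. It targets three gaps in prior MCP evaluations: realistic multi-step workflows with cross-server orchestration, breadth across authentic servers rather than mocks, and reproducible scoring that does not depend on agent verbosity or style.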
Findings
Models achieve pass rates of up to 82.2%, where a task counts as passed at a claim coverage of at least 0.75.
63.3% of failures are due to cognitive issues, not tool-call errors.
High-performing models often fail after successful tool execution due to premature stopping.
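To make the pass-rate figure above concrete, here is a minimal sketch of how a claim-level pass rate could be computed. It assumes that claim coverage is the mean per-claim credit for a task and that a task passes when coverage reaches the 0.75 threshold quoted in the findings. The function names, the credit values, and the example data are hypothetical and are not taken from the paper.

```python
from typing import Sequence

# Threshold quoted in the findings; everything else here is an assumption.
PASS_THRESHOLD = 0.75


def claim_coverage(claim_scores: Sequence[float]) -> float:
    """Return the mean credit over one task's rubric claims.

    Each entry is the credit awarded for one claim, in [0, 1]
    (assumed: 1.0 satisfied, 0.0 not satisfied, values between for partial credit).
    """
    if not claim_scores:
        raise ValueError("a task needs at least one claim")
    return sum(claim_scores) / len(claim_scores)


def pass_rate(tasks: Sequence[Sequence[float]],
              threshold: float = PASS_THRESHOLD) -> float:
    """Return the fraction of tasks whose claim coverage meets the threshold."""
    if not tasks:
        raise ValueError("need at least one task")
    passed = sum(claim_coverage(scores) >= threshold for scores in tasks)
    return passed / len(tasks)


# Hypothetical per-claim credits for three tasks.
example_tasks = [
    [1.0, 1.0, 1.0, 1.0],  # every claim satisfied
    [1.0, 1.0, 1.0, 0.0],  # three of four claims satisfied
    [1.0, 0.5, 0.0, 0.0],  # mostly unsatisfied
]

if __name__ == "__main__":
    for scores in example_tasks:
        print(f"coverage = {claim_coverage(scores):.3f}")
    print(f"pass rate = {pass_rate(example_tasks):.3f}")
```

Under these assumptions, a model that executes every tool call correctly but stops before stating all required facts would still lose credit, since only claims present in the final answer count toward coverage.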
Abstract
The Model Context Protocol (MCP) is emerging as a standard interface through which large language model (LLM) agents discover and invoke external tools. However, existing MCP evaluations fall short along three key axes: realistic multi-step workflows with cross-server orchestration, breadth across authentic MCP servers rather than mocks, and structured, reproducible claim-level scoring disentangled from agent verbosity or style. We introduce MCP-Atlas, a benchmark for measuring tool-use competency against production MCP servers. MCP-Atlas contains 1,000 natural-language tasks written and verified by human experts spanning 36 real MCP servers and 220 tools. Prompts do not specify servers, tools, or parameters, requiring agents to identify relevant tools among semantically plausible distractors and to compose multi-step, cross-server workflows. Each task is scored with a claim-level…
