MCP-Atlas: A Large-Scale Benchmark for Tool-Use Competency with Real MCP Servers

Chaithanya Bandi; Razvan-Gabriel Dumitru; Ben Hertzberg; Divyansh Agarwal; Geobio Boo; Tejas Polakam; Sami Hassaan; Jeff Da; HiJae Kim; Vipul Gupta; Manasi Sharma; Andrew Park; Martin Dimakis; Ernesto Gabriel Hernandez Montoya; Dan Rambado; Ivan Salazar; Rafael Cruz; MohammadHossein Rezaei; Chetan Rane; Ben Levin; Daniel Yue Zhang; Brad Kenstler; Bing Liu

arXiv:2602.00933 · cs.SE · May 21, 2026

TL;DR

MCP-Atlas is a benchmark of 1,000 human-written, real-world tasks spanning 36 real MCP servers and 220 tools. It evaluates how well large language models use tools in realistic, multi-step, cross-server workflows, with claim-level scoring and diagnostics.

Contribution

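The paper introduces a large-scale benchmark built on real MCP servers rather than mocks, with structured claim-level scoring and diagnostics. It targets three gaps in prior MCP evaluations: realistic multi-step workflows with cross-server orchestration, breadth across authentic servers, and reproducible scoring that does not depend on agent verbosity or style.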

Findings

01

Models achieve a pass rate of up to 82.2% at a claim-coverage threshold of 0.75.

02

63.3% of failures are due to cognitive issues, not tool-call errors.

03

High-performing models often fail after successful tool execution due to premature stopping.
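
One way to read the pass-rate figure in finding 01 is as a threshold on per-task claim coverage. The sketch below is a minimal illustration under stated assumptions, not the benchmark's own scoring code: it assumes coverage is the fraction of a task's reference claims judged satisfied, and that a task passes when coverage is at least 0.75. The names `TaskResult`, `claim_coverage`, and `pass_rate` are hypothetical, and how individual claims are judged is left outside the sketch.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical container: one boolean per reference claim for a task."""
    task_id: str
    claims_satisfied: list[bool]


def claim_coverage(result: TaskResult) -> float:
    """Fraction of reference claims judged satisfied (assumed definition)."""
    if not result.claims_satisfied:
        raise ValueError(f"task {result.task_id} has no claims")
    return sum(result.claims_satisfied) / len(result.claims_satisfied)


def pass_rate(results: list[TaskResult], threshold: float = 0.75) -> float:
    """Share of tasks whose coverage meets the threshold.

    The inclusive comparison (>=) is an assumption of this sketch.
    """
    if not results:
        raise ValueError("no results to score")
    passed = sum(1 for r in results if claim_coverage(r) >= threshold)
    return passed / len(results)


if __name__ == "__main__":
    demo = [
        TaskResult("t1", [True, True, True, False]),  # coverage 0.75
        TaskResult("t2", [True, False]),              # coverage 0.50
    ]
    print(pass_rate(demo))
```

Under this reading, an answer that satisfies three of four claims passes while one that satisfies half of them does not, which is why a model can execute every tool call successfully and still fail a task if it stops before covering enough claims.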

Abstract

The Model Context Protocol (MCP) is emerging as a standard interface through which large language model (LLM) agents discover and invoke external tools. However, existing MCP evaluations fall short along three key axes: realistic multi-step workflows with cross-server orchestration, breadth across authentic MCP servers rather than mocks, and structured, reproducible claim-level scoring disentangled from agent verbosity or style. We introduce MCP-Atlas, a benchmark for measuring tool-use competency against production MCP servers. MCP-Atlas contains 1,000 natural-language tasks written and verified by human experts spanning 36 real MCP servers and 220 tools. Prompts do not specify servers, tools, or parameters, requiring agents to identify relevant tools among semantically plausible distractors and to compose multi-step, cross-server workflows. Each task is scored with a claim-level…

Code & Models

Repositories

scaleapi/mcp-atlas (GitHub)
