TheMCPCompany: Creating General-purpose Agents with Task-specific Tools
Reza Esfandiarpoor, Vishwas Suryanarayanan, Stephen H. Bach, Vishal Chowdhary, Anthony Aue

TL;DR
The paper introduces TheMCPCompany, a benchmark for evaluating large language model agents using task-specific tools via MCP, highlighting current capabilities and challenges in complex real-world environments.
Contribution
It presents a new benchmark with over 18,000 tools for evaluating tool-calling agents and analyzes their performance in complex, real-world tasks.
Findings
Given perfect tool retrieval, tool-calling agents can outperform browser-based agents.
Smaller models struggle to utilize large tool sets effectively.
With tool retrieval, GPT-5 performs close to its own results with ground-truth tools.
Abstract
Since the introduction of the Model Context Protocol (MCP), the number of available tools for Large Language Models (LLMs) has increased significantly. These task-specific tool sets offer an alternative to general-purpose tools such as web browsers, while being easier to develop and maintain than GUIs. However, current general-purpose agents predominantly rely on web browsers for interacting with the environment. Here, we introduce TheMCPCompany, a benchmark for evaluating tool-calling agents on tasks that involve interacting with various real-world services. We use the REST APIs of these services to create MCP servers, which include over 18,000 tools. We also provide manually annotated ground-truth tools for each task. In our experiments, we use the ground-truth tools to show the potential of tool-calling agents for both improving performance and reducing costs, assuming perfect tool…
Peer Reviews
Decision · Submitted to ICLR 2026
- Strong engineering contribution: Implements a large, fully functional MCP benchmark with 18,000+ tools across enterprise services.
- Systematic evaluation pipeline: Builds upon TheAgentCompany with added realism (Azure integration).
- Empirical comparison: Includes quantitative cost and accuracy analysis between MCP and browser-based setups.
- Reproducibility commitment: The authors intend to release code, MCP servers, and Terraform scripts.
- Limited model coverage: Only closed-source models from OpenAI and Anthropic are evaluated; Gemini and open-source models (e.g., DeepSeek-V3, Qwen3, Llama) are excluded. This limits generalizability.
- Lack of retrieval comparison: The paper does not directly compare MCPAgent with traditional retrieval-based methods, making it unclear whether MCPAgent offers genuine advantages.
- Narrow task scope: The actual benchmark tasks are mainly Azure tasks, and other major components (e.g., TheAgen
1. The MCPAgent incorporates 18,000 tools and introduces a gateway MCP server to retrieve the tools relevant to each user query, thereby improving performance and reducing operational costs.
2. This paper evaluates the MCPAgent on challenging tasks that reflect the complexity of real scenarios.
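The retrieval step the reviewer describes, where a gateway returns the tools most relevant to a query, can be sketched roughly as below. This is a minimal illustration, not the authors' implementation: the tool names and descriptions are invented, the function `find_tools` is hypothetical, and a toy bag-of-words vector stands in for a real embedding model (a later review quotes the paper as using OpenAI's text-embedding-3-large).

```python
import math
import re
from collections import Counter

# Hypothetical tool catalog; names and descriptions are invented for illustration.
TOOLS = {
    "azure_vm_restart": "Restart an Azure virtual machine by resource group and name",
    "gitlab_create_issue": "Create a new issue in a GitLab project",
    "chat_send_message": "Send a chat message to a user or channel",
}


def embed(text):
    """Toy bag-of-words vector; an assumed stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def find_tools(query, tools=TOOLS, k=2):
    """Return the names of the k tools whose descriptions best match the query."""
    q = embed(query)
    ranked = sorted(
        tools,
        key=lambda name: cosine(q, embed(tools[name])),
        reverse=True,
    )
    return ranked[:k]


if __name__ == "__main__":
    print(find_tools("restart the azure virtual machine", k=1))
```

In a sketch like this, the agent would see only the top-k tool descriptions rather than the full catalog, which is where the claimed cost reduction would come from. Retrieval quality then depends heavily on the embedding model, which is the design choice a later review asks the authors to justify.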
1. Although constructing a standardized set of MCP tools requires substantial engineering effort, the novelty of this paper appears to be limited.
2. Some experiment setups are confusing. For example, in Table 2, the comparison between the **MCPAgent** and the **Oracle Tool Set** supports the claimed advantages of introducing a gateway MCP server. However, it is unclear why the **MCPAgent** is also compared with the **browser-based agent**, given that their functionalities and supported capabili
1. This paper tackles a very relevant and interesting angle: understanding the capabilities of general-purpose agents when they are equipped with large, heterogeneous tool collections. Studying how LLMs perform as the number of available tools scales is highly realistic and timely, given the fast-evolving ecosystem of MCP tools.
2. The writing and motivation are experienced and clear, making the paper's contributions easy to grasp. The design of the benchmark is intuitive and well-justified; it
The main weakness lies in the insufficient details provided for the MCPAgent's tool-finding function. This module is central to the paper's investigation of agents in large-scale tool environments, yet its implementation is only briefly described. Specifically, the choice of the embedding model is a critical design decision that could significantly impact retrieval quality and, consequently, the overall agent performance. The authors state, "We use OpenAI’s text-embedding-3-large model," but the
Taxonomy
Topics: Multi-Agent Systems and Negotiation · Natural Language Processing Techniques · Topic Modeling
