TL;DR
MCP-RADAR is a comprehensive benchmark designed to evaluate large language models' ability to utilize external tools within the MCP framework across multiple domains, emphasizing objective performance metrics.
Contribution
This paper introduces MCP-RADAR, the first standardized, multi-domain benchmark for assessing LLMs' tool-use capabilities under MCP, with detailed evaluation metrics.
Findings
Distinct capability profiles for different LLMs.
Trade-off observed between accuracy and efficiency.
Benchmark provides actionable insights for LLM and tool development.
Abstract
As Large Language Models (LLMs) evolve from passive text generators to active reasoning agents capable of interacting with external tools, the Model Context Protocol (MCP) has emerged as a key standardized framework for dynamic tool discovery and orchestration. Despite its widespread industry adoption, existing evaluation methods do not adequately assess tool utilization capabilities under this new paradigm. To address this gap, this paper introduces MCP-RADAR, the first comprehensive benchmark specifically designed to evaluate LLM performance within the MCP framework. MCP-RADAR features a challenging dataset of 507 tasks spanning six domains: mathematical reasoning, web search, email, calendar, file management, and terminal operations. It quantifies performance based on two primary criteria: answer correctness and operational accuracy. To closely emulate real-world usage, our…
Peer Reviews
Decision·Submitted to ICLR 2026
MCP usage is a highly meaningful evaluation direction. The authors have constructed a diverse and challenging benchmark on which even advanced models perform poorly. Moreover, their analysis of model behavior provides substantial value for future research.
- The authors did not compare against a baseline that does not use tool calling, making it unreasonable to claim a causal relationship between model performance and MCP tool usage.
- Some tasks have very few evaluation examples (only 28 for calendar), which introduces instability in the evaluation. Repeated runs could yield significant variance.
- Figure 4 and Table 3 are highly redundant, and neither provides an average score to summarize each model's overall performance across tasks.
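To put the sample-size concern in rough numbers, the sketch below computes the standard error of an accuracy estimate for a 28-task domain versus the full 507-task set. It assumes each task is an independent pass/fail trial and uses a hypothetical true accuracy of 0.5 (the worst case for variance). Neither assumption comes from the paper, and the function names are illustrative only.

```python
import math


def accuracy_standard_error(p: float, n: int) -> float:
    """Standard error of an accuracy estimate over n independent pass/fail tasks.

    Assumes a binomial model: each task succeeds with probability p.
    """
    return math.sqrt(p * (1.0 - p) / n)


def normal_ci_half_width(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of the normal-approximation confidence interval (z=1.96 ~ 95%)."""
    return z * accuracy_standard_error(p, n)


# Hypothetical true accuracy; 0.5 maximizes the binomial variance.
ASSUMED_ACCURACY = 0.5

for n_tasks in (28, 507):
    half_width = normal_ci_half_width(ASSUMED_ACCURACY, n_tasks)
    print(f"n={n_tasks}: 95% CI half-width is about {half_width:.3f}")
```

Under these assumptions, the interval for a 28-task domain is several times wider than for the full dataset, which is consistent with the reviewer's worry that per-domain scores may shift noticeably between runs.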
The strengths of the paper can be summarized as follows:
- The paper's code and data are open-sourced, which benefits community progress in this area and helps ensure the work's credibility and reproducibility.
- Proposes a benchmark spanning six domains and two task types.
The weaknesses of the paper are listed as follows:
- Introduction Section:
  - Causal overclaim about MCP. The authors say the shift to tool-using agents was “significantly accelerated by the advent of the Model Context Protocol (MCP)” and that MCP is “a standardized framework for dynamic tool discovery and orchestration.” That's disputable: tool use predates MCP (e.g., function calling, ReAct/Toolformer, agent frameworks), and MCP is one proposal among several, not an established, consensus “
1. This paper evaluates the tool-use capability of LLM Agents with MCP.
2. The detailed analysis of errors (tool-use errors, reasoning errors, information synthesis errors) provides useful insights into current LLM Agent limitations.
Despite being positioned as an MCP-focused benchmark, MCP-RADAR's overall structure largely resembles that of prior tool-use evaluation benchmarks.
