MCP-RADAR: A Multi-Dimensional Benchmark for Evaluating Tool Use Capabilities in Large Language Models

Xuanqi Gao; Siyi Xie; Juan Zhai; Shiqing Ma; Chao Shen

arXiv:2505.16700·cs.AI·October 14, 2025

TL;DR

MCP-RADAR is a comprehensive benchmark designed to evaluate large language models' ability to utilize external tools within the MCP framework across multiple domains, emphasizing objective performance metrics.

Contribution

This paper introduces MCP-RADAR, the first standardized, multi-domain benchmark for assessing LLMs' tool use capabilities under MCP, with detailed evaluation metrics.

Findings

01

Distinct capability profiles for different LLMs.

02

Trade-off observed between accuracy and efficiency.

03

Benchmark provides actionable insights for LLM and tool development.

Abstract

As Large Language Models (LLMs) evolve from passive text generators to active reasoning agents capable of interacting with external tools, the Model Context Protocol (MCP) has emerged as a key standardized framework for dynamic tool discovery and orchestration. Despite its widespread industry adoption, existing evaluation methods do not adequately assess tool utilization capabilities under this new paradigm. To address this gap, this paper introduces MCP-RADAR, the first comprehensive benchmark specifically designed to evaluate LLM performance within the MCP framework. MCP-RADAR features a challenging dataset of 507 tasks spanning six domains: mathematical reasoning, web search, email, calendar, file management, and terminal operations. It quantifies performance based on two primary criteria: answer correctness and operational accuracy. To closely emulate real-world usage, our…

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 5

Strengths

Evaluating MCP usage is a highly meaningful direction. The authors have constructed a diverse and challenging benchmark on which even advanced models perform poorly. Moreover, their analysis of model behavior provides substantial value for future research.

Weaknesses

- The authors did not compare against a baseline that does not use tool calling, so it is unreasonable to claim a causal relationship between model performance and MCP tool usage.
- Some tasks have very few evaluation examples (only 28 for calendar), which makes the evaluation unstable. Repeated runs could yield significant variance.
- Figure 4 and Table 3 are highly redundant, and neither provides an average score summarizing each model's overall performance across tasks.
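The sample-size concern can be made concrete with a confidence interval on a measured accuracy. The following is a minimal sketch, not part of the paper or the review: it computes a 95% Wilson score interval for a binomial proportion. Only the count of 28 calendar tasks comes from the review; the helper name `wilson_interval` and the 14 successes are hypothetical values chosen purely for illustration.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# 28 is the calendar task count cited in the review.
# 14 successes (50% accuracy) is a hypothetical figure, not a reported result.
low, high = wilson_interval(successes=14, n=28)
print(f"n=28:  [{low:.3f}, {high:.3f}]")

# Same hypothetical accuracy on a tenfold larger sample, for comparison.
low_big, high_big = wilson_interval(successes=140, n=280)
print(f"n=280: [{low_big:.3f}, {high_big:.3f}]")
```

Under these assumed numbers, the interval for 28 tasks spans roughly 0.33 to 0.67, so two models whose measured calendar accuracies differ by ten points or more could still be statistically indistinguishable on that domain.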

Reviewer 02 · Rating 2 · Confidence 5

Strengths

The strengths of the paper can be summarized as follows:
- The paper's code and data are open-sourced, which benefits community progress in this area and helps ensure the work's credibility and reproducibility.
- It proposes a benchmark spanning six domains and two task types.

Weaknesses

The weaknesses of the paper are listed as follows:
- Introduction section:
  - Causal overclaim about MCP. The authors say the shift to tool-using agents was “significantly accelerated by the advent of the Model Context Protocol (MCP)” and that MCP is “a standardized framework for dynamic tool discovery and orchestration.” That is disputable: tool use predates MCP (e.g., function calling, ReAct/Toolformer, agent frameworks), and MCP is one proposal among several, not an established, consensus “

Reviewer 03 · Rating 4 · Confidence 4

Strengths

1. This paper evaluates the tool-use capability of LLM agents with MCP.
2. The detailed analysis of errors (tool-use errors, reasoning errors, information synthesis errors) provides useful insights into current LLM agent limitations.

Weaknesses

Despite being positioned as an MCP-focused benchmark, MCP-RADAR's overall structure largely resembles that of prior tool-use evaluation benchmarks.
