TL;DR
OSWorld-MCP is a comprehensive benchmark that evaluates multimodal agents' tool invocation, GUI operation, and decision-making in real-world scenarios, revealing current limitations and guiding future improvements.
Contribution
It introduces the first fair, automated benchmark for assessing MCP tool invocation and provides a curated set of high-quality tools for evaluation.
Findings
Tool invocation generally improves task success rates.
Current models have low tool invocation rates (~36%).
The benchmark highlights the need for enhanced tool invocation capabilities.
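The two quantitative findings above can be illustrated with a small sketch. This summary does not define the paper's metric, so the code assumes the simplest reading: the tool invocation rate is the fraction of task runs in which the agent made at least one tool call. The `TaskRun` record, its fields, and the sample data are hypothetical and are not taken from the benchmark.

```python
from dataclasses import dataclass


@dataclass
class TaskRun:
    """Hypothetical per-task log record (not the benchmark's real schema)."""
    task_id: str
    tool_calls: int  # number of tool invocations the agent made
    success: bool    # whether the task was judged complete


def invocation_rate(runs):
    """Fraction of runs with at least one tool call (assumed definition)."""
    if not runs:
        return 0.0
    return sum(r.tool_calls > 0 for r in runs) / len(runs)


def success_rate(runs):
    """Fraction of runs that ended in success."""
    if not runs:
        return 0.0
    return sum(r.success for r in runs) / len(runs)


# Made-up sample data, for illustration only.
runs = [
    TaskRun("t1", 2, True),
    TaskRun("t2", 0, False),
    TaskRun("t3", 1, True),
    TaskRun("t4", 0, True),
    TaskRun("t5", 0, False),
]

with_tools = [r for r in runs if r.tool_calls > 0]
without_tools = [r for r in runs if r.tool_calls == 0]

print(f"invocation rate:       {invocation_rate(runs):.2f}")
print(f"success with tools:    {success_rate(with_tools):.2f}")
print(f"success without tools: {success_rate(without_tools):.2f}")
```

Splitting runs by whether a tool was called is one simple way to check the first finding against a set of logs. The paper's own metric may be defined differently.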
Abstract
With advances in decision-making and reasoning capabilities, multimodal agents show strong potential in computer application scenarios. Past evaluations have mainly assessed GUI interaction skills, while tool invocation abilities, such as those enabled by the Model Context Protocol (MCP), have been largely overlooked. Comparing agents with integrated tool invocation to those evaluated only on GUI interaction is inherently unfair. We present OSWorld-MCP, the first comprehensive and fair benchmark for assessing computer-use agents' tool invocation, GUI operation, and decision-making abilities in a real-world environment. We design a novel automated code-generation pipeline to create tools and combine them with a curated selection from existing tools. Rigorous manual validation yields 158 high-quality tools (covering 7 common applications), each verified for correct functionality,…
Peer Reviews
Decision: ICLR 2026 Poster
The paper is original in extending OSWorld to evaluate MCP-based tool invocation, a capability overlooked in prior benchmarks. Its methodology is solid, combining automated tool generation with manual validation and extensive experiments on several state-of-the-art models. The presentation is clear and well-structured. The significance is high—OSWorld-MCP offers a realistic, open benchmark that will benefit research on tool-using multimodal agents. Its usefulness will further grow as more GUI ap
The benchmark’s scope is somewhat limited—its 158 tools cover mainly seven desktop apps, leaving out broader web or cross-platform settings. The tool-generation pipeline lacks quantitative evaluation of success and failure cases. Some tasks remain similar to the original OSWorld, reducing novelty in task design. While TIR and ACS are useful, their interpretation could be clearer. Finally, the benchmark’s long-term value depends on wider MCP adoption across GUI applications.
1. **Creation of a High-Quality and Rigorously Validated Toolset**: The paper's primary contribution is the construction of a novel benchmark centered around a "curated collection of 158 high-quality tools". The reliability of this toolset is strongly supported by a meticulous creation process, which combines a "novel automated code-generation pipeline" with "rigorous manual validation". This validation was not superficial; each tool was independently assessed by at least two experienced reviewe
1. **Incremental Novelty Built on an Existing Framework**: While the work is a significant engineering effort, its core contribution is presented as an extension to the pre-existing OSWorld benchmark. The paper states it is "Built upon a widely used... environment OSWorld". Consequently, the foundational novelty could be viewed as somewhat limited, as it enhances an established platform rather than introducing a completely new paradigm for agent evaluation.
2. **Insufficient Dep
- The automated pipeline (generation, filtering, wrapping) for tool creation is well-structured and uses both LLM-based and manual verification steps.
- Covers multiple proprietary and open models under consistent settings, providing valuable comparative insights.
- While the benchmark creation idea is strong, the analysis mostly confirms intuitive findings (e.g., higher TIR means higher accuracy).
- The paper’s main contribution (augmenting OSWorld with MCP tool invocation) is meaningful, but I’m concerned that OSWorld’s task design predates MCP and may not naturally align with tool-based workflows. The authors partly mitigate this by revalidating 361 tasks, identifying 250 “tool-beneficial” cases, and manually curating 158 tools, which shows care in adaptation.
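The task counts quoted in the review above imply a rough share of tasks where tools can help. The sketch below works through that arithmetic, assuming the 250 tool-beneficial cases are a subset of the 361 revalidated tasks. The review implies this but does not state it outright. The variable names are illustrative.

```python
# Counts quoted in the review; the subset relationship is an assumption.
revalidated_tasks = 361
tool_beneficial_tasks = 250

# Share of revalidated tasks where tool use is expected to help.
tool_beneficial_share = tool_beneficial_tasks / revalidated_tasks

# Remaining tasks, where GUI-only operation is presumably sufficient.
non_tool_tasks = revalidated_tasks - tool_beneficial_tasks

print(f"tool-beneficial share: {tool_beneficial_share:.1%}")
print(f"remaining tasks:       {non_tool_tasks}")
```

Under that assumption, roughly seven in ten tasks are tool-beneficial, which gives context for the reviewer's concern about how well the original task design fits tool-based workflows.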
