Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale
Rogerio Bonatti, Dan Zhao, Francesco Bonacci, Dillon Dupont, Sara Abdali, Yinheng Li, Yadong Lu, Justin Wagle, Kazuhito Koishida, Arthur Bucker, Lawrence Jang, Zack Hui

TL;DR
This paper introduces Windows Agent Arena, a scalable, multi-modal environment for evaluating OS agents on Windows, and demonstrates it with a new agent, Navi, which achieves a 19.5% success rate on Windows tasks.
Contribution
The paper presents a novel, scalable Windows-based benchmark environment for multi-modal OS agents, enabling rapid evaluation and comparison of agent capabilities.
Findings
Navi achieved a 19.5% success rate on Windows tasks.
Benchmark evaluation can be completed in as little as 20 minutes.
Navi also performs well on Mind2Web, a web-based benchmark.
Abstract
Large language models (LLMs) show remarkable potential to act as computer agents, enhancing human productivity and software accessibility in multi-modal tasks that require planning and reasoning. However, measuring agent performance in realistic environments remains a challenge since: (i) most benchmarks are limited to specific modalities or domains (e.g. text-only, web navigation, Q&A, coding) and (ii) full benchmark evaluations are slow (on order of magnitude of days) given the multi-step sequential nature of tasks. To address these challenges, we introduce the Windows Agent Arena: a reproducible, general environment focusing exclusively on the Windows operating system (OS) where agents can operate freely within a real Windows OS and use the same wide range of applications, tools, and web browsers available to human users when solving tasks. We adapt the OSWorld framework (Xie et al.,…
Taxonomy
Topics: Multi-Agent Systems and Negotiation · Mobile Agent-Based Network Management · Business Process Modeling and Analysis
