WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments

Jinchao Li; Yunxin Li; Chenrui Zhao; Zhenran Xu; Baotian Hu; Min Zhang

arXiv:2604.27776·cs.AI·May 1, 2026


TL;DR

WindowsWorld is a new benchmark designed to evaluate GUI agents in complex, multi-application professional workflows, revealing significant performance gaps in current models.

Contribution

The paper introduces WindowsWorld, a process-centric benchmark for multi-application GUI tasks, including a comprehensive dataset and evaluation framework for assessing agent capabilities.

Findings

01 Leading models achieve a success rate below 21% on multi-application tasks

02 Models struggle with conditional reasoning across three or more applications

03 Models often fail tasks even after exceeding the human step limit

Abstract

While GUI agents have shown impressive capabilities on common computer-use benchmarks such as OSWorld, current benchmarks mainly focus on isolated, single-application tasks. This overlooks a critical real-world requirement: coordinating across multiple applications to accomplish complex, profession-specific workflows. To bridge this gap, we present WindowsWorld, a computer-use benchmark for cross-application workflows, designed to systematically assess GUI agents on complex multi-step tasks that mirror real-world professional activities. Our methodology uses a multi-agent framework, steered by 16 occupations, to generate tasks at four difficulty levels with intermediate inspection; the tasks are then refined by human review and executed in a simulated environment. The resulting benchmark contains 181 tasks with an average of 5.0 sub-goals across 17 common desktop applications, of which 78%…
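To make the idea of process-centric evaluation with intermediate inspection concrete, the sketch below scores a run by its per-sub-goal checkpoints rather than only by the final state. It is an illustration under stated assumptions, not the paper's scoring code: the names `TaskResult`, `subgoal_completion`, `task_success`, and `summarize` are hypothetical, and the rule that a task succeeds only when every sub-goal passes is an assumption made for this example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TaskResult:
    """Outcome of one agent run (hypothetical structure, not from the paper)."""
    task_id: str
    subgoals_passed: List[bool]  # one flag per intermediate checkpoint


def subgoal_completion(result: TaskResult) -> float:
    """Fraction of intermediate checkpoints the agent satisfied."""
    if not result.subgoals_passed:
        return 0.0
    return sum(result.subgoals_passed) / len(result.subgoals_passed)


def task_success(result: TaskResult) -> bool:
    """Assumed rule: a task succeeds only if every sub-goal passed."""
    return bool(result.subgoals_passed) and all(result.subgoals_passed)


def summarize(results: List[TaskResult]) -> Dict[str, float]:
    """Aggregate end-to-end success and partial progress over all runs."""
    n = len(results)
    if n == 0:
        return {"success_rate": 0.0, "mean_subgoal_completion": 0.0}
    return {
        "success_rate": sum(task_success(r) for r in results) / n,
        "mean_subgoal_completion": sum(subgoal_completion(r) for r in results) / n,
    }


if __name__ == "__main__":
    runs = [
        TaskResult("demo-1", [True, True]),
        TaskResult("demo-2", [True, False, False, False]),
    ]
    print(summarize(runs))
```

Reporting both numbers separates an agent that fails at the first step from one that completes most of a workflow before failing, a distinction that a final-state-only check cannot make.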


Code & Models

Repositories

HITsz-TMG/WindowsWorld
github

Datasets

HalfCooler/WindowsWorld
dataset· 51 dl
