WorldGUI: An Interactive Benchmark for Desktop GUI Automation from Any Starting Point
Henry Hengyuan Zhao, Kaiming Yang, Wendi Yu, Difei Gao, Mike Zheng Shou

TL;DR
WorldGUI introduces a comprehensive benchmark and framework to evaluate and improve the robustness of GUI agents in diverse, real-world starting states, addressing a critical gap in current GUI automation research.
Contribution
The paper presents WorldGUI, a new benchmark with varied initial states for desktop and web applications, and a model-agnostic agent framework to enhance planning reliability in dynamic environments.
Findings
State-of-the-art GUI agents perform poorly when tasks start from non-default states.
The benchmark reveals limited robustness and fragile planning behavior in these agents.
The proposed framework improves reliability in dynamic, real-world scenarios.
Abstract
Recent progress in GUI agents has substantially improved visual grounding, yet robust planning remains challenging, particularly when the environment deviates from a canonical initial state. In real applications, users often invoke assistance mid-workflow, where software may be partially configured, steps may have been executed in different orders, or the interface may differ from its default setup. Such task-state variability is pervasive but insufficiently evaluated in existing GUI benchmarks. To address this gap, we introduce WorldGUI, a benchmark covering ten widely used desktop and web applications with tasks instantiated under diverse, systematically constructed initial states. These variations capture realistic human-computer interaction settings and enable diagnostic evaluation of an agent's ability to recover, adapt plans, and handle non-default contexts. We further present…
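The evaluation setup described in the abstract can be pictured as pairing each task with several initial-state variants and reporting success per variant. The following is a minimal sketch under stated assumptions: `TaskVariant`, `success_rate_by_variant`, the variant labels, and the toy outcomes are all hypothetical and are not the benchmark's actual schema or results.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class TaskVariant:
    task_id: str
    variant: str                  # hypothetical label for the initial state
    pre_actions: Tuple[str, ...]  # steps applied before the agent starts


def success_rate_by_variant(
    results: Iterable[Tuple[TaskVariant, bool]],
) -> Dict[str, float]:
    """Group (variant, succeeded) outcomes by variant label and average them."""
    totals = defaultdict(int)
    wins = defaultdict(int)
    for task_variant, succeeded in results:
        totals[task_variant.variant] += 1
        wins[task_variant.variant] += int(succeeded)
    return {name: wins[name] / totals[name] for name in totals}


# Toy outcomes, invented purely to exercise the function.
results = [
    (TaskVariant("t1", "default", ()), True),
    (TaskVariant("t2", "default", ()), True),
    (TaskVariant("t1", "intermediate", ("open Insert tab",)), True),
    (TaskVariant("t2", "intermediate", ("enable dark mode",)), False),
]
rates = success_rate_by_variant(results)
print(rates)
```

Reporting the rate per variant, rather than one pooled number, is what makes a drop under non-default starting states visible.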
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. The authors introduce an interesting problem of varying the initial state of the task to determine how the agent would perform. Aside from how thoroughly it is tested, the problem itself is interesting. 2. The authors introduce methods to vary the initial state of the task and benchmark models to show that performance drops. This is a surprising result. Further, the authors show that their method outperforms Plan and Act, highlighting the importance of reflection during planning and action ex
1. The authors could have addressed the question of varying initial states by taking existing benchmarks like OSWorld or Windows Agent Arena and running their augmentation methods on these benchmarks. It is not clear what the advantage of introducing a new benchmark is. The videos, as well as ground-truth plans, could be obtained from the successful completion of tasks from existing benchmarks. This would make it much easier to understand the core question of how varying initial states affect m
1. The main novelty is to evaluate GUI agents from many non-default initial states. In real use, a user often calls an assistant in the middle of a workflow or with the app in a non-default configuration. Existing benchmarks usually fix the initial state and therefore miss this difficulty. WorldGUI fills this gap with explicit pre-actions that alter the context or put the task in an intermediate state. 2. The proposed Planner Critic, Step Check, and Actor Critic form a simple verify then correct
1. Verify-and-correct loops and plan self-critique have already appeared in recent agents for GUI and web, for example, Agent S, Agent S2, and also earlier reflection-style methods. WorldGUI Agent packages these ideas well, but the technical novelty beyond careful engineering of prompts and modules is modest. It would help to show larger gaps against recent native models like UI-TARS and ShowUI under matched conditions. 2. The abstract claims an improvement of 1.7% on WindowsAgentArena. Table
1. Existing GUI benchmarks assume default initial states; this work addresses that gap with systematic state diversification. 2. 611 tasks across 10 apps, each with five augmentations; the annotation and data-construction pipeline is described.
1. Benchmark novelty relative to AssistGUI/OSWorld is mostly the “pre-action” augmentation; more rigorous quantitative analysis of state diversity (e.g., edit distance between GT plans, UI tree variance) is missing. 2. Limited discussion of annotation quality: inter-annotator agreement and error rates for GT plans/pre-actions are not reported. 3. Recent UIExplorer (2025) and GUI-World (2025) datasets/agents are not compared or cited, though they target dynamic GUI exploration. 4. Planner uses
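The edit-distance analysis requested in point 1 of this review could be computed in several ways. One plausible reading is a step-level Levenshtein distance between two ground-truth plans, sketched below. The function name `plan_edit_distance` and the example plan steps are hypothetical, and exact string equality between steps is a simplifying assumption.

```python
from typing import Sequence


def plan_edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance where each plan step counts as one symbol."""
    prev = list(range(len(b) + 1))  # distance from an empty prefix of a
    for i, step_a in enumerate(a, start=1):
        curr = [i]  # deleting the first i steps of a
        for j, step_b in enumerate(b, start=1):
            cost = 0 if step_a == step_b else 1
            curr.append(min(
                prev[j] + 1,         # delete step_a
                curr[j - 1] + 1,     # insert step_b
                prev[j - 1] + cost,  # keep or substitute
            ))
        prev = curr
    return prev[-1]


# Invented plans: the second starts from a state where the file is already open.
default_plan = ["open file", "click Insert", "add table"]
augmented_plan = ["click Insert", "add table"]
print(plan_edit_distance(default_plan, augmented_plan))
```

Averaging this distance over all pairs of default and augmented plans would give one rough number for how far the augmentations move the required plan.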
The paper presents a timely and well-motivated contribution by introducing the first GUI benchmark that emphasizes diverse and non-default initial states, addressing a key gap in existing evaluations. The proposed WorldGUI-Agent integrates three critical reasoning modules (planning critique, step validation, and action correction) into a coherent and practical framework that significantly outperforms prior baselines. The benchmark is comprehensive, covering realistic desktop and web environments
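The three modules named in this review (planning critique, step validation, action correction) suggest a verify-then-correct control loop. The sketch below is a generic illustration of such a loop and not the authors' implementation. Every name in it (`run_with_checks`, `critique_plan`, `step_needed`, `act`, `step_done`, `max_retries`) is hypothetical, and the assumption that step validation skips steps the current state already satisfies is ours.

```python
from typing import Callable, List


def run_with_checks(
    plan: List[str],
    critique_plan: Callable[[List[str]], List[str]],
    step_needed: Callable[[str], bool],
    act: Callable[[str], None],
    step_done: Callable[[str], bool],
    max_retries: int = 2,
) -> List[str]:
    """Execute a plan with checks before and after each action; return a log."""
    log = []
    for step in critique_plan(plan):      # 1) revise the plan before acting
        if not step_needed(step):         # 2) skip steps already satisfied
            log.append(f"skip: {step}")
            continue
        for _ in range(1 + max_retries):  # 3) act, verify, retry on failure
            act(step)
            if step_done(step):
                log.append(f"done: {step}")
                break
        else:
            log.append(f"failed: {step}")
    return log


# Toy environment: a set of completed steps; the app is already open.
state = {"open app"}
log = run_with_checks(
    ["open app", "click Insert"],
    critique_plan=lambda p: p,            # no-op critic for the example
    step_needed=lambda s: s not in state,
    act=state.add,
    step_done=lambda s: s in state,
)
print(log)
```

In this toy run the first step is skipped because the starting state already satisfies it, which is the situation a non-default initial state creates.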
1. The framework heavily relies on instructional videos to guide planning. However, the paper does not clarify how the quality, consistency, or completeness of these videos is ensured. Since the video is only used during the initial planning phase and is processed via Whisper into subtitles, it's unclear why raw video is needed at all—could equivalent textual guidance achieve the same effect? A discussion on this design choice is missing. 2. While the proposed “WorldGUI-Agent: Thinking Before D
Taxonomy
Topics: Real-Time Systems Scheduling · Embedded Systems Design Techniques · Real-time simulation and control systems
