WorldGUI: An Interactive Benchmark for Desktop GUI Automation from Any Starting Point

Henry Hengyuan Zhao; Kaiming Yang; Wendi Yu; Difei Gao; Mike Zheng Shou

arXiv:2502.08047·cs.AI·February 24, 2026

TL;DR

WorldGUI introduces a comprehensive benchmark and framework to evaluate and improve the robustness of GUI agents in diverse, real-world starting states, addressing a critical gap in current GUI automation research.

Contribution

The paper presents WorldGUI, a new benchmark with varied initial states for desktop and web applications, and a model-agnostic agent framework to enhance planning reliability in dynamic environments.

Findings

01

State-of-the-art GUI agents perform poorly under non-default conditions.

02

Benchmark reveals limited robustness and fragile planning behaviors.

03

Framework improves reliability in dynamic, real-world scenarios.

Abstract

Recent progress in GUI agents has substantially improved visual grounding, yet robust planning remains challenging, particularly when the environment deviates from a canonical initial state. In real applications, users often invoke assistance mid-workflow, where software may be partially configured, steps may have been executed in different orders, or the interface may differ from its default setup. Such task-state variability is pervasive but insufficiently evaluated in existing GUI benchmarks. To address this gap, we introduce WorldGUI, a benchmark covering ten widely used desktop and web applications with tasks instantiated under diverse, systematically constructed initial states. These variations capture realistic human-computer interaction settings and enable diagnostic evaluation of an agent's ability to recover, adapt plans, and handle non-default contexts. We further present…

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 2 · Confidence 3

Strengths

1. The authors introduce an interesting problem: varying the initial state of the task to determine how the agent performs. Aside from how thoroughly it is tested, the problem itself is interesting.
2. The authors introduce methods to vary the initial state of the task and benchmark models to show that performance drops. This is a surprising result. Further, the authors show that their method outperforms Plan and Act, highlighting the importance of reflection during planning and action ex

Weaknesses

1. The authors could have addressed the question of varying initial states by taking existing benchmarks like OSWorld or Windows Agent Arena and running their augmentation methods on these benchmarks. It is not clear what the advantage of introducing a new benchmark is. The videos, as well as ground truth plans, could be obtained from the successful completion of tasks from existing benchmarks. This would make it much easier to understand the core question of how varying initial states affect m

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. The main novelty is to evaluate GUI agents from many non-default initial states. In real use, a user often calls an assistant in the middle of a workflow or with the app in a non-default configuration. Existing benchmarks usually fix the initial state and therefore miss this difficulty. WorldGUI fills this gap with explicit pre-actions that alter the context or put the task in an intermediate state.
2. The proposed Planner Critic, Step Check, and Actor Critic form a simple verify then correct
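The verify-then-correct idea the reviewer describes can be illustrated with a minimal sketch. This is not the authors' implementation: the function `run_with_checks`, its callbacks, and the toy environment are all hypothetical, and the mapping to the paper's modules is an assumption (skipping already-satisfied steps stands in for step checking, and verify-then-retry stands in for action correction; plan critique is not modeled).

```python
from typing import Callable, Dict, List, Set


def run_with_checks(
    plan: List[str],
    already_done: Callable[[str], bool],
    execute: Callable[[str], None],
    verify: Callable[[str], bool],
    max_retries: int = 2,
) -> List[str]:
    """Run a plan from an arbitrary starting state (hypothetical sketch)."""
    log: List[str] = []
    for step in plan:
        # Step check (assumed role): skip steps the current state already satisfies.
        if already_done(step):
            log.append(f"skip:{step}")
            continue
        # Action correction (assumed role): execute, verify, retry on failure.
        for _ in range(max_retries + 1):
            execute(step)
            if verify(step):
                log.append(f"done:{step}")
                break
        else:
            # All attempts failed; stop rather than act on a broken state.
            log.append(f"failed:{step}")
            break
    return log


# Toy environment: the file is already open (a non-default initial state),
# and the first click on "click bold" silently misses.
state: Set[str] = {"open file"}
attempts: Dict[str, int] = {}


def already_done(step: str) -> bool:
    return step in state


def execute(step: str) -> None:
    attempts[step] = attempts.get(step, 0) + 1
    if step == "click bold" and attempts[step] == 1:
        return  # simulated missed click
    state.add(step)


def verify(step: str) -> bool:
    return step in state


log = run_with_checks(
    ["open file", "select text", "click bold", "save"],
    already_done, execute, verify,
)
print(log)
```

In this toy run the first step is skipped because the pre-existing state already satisfies it, and the missed click is recovered on the second attempt.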

Weaknesses

1. Verify-and-correct loops and plan self-critique have already appeared in recent agents for GUI and web, for example, Agent S, Agent S2, and also earlier reflection-style methods. WorldGUI Agent packages these ideas well, but the technical novelty beyond careful engineering of prompts and modules is modest. It would help to show larger gaps against recent native models like UI-TARS and ShowUI under matched conditions.
2. The abstract claims an improvement of 1.7% on WindowsAgentArena. Table

Reviewer 03 · Rating 2 · Confidence 4

Strengths

1. Existing GUI benchmarks assume default initial states; this work addresses that gap with systematic state diversification.
2. 611 tasks across 10 apps, each with five augmentations; the annotation and data-construction pipeline is described.

Weaknesses

1. Benchmark novelty relative to AssistGUI/OSWorld is mostly the “pre-action” augmentation; a more rigorous quantitative analysis of state diversity (e.g., edit distance between GT plans, UI tree variance) is missing.
2. Limited discussion of annotation quality: inter-annotator agreement and error rates for GT plans/pre-actions are not reported.
3. Recent UIExplorer (2025) and GUI-World (2025) datasets/agents are not compared or cited, though they target dynamic GUI exploration.
4. Planner uses
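One of the diversity measures suggested above, edit distance between ground-truth plans, can be sketched as a step-level Levenshtein distance. This is only an illustration of the reviewer's suggestion, not an analysis from the paper; the function name `plan_edit_distance` and the example plans are made up for the sketch.

```python
from typing import Sequence


def plan_edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance where each unit is a whole plan step."""
    prev = list(range(len(b) + 1))  # distance from empty prefix of a
    for i, step_a in enumerate(a, start=1):
        curr = [i]
        for j, step_b in enumerate(b, start=1):
            cost = 0 if step_a == step_b else 1
            curr.append(min(
                prev[j] + 1,         # delete step_a
                curr[j - 1] + 1,     # insert step_b
                prev[j - 1] + cost,  # keep or substitute
            ))
        prev = curr
    return prev[-1]


# Hypothetical plans for one task under three initial states.
default_plan = ["open file", "select text", "click bold", "save"]
mid_workflow_plan = ["select text", "click bold", "save"]  # file already open
reordered_plan = ["open file", "click bold", "select text", "save"]

print(plan_edit_distance(default_plan, mid_workflow_plan))  # 1
print(plan_edit_distance(default_plan, reordered_plan))     # 2
```

Averaging this distance over all pairs of augmented variants of a task would give one rough number for how far the augmentations move the ground-truth plan away from the default one.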

Reviewer 04 · Rating 6 · Confidence 3

Strengths

The paper presents a timely and well-motivated contribution by introducing the first GUI benchmark that emphasizes diverse and non-default initial states, addressing a key gap in existing evaluations. The proposed WorldGUI-Agent integrates three critical reasoning modules (planning critique, step validation, and action correction) into a coherent and practical framework that significantly outperforms prior baselines. The benchmark is comprehensive, covering realistic desktop and web environments

Weaknesses

1. The framework relies heavily on instructional videos to guide planning. However, the paper does not clarify how the quality, consistency, or completeness of these videos is ensured. Since the video is only used during the initial planning phase and is processed via Whisper into subtitles, it is unclear why raw video is needed at all. Could equivalent textual guidance achieve the same effect? A discussion of this design choice is missing.
2. While the proposed “WorldGUI-Agent: Thinking Before D

Code & Models

Repositories

showlab/WorldGUI
Official

Datasets

hhenryz/WorldGUI-Bench
dataset · 162 dl

Videos

WorldGUI: An Interactive Benchmark for Desktop GUI Automation from Any Starting Point

Taxonomy

Topics: Real-Time Systems Scheduling · Embedded Systems Design Techniques · Real-time simulation and control systems