WinDeskGround: A Benchmark for Robust GUI Grounding in Complex Multi-Window Desktop Environments

Haoren Zhao; Tianyi Chen; Zhen Wang

arXiv:2605.16402 · cs.CV · May 19, 2026

TL;DR

WinDeskGround introduces a benchmark and framework for evaluating the robustness of GUI grounding in complex multi-window desktop environments, addressing real-world challenges like occlusion and clutter.

Contribution

The paper presents a parametric synthesis framework and a diverse dataset for evaluating and improving GUI grounding robustness in realistic desktop scenarios.

Findings

01. Top-tier MLLMs perform well in simple settings but struggle with occlusion.

02. WinDeskGround reveals significant accuracy drops under partial occlusion.

03. The benchmark facilitates assessing and advancing GUI agent robustness.

Abstract

Multimodal Large Language Models (MLLMs) have revolutionized GUI automation, yet their efficacy is largely established on idealized, single-layer interfaces. This paper identifies a critical reliability gap: state-of-the-art agents face distinct robustness challenges in real-world desktop environments characterized by multi-window stacking, occlusion, and visual clutter. To address this, we introduce WinDeskGround, a novel benchmark and synthesis framework tailored for evaluating GUI grounding robustness. Unlike static datasets, our framework parametrically generates complex desktop scenarios by controlling window occlusion, layout density, and semantic similarity, thereby simulating the distribution shifts of authentic workflows. We construct a diverse meta-dataset of 1,356 high-fidelity instruction-target pairs and conduct comprehensive evaluations of five leading MLLMs. Our results…
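To make the occlusion setting concrete, here is a minimal sketch of how one might quantify how much of a target element is hidden by windows stacked above it, and how a predicted click could be scored against the target. The abstract excerpt does not specify the benchmark's actual metric or data format, so everything here is an assumption for illustration: the `Rect` type, the `occlusion_ratio`, `is_hit`, and `is_visible_hit` helpers, and the point-in-box scoring rule are all hypothetical names and choices, not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned box in pixels; right and bottom edges are exclusive."""
    left: int
    top: int
    right: int
    bottom: int

    @property
    def area(self) -> int:
        return max(0, self.right - self.left) * max(0, self.bottom - self.top)

    def intersect(self, other: "Rect") -> "Rect":
        # May yield an empty (zero-area) rectangle when the boxes do not overlap.
        return Rect(
            max(self.left, other.left),
            max(self.top, other.top),
            min(self.right, other.right),
            min(self.bottom, other.bottom),
        )

    def contains(self, x: float, y: float) -> bool:
        return self.left <= x < self.right and self.top <= y < self.bottom


def occlusion_ratio(target: Rect, occluders: list[Rect]) -> float:
    """Fraction of the target's area covered by the union of the occluders."""
    if target.area == 0:
        return 0.0
    # Clip every occluder to the target and drop the pieces that miss it.
    clipped = [c for c in (target.intersect(o) for o in occluders) if c.area > 0]
    # Coordinate compression: the clipped edges split the target into grid
    # cells, each of which is either fully covered or fully uncovered.
    xs = sorted({target.left, target.right,
                 *(v for c in clipped for v in (c.left, c.right))})
    ys = sorted({target.top, target.bottom,
                 *(v for c in clipped for v in (c.top, c.bottom))})
    covered = 0
    for x0, x1 in zip(xs, xs[1:]):
        for y0, y1 in zip(ys, ys[1:]):
            if any(c.left <= x0 and x1 <= c.right and
                   c.top <= y0 and y1 <= c.bottom for c in clipped):
                covered += (x1 - x0) * (y1 - y0)
    return covered / target.area


def is_hit(x: float, y: float, target: Rect) -> bool:
    """Assumed scoring rule: the predicted point lies inside the target box."""
    return target.contains(x, y)


def is_visible_hit(x: float, y: float, target: Rect, occluders: list[Rect]) -> bool:
    """Stricter assumed rule: the point must also land on an unoccluded pixel."""
    return is_hit(x, y, target) and not any(o.contains(x, y) for o in occluders)


if __name__ == "__main__":
    # Hypothetical layout: two windows overlapping a 100x100 target element.
    target = Rect(0, 0, 100, 100)
    windows_above = [Rect(50, 0, 150, 100), Rect(0, 50, 100, 200)]
    print(occlusion_ratio(target, windows_above))
    print(is_visible_hit(25, 25, target, windows_above))
```

Computing the union through grid cells avoids double-counting regions where several windows overlap the same part of the target, which a naive sum of intersection areas would get wrong.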

Code & Models

Repositories

ZZZhr-1/WinDeskGround (GitHub)
