VenusBench-GD: A Comprehensive Multi-Platform GUI Benchmark for Diverse Grounding Tasks

Beitong Zhou; Zhexiao Huang; Yuan Guo; Zhangxuan Gu; Tianyu Xia; Zichen Luo; Fei Tang; Dehan Kong; Yanyi Shang; Suling Ou; Zhenlin Guo; Changhua Meng; Shuheng Shen

arXiv:2512.16501 · cs.CV · December 19, 2025



TL;DR

VenusBench-GD is a large-scale, bilingual, multi-platform GUI benchmark with hierarchical evaluation for diverse grounding tasks, revealing strengths and weaknesses of current models across basic and advanced GUI grounding challenges.

Contribution

The paper introduces a comprehensive, cross-platform GUI benchmark with a hierarchical task taxonomy and improved annotation quality, addressing the limited data volume, narrow domain coverage, and single-platform focus of existing benchmarks.

Findings

01 Multimodal models now outperform specialized GUI models on basic tasks.

02 Advanced tasks still favor GUI-specialized models but show overfitting and poor robustness.

03 The benchmark enables more thorough evaluation of GUI grounding models.

Abstract

GUI grounding is a critical component in building capable GUI agents. However, existing grounding benchmarks suffer from significant limitations: they either provide insufficient data volume and narrow domain coverage, or focus excessively on a single platform and require highly specialized domain knowledge. In this work, we present VenusBench-GD, a comprehensive, bilingual benchmark for GUI grounding that spans multiple platforms, enabling hierarchical evaluation for real-world applications. VenusBench-GD contributes as follows: (i) we introduce a large-scale, cross-platform benchmark with extensive coverage of applications, diverse UI elements, and rich annotated data, (ii) we establish a high-quality data construction pipeline for grounding tasks, achieving higher annotation accuracy than existing benchmarks, and (iii) we extend the scope of element grounding by proposing a…
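For readers new to the task, the sketch below illustrates one common way GUI grounding predictions are scored: a predicted click point counts as correct when it falls inside the annotated bounding box of the target element. This is an illustrative assumption, not the metric or data schema confirmed by this page; the names `Box` and `grounding_accuracy` are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Box:
    """Hypothetical axis-aligned bounding box in pixel coordinates."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        # Edges are treated as inside the box.
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def grounding_accuracy(
    predictions: Iterable[Tuple[float, float]],
    targets: Iterable[Box],
) -> float:
    """Fraction of predicted click points that land inside their target box.

    Assumed scoring rule (point-in-box); the benchmark's own protocol may differ.
    """
    pairs = list(zip(predictions, targets))
    if not pairs:
        return 0.0
    hits = sum(1 for (x, y), box in pairs if box.contains(x, y))
    return hits / len(pairs)


if __name__ == "__main__":
    preds = [(5.0, 5.0), (50.0, 50.0)]
    boxes = [Box(0, 0, 10, 10), Box(0, 0, 10, 10)]
    print(grounding_accuracy(preds, boxes))
```

Under a hierarchical evaluation, the same function could be applied separately to each task category to compare basic and advanced grounding performance.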


Code & Models

Datasets

inclusionAI/VenusBench-GD
dataset · 2.2k downloads


Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Speech and dialogue systems