AutoGUI-v2: A Comprehensive Multi-Modal GUI Functionality Understanding Benchmark

Hongxin Li; Xiping Wang; Jingran Su; Zheng Ju; Yuntao Chen; Qing Li; Zhaoxiang Zhang

arXiv:2604.24441·cs.CV·April 28, 2026

TL;DR

AutoGUI-v2 introduces a comprehensive benchmark for evaluating deep understanding of GUI functionalities and interaction outcomes, addressing limitations of existing benchmarks and revealing current model strengths and weaknesses.

Contribution

The paper presents a large-scale benchmark, built on a hierarchical, multi-platform dataset, for assessing the GUI comprehension and interaction-outcome prediction capabilities of vision-language models.

Findings

01. Open-source models excel at functional grounding.
02. Commercial models lead in functionality captioning.
03. All models struggle with complex, uncommon interaction logic.

Abstract

Autonomous agents capable of navigating Graphical User Interfaces (GUIs) hold the potential to revolutionize digital productivity. However, achieving true digital autonomy extends beyond reactive element matching; it necessitates a predictive mental model of interface dynamics and the ability to foresee the "digital world state" resulting from interactions. Despite the perceptual capabilities of modern Vision-Language Models (VLMs), existing benchmarks remain bifurcated (focusing either on black-box task completion or static, shallow grounding), thereby failing to assess whether agents truly comprehend the implicit functionality and transition logic of GUIs. To bridge this gap, we introduce AutoGUI-v2, a comprehensive benchmark designed to evaluate deep GUI functionality understanding and interaction outcome prediction. We construct the benchmark using a novel VLM-human collaborative…

Code & Models

Repositories

zjulihongxin/AutoGUI-v2 (GitHub)
