TL;DR
AutoGUI-v2 introduces a comprehensive benchmark for evaluating deep understanding of GUI functionalities and interaction outcomes, addressing limitations of existing benchmarks and revealing current model strengths and weaknesses.
Contribution
The paper presents a novel, large-scale benchmark built on a hierarchical, multi-platform dataset to assess how well vision-language models comprehend GUIs and predict interaction outcomes.
Findings
Open-source models excel at functional grounding.
Commercial models lead in functionality captioning.
All models struggle with complex, uncommon interaction logic.
Abstract
Autonomous agents capable of navigating Graphical User Interfaces (GUIs) hold the potential to revolutionize digital productivity. However, achieving true digital autonomy extends beyond reactive element matching; it necessitates a predictive mental model of interface dynamics and the ability to foresee the "digital world state" resulting from interactions. Despite the perceptual capabilities of modern Vision-Language Models (VLMs), existing benchmarks remain bifurcated (focusing either on black-box task completion or static, shallow grounding), thereby failing to assess whether agents truly comprehend the implicit functionality and transition logic of GUIs. To bridge this gap, we introduce AutoGUI-v2, a comprehensive benchmark designed to evaluate deep GUI functionality understanding and interaction outcome prediction. We construct the benchmark using a novel VLM-human collaborative…
