AutoGUI: Scaling GUI Grounding with Automatic Functionality Annotations from LLMs
Hongxin Li, Jingfan Chen, Jingran Su, Yuntao Chen, Qing Li, Zhaoxiang Zhang

TL;DR
AutoGUI introduces a scalable pipeline leveraging large language models to automatically generate detailed UI element functionality annotations, significantly improving UI grounding and enabling advanced software automation tasks.
Contribution
The paper presents a novel LLM-based method for large-scale, high-quality UI annotation, creating the AutoGUI-704k dataset with detailed functionality labels.
Findings
AutoGUI-704k dataset improves UI grounding accuracy.
Annotations are comparable to human quality.
Enhanced performance in UI agent tasks.
Abstract
User interface understanding with vision-language models (VLMs) has received much attention due to its potential for enhancing software automation. However, existing datasets used to build UI-VLMs either only contain large-scale context-free element annotations or contextualized functional descriptions for elements at a small scale. In this work, we propose the AutoGUI pipeline for automatically annotating UI elements with detailed functionality descriptions at scale. Specifically, we leverage large language models (LLMs) to infer element functionality by comparing UI state changes before and after simulated interactions. To improve annotation quality, we propose LLM-aided rejection and verification, eliminating invalid annotations without human labor. We construct a high-quality AutoGUI-704k dataset using the proposed pipeline, featuring diverse and detailed functionality…
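The annotation loop described in the abstract (compare UI state before and after a simulated interaction, reject invalid samples, describe the functionality, then verify it) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Interaction` type, the function names, and the three callables `is_meaningful`, `describe`, and `verify` are hypothetical stand-ins for LLM calls, and UI states are simplified to lists of text lines.

```python
import difflib
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Interaction:
    """One simulated interaction (hypothetical structure for illustration)."""
    element: str        # short description of the element that was acted on
    before: list[str]   # UI content lines before the interaction
    after: list[str]    # UI content lines after the interaction


def ui_diff(before: list[str], after: list[str]) -> list[str]:
    """Return only the added ('+ ') and removed ('- ') lines between two states."""
    return [line for line in difflib.ndiff(before, after)
            if line.startswith(("+ ", "- "))]


def annotate(
    interaction: Interaction,
    is_meaningful: Callable[[str, list[str]], bool],  # stand-in for LLM-aided rejection
    describe: Callable[[str, list[str]], str],        # stand-in for LLM functionality inference
    verify: Callable[[str, str, list[str]], bool],    # stand-in for LLM-aided verification
) -> Optional[str]:
    """Return a functionality description, or None if the sample is discarded."""
    changes = ui_diff(interaction.before, interaction.after)
    # Rejection: drop interactions with no observable or no meaningful change.
    if not changes or not is_meaningful(interaction.element, changes):
        return None
    description = describe(interaction.element, changes)
    # Verification: keep the annotation only if the checker accepts it.
    if not verify(interaction.element, description, changes):
        return None
    return description
```

In practice each callable would wrap a prompt to an LLM; passing them in as parameters keeps the control flow testable with simple stubs and makes the rejection and verification stages easy to swap out.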
Peer Reviews
Decision: Submitted to ICLR 2025
1. The AutoGUI pipeline provides a scalable alternative to manual UI annotation by using LLMs for functionality labeling, reducing labor and advancing VLM understanding of UI elements.
2. The pipeline annotates functionality based on UI dynamics, using LLMs to analyze content changes triggered by interactions. This approach enables functionality labeling without manual intervention, capturing detailed functional nuances.
3. AutoGUI-704k dataset covers Web, Mobile device types and UI contexts, valuable
1. The experiments focus on specific test sets and benchmarks but lack an analysis of the finetuned model’s generalization across diverse UI types and applications. This may affect the pipeline’s robustness in handling various UI designs, platforms, and complex interactions in real-world settings.
2. Although there are some human checks, the pipeline relies heavily on LLMs for rejection and verification. This raises concerns about whether LLM-based processes alone can consistently maintain high
- The paper is well written and easy to read.
- The figures presented in the paper are useful for understanding the pipeline through real examples.
- The AutoGUI pipeline offers good advantages over most of its competitors (as shown in Table 1), especially for 1) its scalability and automation (removing the need for costly human annotation), 2) contextualized functionality annotations, 3) dataset size, and 4) coverage of both web and Android.
- The analysis on data quality comparing di
- The dataset would be more valuable if it covered platforms beyond websites and Android UI, for instance other operating systems and new UI applications.
- The number of baselines appears quite limited. I would like to see the performance of state-of-the-art VLMs (both open and closed source). Results from closed GPT-4o, Claude 3.5 Sonnet, or Gemini would help in comparing methods on the presented benchmarks. Similarly, there are a number of powerful open-sourc
Quality & Clarity: The paper points out the shortcomings of current GUI datasets and proposes a data collection pipeline to address them. The collected data is first analyzed for correctness to ensure its quality, and its effectiveness is subsequently demonstrated through experiments. The writing is logically structured and clearly expressed.
Limited Evidence: The experiments cannot fully demonstrate the effectiveness of AutoGUI data. 1. This paper evaluates AutoGUI data on 6 benchmarks as shown in Table 4. The effectiveness of AutoGUI data can be assessed by comparing the results of *Qwen-VL-AutoGUI702k* and *SeeClick* as they use the same base model. The results on *FuncPred* benchmark are excluded from consideration as *FuncPred* is derived from AutoGUI dataset and shares the same data distribution with it. In the remaining 5 ben
Taxonomy
Topics: Semantic Web and Ontologies · Natural Language Processing Techniques
Methods: Softmax · Attention Is All You Need
