AUTO-Explorer: Automated Data Collection for GUI Agent
Xiangwu Guo, Difei Gao, Mike Zheng Shou

TL;DR
Auto-Explorer introduces an automated, low-cost data collection method for GUI agents that efficiently explores GUI environments, enabling rapid fine-tuning of multimodal large language models for software interaction tasks.
Contribution
It presents a novel exploration mechanism for GUI data collection, along with the UIXplore benchmark to evaluate exploration quality, improving model adaptation to new software environments.
Findings
Auto-Explorer outperforms existing methods in data collection efficiency.
Fine-tuning with Auto-Explorer data enhances MLLM performance on GUI tasks.
The UIXplore benchmark effectively measures exploration strategy quality.
Abstract
Recent advancements in GUI agents have significantly expanded their ability to interpret natural language commands to manage software interfaces. However, acquiring GUI data remains a significant challenge. Existing methods often involve designing automated agents that browse URLs from the Common Crawl, using webpage HTML to collect screenshots and corresponding annotations, including the names and bounding boxes of UI elements. This approach is difficult to apply to desktop software or to newly launched websites not included in the Common Crawl. Although models are expected to generalize well enough to handle such cases, collecting data for them remains crucial in personalized scenarios that require rapid and perfect adaptation to new software or websites. To address this, we propose an automated data collection method with minimal annotation costs, named Auto-Explorer. It incorporates…
Taxonomy
Topics: Web Data Mining and Analysis · Artificial Intelligence in Games · Speech and dialogue systems
