TL;DR
Video2GUI introduces an automated method to extract large-scale GUI interaction data from unlabeled videos, enabling pretraining of GUI agents that generalize better across diverse applications.
Contribution
The paper presents Video2GUI, a fully automated framework that extracts grounded GUI interaction trajectories from unlabeled Internet videos, producing large-scale data for pretraining GUI agents without manual annotation.
Findings
Pretraining on WildGUI improves GUI grounding and action benchmark performance by 5-20%.
Constructed the WildGUI dataset of 12 million interaction trajectories, spanning over 1,500 applications and websites, by applying the pipeline to 500 million video metadata entries.
Achieved state-of-the-art results on multiple GUI-related benchmarks.
Abstract
Recent advances in multimodal large language models have driven growing interest in graphical user interface (GUI) agents, yet their generalization remains constrained by the scarcity of large-scale training data spanning diverse real-world applications. Existing datasets rely heavily on costly manual annotations and are typically confined to narrow domains. To address this challenge, we propose Video2GUI, a fully automated framework that extracts grounded GUI interaction trajectories directly from unlabeled Internet videos. Video2GUI employs a coarse-to-fine filtering strategy to identify high-quality GUI tutorial videos and convert them into structured agent trajectories. Applying this pipeline to 500 million video metadata entries, we construct WildGUI, a large-scale dataset containing 12 million interaction trajectories spanning over 1,500 applications and websites. Pre-training…
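The coarse-to-fine filtering idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the keyword list, the `VideoMeta` fields, the `toy_scorer` function, and the threshold are all hypothetical stand-ins chosen for illustration. The sketch only shows the general pattern of running a cheap metadata check over everything first, then applying a more expensive scorer to the survivors.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

# Hypothetical keywords; the paper's actual coarse criteria are not stated here.
GUI_KEYWORDS = ("tutorial", "how to", "walkthrough", "step by step")


@dataclass
class VideoMeta:
    """Assumed shape of one video metadata entry (illustrative only)."""
    video_id: str
    title: str
    description: str = ""


def coarse_filter(entries: Iterable[VideoMeta]) -> List[VideoMeta]:
    """Cheap pass: keep entries whose title or description mentions a keyword."""
    kept = []
    for entry in entries:
        text = f"{entry.title} {entry.description}".lower()
        if any(keyword in text for keyword in GUI_KEYWORDS):
            kept.append(entry)
    return kept


def fine_filter(
    entries: Iterable[VideoMeta],
    scorer: Callable[[VideoMeta], float],
    threshold: float = 0.5,
) -> List[VideoMeta]:
    """Expensive pass: keep entries the scorer rates at or above the threshold."""
    return [entry for entry in entries if scorer(entry) >= threshold]


def toy_scorer(entry: VideoMeta) -> float:
    """Stand-in for a costly quality model; here it just looks for 'screen'."""
    text = f"{entry.title} {entry.description}".lower()
    return 1.0 if "screen" in text else 0.0


def coarse_to_fine(
    entries: Iterable[VideoMeta],
    scorer: Callable[[VideoMeta], float] = toy_scorer,
    threshold: float = 0.5,
) -> List[VideoMeta]:
    """Run the cheap filter first so the costly scorer sees fewer entries."""
    return fine_filter(coarse_filter(entries), scorer, threshold)
```

The ordering is the point of the design: the expensive scorer runs only on entries that pass the cheap check, which matters when the input is hundreds of millions of metadata entries.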
