Video2GUI: Synthesizing Large-Scale Interaction Trajectories for Generalized GUI Agent Pretraining

Weimin Xiong; Shuhao Gu; Bowen Ye; Zihao Yue; Lei Li; Feifan Song; Sujian Li; Hao Tian

arXiv:2605.14747 · cs.CL · May 15, 2026

TL;DR

Video2GUI introduces an automated method to extract large-scale GUI interaction data from unlabeled videos, enabling pretraining of GUI agents that generalize better across diverse applications.

Contribution

The paper presents Video2GUI, a fully automated framework for creating extensive GUI interaction datasets from internet videos, facilitating improved pretraining of GUI agents.

Findings

01 Pretraining on WildGUI improves GUI grounding and action benchmark performance by 5-20%.

02 The WildGUI dataset contains 12 million interaction trajectories spanning over 1,500 applications and websites, built by applying the pipeline to 500 million video metadata entries.

03 The approach achieves state-of-the-art results on multiple GUI-related benchmarks.

Abstract

Recent advances in multimodal large language models have driven growing interest in graphical user interface (GUI) agents, yet their generalization remains constrained by the scarcity of large-scale training data spanning diverse real-world applications. Existing datasets rely heavily on costly manual annotations and are typically confined to narrow domains. To address this challenge, we propose Video2GUI, a fully automated framework that extracts grounded GUI interaction trajectories directly from unlabeled Internet videos. Video2GUI employs a coarse-to-fine filtering strategy to identify high-quality GUI tutorial videos and convert them into structured agent trajectories. Applying this pipeline to 500 million video metadata entries, we construct WildGUI, a large-scale dataset containing 12 million interaction trajectories spanning over 1,500 applications and websites. Pre-training…
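The coarse-to-fine filtering idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the `VideoMeta` record, the keyword list, the `toy_score` function, and the threshold are all hypothetical stand-ins. The sketch only shows the general pattern of running a cheap metadata check over everything first, then applying a more expensive quality score to the survivors.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List


@dataclass
class VideoMeta:
    """Hypothetical metadata record; field names are illustrative only."""
    video_id: str
    title: str
    description: str = ""


# Hypothetical keyword list for the cheap first pass (not from the paper).
COARSE_KEYWORDS = ("tutorial", "how to", "walkthrough")


def coarse_filter(videos: Iterable[VideoMeta]) -> Iterator[VideoMeta]:
    """Cheap pass: keep entries whose text mentions any coarse keyword."""
    for v in videos:
        text = f"{v.title} {v.description}".lower()
        if any(k in text for k in COARSE_KEYWORDS):
            yield v


def fine_filter(
    videos: Iterable[VideoMeta],
    score_fn: Callable[[VideoMeta], float],
    threshold: float = 0.5,  # assumed cutoff, chosen only for illustration
) -> List[VideoMeta]:
    """Expensive pass: keep entries whose quality score meets the threshold."""
    return [v for v in videos if score_fn(v) >= threshold]


def toy_score(v: VideoMeta) -> float:
    """Stand-in for a costly quality model: fraction of GUI terms present."""
    terms = ("click", "menu", "settings", "button")
    text = f"{v.title} {v.description}".lower()
    return sum(t in text for t in terms) / len(terms)


def select_candidates(videos: Iterable[VideoMeta]) -> List[str]:
    """Run the coarse pass, then the fine pass, and return surviving ids."""
    return [v.video_id for v in fine_filter(coarse_filter(videos), toy_score)]


if __name__ == "__main__":
    sample = [
        VideoMeta("a", "How to change settings", "click the menu button"),
        VideoMeta("b", "Cat compilation"),
        VideoMeta("c", "Guitar tutorial", "learn chords"),
    ]
    print(select_candidates(sample))
```

The point of ordering the stages this way is cost: the keyword check touches every entry but is nearly free, so the scoring function, which in a real system would likely be a model call, only runs on the small fraction that passes.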

Code & Models

Repository: weiminxiong/Video2GUI (GitHub)
