VideoAgentTrek: Computer Use Pretraining from Unlabeled Videos

Dunjie Lu; Yiheng Xu; Junli Wang; Haoyuan Wu; Xinyuan Wang; Zekun Wang; Junlin Yang; Hongjin Su; Jixuan Chen; Junda Chen; Yuchen Mao; Jingren Zhou; Junyang Lin; Binyuan Hui; Tao Yu

arXiv:2510.19488 · cs.CL · October 23, 2025

Open Access

TL;DR

VideoAgentTrek introduces a scalable method to automatically extract GUI interaction data from publicly available videos, significantly reducing manual annotation effort and improving computer-use agent performance.

Contribution

The paper presents VideoAgentTrek, a pipeline that mines GUI interaction data from unlabeled screen-recorded videos, and Video2Action, an inverse dynamics module that turns those recordings into structured action data for training agents.

Findings

01. Generated 1.52 million interaction steps from 39,000 videos.

02. Achieved a 70% relative improvement in task success rates.

03. Enhanced step accuracy from 64.1% to 69.3%.
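
To put these figures in proportion, the short calculation below derives the average number of interaction steps per video and the relative size of the step-accuracy gain. It uses only the rounded numbers quoted above, so the results are approximate; the helper name `relative_gain` is illustrative and does not come from the paper.

```python
def relative_gain(before: float, after: float) -> float:
    """Relative improvement of `after` over `before`, as a fraction."""
    return (after - before) / before


# Rounded figures taken from the findings listed above.
steps_total = 1_520_000   # interaction steps mined
videos_total = 39_000     # source videos

steps_per_video = steps_total / videos_total
step_acc_gain = relative_gain(64.1, 69.3)

print(f"average steps per video: {steps_per_video:.1f}")
print(f"relative step-accuracy gain: {step_acc_gain:.1%}")
```

On these inputs each video contributes roughly 39 steps on average, and the 5.2-point rise in step accuracy corresponds to a relative gain of about 8%.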

Abstract

Training computer-use agents requires massive amounts of GUI interaction data, but manually annotating action trajectories at scale is prohibitively expensive. We present VideoAgentTrek, a scalable pipeline that automatically mines training data from publicly available screen-recorded videos at web scale, eliminating the need for manual annotation. Our approach addresses a key challenge: raw videos contain implicit demonstrations but lack explicit action labels. To solve this, we develop Video2Action, an inverse dynamics module (IDM) with two components: (1) a video grounding model that detects and localizes GUI actions with precise temporal boundaries and context, and (2) an action-content recognizer that extracts structured parameters like click coordinates and typed text with high fidelity. Applied to 39,000 YouTube tutorial videos, our pipeline generates 1.52 million interaction…
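
The two-component structure described in the abstract can be pictured as a two-stage function: a grounding stage that returns action segments with temporal boundaries, followed by a recognition stage that fills in parameters for each segment. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation. Every name in it (`ActionSegment`, `ActionStep`, `video_to_steps`, the stub functions, and the file name) is hypothetical, and the stubs stand in for the learned models.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ActionSegment:
    """Stage 1 output: an action type with temporal boundaries in seconds."""
    action_type: str
    start_s: float
    end_s: float


@dataclass
class ActionStep:
    """Stage 2 output: a segment plus its structured parameters."""
    segment: ActionSegment
    params: Dict[str, object] = field(default_factory=dict)


# Hypothetical interfaces for the two components.
Grounder = Callable[[str], List[ActionSegment]]
Recognizer = Callable[[str, ActionSegment], Dict[str, object]]


def video_to_steps(video_path: str, ground: Grounder,
                   recognize: Recognizer) -> List[ActionStep]:
    """Run grounding, then recognition, and return steps in temporal order."""
    steps: List[ActionStep] = []
    for seg in sorted(ground(video_path), key=lambda s: s.start_s):
        if seg.end_s <= seg.start_s:
            continue  # drop degenerate segments with no duration
        steps.append(ActionStep(seg, recognize(video_path, seg)))
    return steps


# Stub components (assumptions for illustration only).
def fake_ground(video_path: str) -> List[ActionSegment]:
    return [ActionSegment("type", 4.0, 6.5), ActionSegment("click", 1.2, 1.6)]


def fake_recognize(video_path: str, seg: ActionSegment) -> Dict[str, object]:
    if seg.action_type == "click":
        return {"x": 412, "y": 88}
    return {"text": "hello"}


trajectory = video_to_steps("tutorial.mp4", fake_ground, fake_recognize)
for step in trajectory:
    print(step.segment.action_type, step.segment.start_s, step.params)
```

Keeping the two stages behind separate callables mirrors the split in the abstract: the grounding stage only decides what happened and when, while the recognizer only extracts parameters such as click coordinates or typed text for a segment it is handed.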

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Human Motion and Animation