AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials
Yiheng Xu, Dunjie Lu, Zhennan Shen, Junli Wang, Zekun Wang, Yuchen Mao, Caiming Xiong, Tao Yu

TL;DR
AgentTrek introduces a scalable pipeline that synthesizes web agent trajectories from tutorials, enabling cost-effective training of GUI agents with multimodal data and achieving state-of-the-art results.
Contribution
We propose a novel automated method to generate high-quality web agent trajectories from tutorials, reducing reliance on manual annotation and enabling scalable training.
Findings
Achieves state-of-the-art performance on web browsing benchmarks
Reduces data collection cost to $0.55 per trajectory
Demonstrates effective multimodal, guided replay for agent training
Abstract
Graphical User Interface (GUI) agents can automate complex tasks across digital environments, but their development is hindered by the scarcity of high-quality trajectory data for training. Existing approaches rely on expensive human annotation, making them unsustainable at scale. We propose AgentTrek, a scalable data synthesis pipeline that generates web agent trajectories by leveraging publicly available tutorials. Our three-stage method: (1) automatically harvests and filters tutorial-like texts from the internet using a specialized classification model, (2) transforms these texts into structured task specifications with step-by-step instructions, and (3) employs a visual-language model (VLM) agent to execute these instructions in real environments, while a VLM-based evaluator verifies trajectory correctness. The synthesized trajectories encompass multiple modalities, including…
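The three stages described in the abstract can be sketched as a small filter, transform, replay, and verify loop. This is only an illustrative outline, not the authors' implementation: every name here (`is_tutorial`, `to_task_spec`, `replay`, `synthesize`, `TaskSpec`, `Trajectory`) is hypothetical, the keyword heuristic stands in for the specialized classification model, and the `act` and `evaluate` callables stand in for the VLM agent and the VLM-based evaluator.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class TaskSpec:
    """Structured task: a goal plus step-by-step instructions (hypothetical schema)."""
    goal: str
    steps: List[str]


@dataclass
class Trajectory:
    """Actions recorded while replaying one task (hypothetical schema)."""
    task: TaskSpec
    actions: List[str] = field(default_factory=list)


def is_tutorial(text: str) -> bool:
    # Stage 1 stand-in: a keyword heuristic in place of a trained classifier.
    lowered = text.lower()
    return "how to" in lowered and "step" in lowered


def to_task_spec(text: str) -> TaskSpec:
    # Stage 2 stand-in: first non-empty line is the goal, the rest are steps.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    return TaskSpec(goal=lines[0], steps=lines[1:])


def replay(task: TaskSpec, act: Callable[[str], str]) -> Trajectory:
    # Stage 3a: the agent (assumed callable) executes each instruction in order.
    traj = Trajectory(task=task)
    for step in task.steps:
        traj.actions.append(act(step))
    return traj


def synthesize(
    texts: List[str],
    act: Callable[[str], str],
    evaluate: Callable[[Trajectory], bool],
) -> List[Trajectory]:
    # Keep only trajectories that the evaluator (assumed callable) accepts.
    kept = []
    for text in texts:
        if not is_tutorial(text):
            continue
        traj = replay(to_task_spec(text), act)
        if evaluate(traj):  # Stage 3b: verification
            kept.append(traj)
    return kept
```

In this sketch, swapping the stubs for real components changes only the three callables; the control flow (discard non-tutorials, replay step by step, keep only verified trajectories) stays the same.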
Taxonomy
Topics: Multi-Agent Systems and Negotiation · Human Motion and Animation · Natural Language Processing Techniques
