GUIDE: Resolving Domain Bias in GUI Agents through Real-Time Web Video Retrieval and Plug-and-Play Annotation
Rui Xie, Zhi Gao, Chenrui Shi, Zirui Shang, Lu Chen, and Qing Li

TL;DR
GUIDE is a training-free framework that enhances GUI agents by autonomously acquiring domain-specific knowledge from web tutorial videos, significantly reducing domain bias and improving task performance.
Contribution
GUIDE introduces a novel retrieval-augmented annotation pipeline that enables GUI agents to learn domain-specific expertise without retraining, improving their real-world applicability.
Findings
GUIDE improves GUI agent performance by over 5% on the OSWorld benchmark.
It reduces execution steps without modifying model parameters.
The framework demonstrates broad applicability as a plug-and-play component.
Abstract
Large vision-language models have endowed GUI agents with strong general capabilities for interface understanding and interaction. However, due to insufficient exposure to domain-specific software operation data during training, these agents exhibit significant domain bias: they lack familiarity with the specific operation workflows (planning) and UI element layouts (grounding) of particular applications, which limits their real-world task performance. In this paper, we present GUIDE (GUI Unbiasing via Instructional-Video Driven Expertise), a training-free, plug-and-play framework that resolves GUI agent domain bias by autonomously acquiring domain-specific expertise from web tutorial videos through a retrieval-augmented automated annotation pipeline. GUIDE introduces two key innovations. First, a subtitle-driven Video-RAG pipeline unlocks video semantics through subtitle analysis,…
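To make the general idea of subtitle-based retrieval concrete, here is a minimal sketch. It is not the paper's pipeline: the token-overlap scoring, the function names (`overlap_score`, `retrieve`, `build_prompt`), the prompt layout, and the two sample videos are all assumptions made up for illustration. It shows only how tutorial subtitles could be ranked against a task description and prepended to an agent's prompt while the model's parameters stay untouched.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap_score(query, subtitles):
    """Count tokens shared by the query and the subtitles (multiset overlap).

    A stand-in relevance measure chosen for simplicity; it is an assumption,
    not the scoring used by GUIDE.
    """
    q = Counter(tokenize(query))
    s = Counter(tokenize(subtitles))
    return sum(min(q[t], s[t]) for t in q)


def retrieve(query, videos, k=1):
    """Return the k videos whose subtitles overlap most with the query."""
    ranked = sorted(
        videos,
        key=lambda v: overlap_score(query, v["subtitles"]),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(task, retrieved):
    """Prepend retrieved subtitle text to the task as plain-text context.

    The agent model itself is not modified; only its input changes.
    """
    notes = "\n".join(f"- {v['title']}: {v['subtitles']}" for v in retrieved)
    return f"Task: {task}\nReference notes from tutorial videos:\n{notes}"


# Hypothetical sample data, invented for this sketch.
videos = [
    {
        "title": "Pivot tables in LibreOffice Calc",
        "subtitles": "Select the data range, open the Insert menu, and choose pivot table.",
    },
    {
        "title": "Cropping images in GIMP",
        "subtitles": "Pick the crop tool, drag over the image, and press enter.",
    },
]

task = "Create a pivot table from the sales data in Calc"
top = retrieve(task, videos, k=1)
prompt = build_prompt(task, top)
print(prompt)
```

A real system would likely replace the overlap count with a stronger retriever and process video frames as well as subtitles; the sketch only illustrates the training-free, plug-and-play shape of the approach.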
