DocOS: Towards Proactive Document-Guided Actions in GUI Agents
Jingjing Liu, Ziye Huang, Zihao Cheng, Zeming Liu, Jiahong Wu, Yuhang Guo, Kehai Chen, Yunhong Wang, Haifeng Wang

TL;DR
This paper introduces DocOS, a benchmark for evaluating GUI agents' ability to autonomously search and utilize online documentation to handle complex, long-tailed tasks in dynamic web environments.
Contribution
The paper proposes a new paradigm of proactive document-guided actions for GUI agents and presents DocOS, a benchmark to assess their problem-solving capabilities in real-world scenarios.
Findings
Agents face challenges in reliably locating relevant information.
Agents have difficulty accurately grounding instructions into GUI actions.
Progress is limited by search and grounding bottlenecks.
Abstract
While Graphical User Interface (GUI) agents have shown promising performance in automated device interaction, they primarily depend on static parametric knowledge from pre-training or instruction tuning. This reliance fundamentally limits their ability to handle long-tailed tasks that require explicit procedural knowledge absent from model parameters, often forcing agents to resort to inefficient and brittle trial-and-error exploration. To mitigate this limitation, we introduce Proactive Document-Guided Action for GUI agents in dynamic, open-web environments, a novel paradigm that mirrors human problem-solving by enabling agents to autonomously search for relevant documentation to resolve long-tailed tasks. To evaluate agents' capability in this paradigm, we propose DocOS, a benchmark designed to assess document-guided problem solving in fully interactive environments.…
