Learning from Online Videos at Inference Time for Computer-Use Agents

Yujian Liu; Ze Wang; Hao Chen; Ximeng Sun; Xiaodong Yu; Jialian Wu; Jiang Liu; Emad Barsoum; Zicheng Liu; Shiyu Chang

arXiv:2511.04137·cs.CV·November 7, 2025

TL;DR

This paper introduces a framework enabling computer-use agents to learn from online tutorial videos at inference time, improving their ability to perform complex, domain-specific tasks by dynamically selecting visual guidance.

Contribution

The authors propose a novel method that retrieves, segments, and dynamically selects video trajectories as in-context guidance, enhancing agent performance over text-only approaches.

Findings

01 Outperforms baseline agents on benchmark tasks

02 Trajectory segmentation and selection are crucial for success

03 Visual information significantly improves guidance quality

Abstract

Computer-use agents can operate computers and automate laborious tasks, but despite recent rapid progress, they still lag behind human users, especially when tasks require domain-specific procedural knowledge about particular applications, platforms, and multi-step workflows. Humans can bridge this gap by watching video tutorials: we search, skim, and selectively imitate short segments that match our current subgoal. In this paper, we study how to enable computer-use agents to learn effectively from online videos at inference time. We propose a framework that retrieves and filters tutorial videos, converts them into structured demonstration trajectories, and dynamically selects trajectories as in-context guidance during execution. In particular, using a vision-language model (VLM), we infer UI actions, segment videos into short subsequences of actions, and assign each subsequence a textual objective. At…
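To make the pipeline in the abstract more concrete, the following is a minimal sketch of one way to represent segmented demonstration trajectories and pick one as in-context guidance for the current subgoal. Everything in it is an illustrative assumption rather than the authors' implementation: the names `Action`, `Segment`, `select_segments`, and `format_guidance` are hypothetical, and the token-overlap (Jaccard) score is only a simple stand-in, since the abstract as shown here is truncated before it describes how selection is actually done.

```python
from dataclasses import dataclass, field


@dataclass
class Action:
    """One inferred UI action (hypothetical schema, not from the paper)."""
    kind: str    # e.g. "click", "type"
    target: str  # short description of the UI element acted on


@dataclass
class Segment:
    """A short subsequence of actions labeled with a textual objective."""
    objective: str
    actions: list = field(default_factory=list)


def _tokens(text: str) -> set:
    # Lowercased whitespace tokens; deliberately simplistic.
    return set(text.lower().split())


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity in [0, 1]; a stand-in scoring function."""
    ta, tb = _tokens(a), _tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def select_segments(subgoal: str, segments: list, k: int = 1) -> list:
    """Return the k segments whose objectives best match the subgoal."""
    ranked = sorted(
        segments,
        key=lambda s: jaccard(subgoal, s.objective),
        reverse=True,
    )
    return ranked[:k]


def format_guidance(segments: list) -> str:
    """Render selected segments as plain text for an agent's context."""
    lines = []
    for seg in segments:
        lines.append(f"Objective: {seg.objective}")
        for i, act in enumerate(seg.actions, start=1):
            lines.append(f"  {i}. {act.kind} -> {act.target}")
    return "\n".join(lines)


if __name__ == "__main__":
    library = [
        Segment("open the export dialog", [Action("click", "File menu")]),
        Segment("apply a gaussian blur filter", [Action("click", "Filters menu")]),
    ]
    best = select_segments("apply blur filter to the image", library, k=1)
    print(format_guidance(best))
```

In this sketch the selection step is re-run whenever the agent's subgoal changes, which mirrors the "dynamically selects" wording in the abstract. The sketch is text-only: a fuller version would also carry the visual information (for example, the frames behind each action) alongside each segment.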

Taxonomy

Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Topic Modeling