Video-Browser: Towards Agentic Open-web Video Browsing
Zhengyang Liang, Yan Shu, Xiangrui Liu, Minghao Qin, Kaixin Liang, Nicu Sebe, Zheng Liu, Lizi Liao

TL;DR
This paper introduces Video-Browser, an agentic framework for open-web video browsing that balances fine-grained visual perception against efficiency, yielding a 37.5% relative improvement over baseline methods while using 58.3% fewer tokens than direct visual inference.
Contribution
We formalize the task of Agentic Video Browsing, propose the Video-Browser framework with Pyramidal Perception, and establish a benchmark for open-ended video exploration.
Findings
Achieved a 37.5% relative improvement over baseline methods.
Reduced token consumption by 58.3% compared to direct visual inference.
Established a foundation for verifiable open-web video research.
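To make the two percentages above concrete, here is a minimal sketch of how a relative improvement and a relative reduction are usually computed. The input values are hypothetical, chosen only so that the arithmetic reproduces the reported percentages; they are not the paper's actual scores or token counts, and the function names are illustrative.

```python
def relative_improvement(baseline: float, new: float) -> float:
    """Percentage gain of `new` over `baseline`, relative to `baseline`."""
    return (new - baseline) / baseline * 100.0


def relative_reduction(reference: float, new: float) -> float:
    """Percentage drop from `reference` to `new`, relative to `reference`."""
    return (reference - new) / reference * 100.0


# HYPOTHETICAL inputs (assumptions for illustration, not figures from the paper).
hypo_baseline_score = 16.0   # e.g. a baseline accuracy in percent
hypo_new_score = 22.0        # e.g. the improved accuracy in percent
hypo_direct_tokens = 12000   # e.g. tokens used by direct visual inference
hypo_new_tokens = 5000       # e.g. tokens used by the more efficient method

improvement = relative_improvement(hypo_baseline_score, hypo_new_score)
reduction = relative_reduction(hypo_direct_tokens, hypo_new_tokens)

print(f"relative improvement: {improvement:.1f}%")  # 37.5%
print(f"token reduction: {reduction:.1f}%")         # 58.3%
```

Note that a relative improvement is measured against the baseline value itself, so a 37.5% relative gain does not mean the score rose by 37.5 percentage points.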
Abstract
The evolution of autonomous agents is redefining information seeking, transitioning from passive retrieval to proactive, open-ended web research. However, a significant modality gap remains in processing the web's most dynamic and information-dense modality: video. In this paper, we first formalize the task of Agentic Video Browsing and introduce Video-BrowseComp, a benchmark evaluating open-ended agentic browsing tasks that enforce a mandatory dependency on videos. We observe that current paradigms struggle to reconcile the scale of open-ended video exploration with the need for fine-grained visual verification. Direct visual inference (e.g., RAG) maximizes perception but incurs prohibitive context costs, while text-centric summarization optimizes efficiency but often misses critical visual details required for accurate grounding. To address this, we propose Video-Browser, a novel…
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Artificial Intelligence in Games
