VCA: Video Curious Agent for Long Video Understanding
Zeyuan Yang, Delin Chen, Xueyang Yu, Maohao Shen, Chuang Gan

TL;DR
This paper introduces VCA, a curiosity-driven video agent that autonomously explores long videos using a tree-search structure and intrinsic rewards from VLMs, achieving efficient and effective understanding of complex sequences.
Contribution
VCA is a novel self-exploring video agent that leverages intrinsic rewards and tree-search exploration to understand long videos more efficiently than existing methods.
Findings
VCA outperforms existing methods on multiple long video benchmarks.
VCA demonstrates higher efficiency in video understanding tasks.
VCA effectively captures crucial information without external rewards.
Abstract
Long video understanding poses unique challenges due to its temporal complexity and low information density. Recent works address this task by sampling numerous frames or incorporating auxiliary tools using LLMs, both of which result in high computational costs. In this work, we introduce a curiosity-driven video agent with self-exploration capability, dubbed VCA. Built upon VLMs, VCA autonomously navigates video segments and efficiently builds a comprehensive understanding of complex video sequences. Instead of directly sampling frames, VCA employs a tree-search structure to explore video segments and collect frames. Rather than relying on external feedback or reward, VCA leverages the VLM's self-generated intrinsic reward to guide its exploration, enabling it to capture the most crucial information for reasoning. Experimental results on multiple long video benchmarks demonstrate our…
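The abstract describes exploring video segments with a tree search guided by a self-generated reward, but gives no algorithmic detail. The following is only a minimal illustrative sketch of that general idea, not the authors' method. It assumes a best-first search over a recursively split timeline, and it replaces the VLM with a plain callable. The names `explore_segments`, `score_fn`, `budget`, and `branch` are hypothetical and do not come from the paper.

```python
import heapq
from typing import Callable, List, Tuple

Segment = Tuple[int, int]  # half-open frame range [start, end)


def explore_segments(
    num_frames: int,
    score_fn: Callable[[Segment], float],  # hypothetical stand-in for a VLM-generated reward
    budget: int,
    branch: int = 2,
) -> List[int]:
    """Best-first exploration of a segment tree; returns the collected frame indices."""
    collected: List[int] = []
    root: Segment = (0, num_frames)
    # heapq is a min-heap, so scores are negated to pop the highest-reward segment first.
    heap = [(-score_fn(root), root[0], root[1])]

    while heap and len(collected) < budget:
        _, start, end = heapq.heappop(heap)

        # Collect one representative frame (the midpoint) from the chosen segment.
        mid_frame = (start + end) // 2
        if mid_frame not in collected:
            collected.append(mid_frame)

        length = end - start
        if length <= 1:
            continue  # a single-frame segment cannot be split further

        # Split the segment into up to `branch` children and score each one.
        step = -(-length // branch)  # ceiling division
        for child_start in range(start, end, step):
            child: Segment = (child_start, min(child_start + step, end))
            heapq.heappush(heap, (-score_fn(child), child[0], child[1]))

    return sorted(collected)
```

With a toy scorer that rewards segments centered near a frame of interest, the search spends most of its frame budget around that frame instead of sampling the timeline uniformly. In a real system the scorer would be a model call, which is the expensive step that the frame budget is meant to bound.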
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques
