Active Video Perception: Iterative Evidence Seeking for Agentic Long Video Understanding

Ziyang Wang; Honglu Zhou; Shijie Wang; Junnan Li; Caiming Xiong; Silvio Savarese; Mohit Bansal; Michael S. Ryoo; Juan Carlos Niebles

arXiv:2512.05774·cs.CV·December 8, 2025

TL;DR

This paper introduces Active Video Perception (AVP), an interactive framework for long video understanding that actively seeks relevant evidence, significantly improving accuracy and efficiency over existing methods.

Contribution

AVP is a novel evidence-seeking framework in which long video understanding (LVU) agents actively decide what, when, and where to observe, improving the relevance and efficiency of long video comprehension.

Findings

01

AVP achieves state-of-the-art accuracy on five LVU benchmarks.

02

AVP reduces inference time by 81.6% compared to baseline methods.

03

AVP outperforms previous agentic approaches by 5.7% in average accuracy.

Abstract

Long video understanding (LVU) is challenging because answering real-world queries often depends on sparse, temporally dispersed cues buried in hours of mostly redundant and irrelevant content. While agentic pipelines improve video reasoning capabilities, prevailing frameworks rely on a query-agnostic captioner to perceive video information, which wastes computation on irrelevant content and blurs fine-grained temporal and spatial information. Motivated by active perception theory, we argue that LVU agents should actively decide what, when, and where to observe, and continuously assess whether the current observation is sufficient to answer the query. We present Active Video Perception (AVP), an evidence-seeking framework that treats the video as an interactive environment and acquires compact, query-relevant evidence directly from pixels. Concretely, AVP runs an iterative…
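The general idea of observing a region, collecting only query-relevant evidence, and stopping once that evidence is sufficient can be illustrated with a toy sketch. This is not the authors' implementation: the abstract is truncated before it describes the actual procedure, so everything below is an assumption made for illustration. The "video" is a list of per-frame text descriptions standing in for pixels, the window policy is a simple left-to-right placeholder rather than a learned planner, and the names `Evidence`, `observe`, `is_sufficient`, and `seek_evidence` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    start: int  # first frame index covered by this evidence
    end: int    # one past the last frame index
    note: str   # what was observed (toy stand-in for pixel-level evidence)


def observe(video, start, end, query_terms):
    """Hypothetical observer: inspect frames [start, end) and keep only
    those whose description mentions at least one query term."""
    found = []
    for t in range(start, min(end, len(video))):
        if any(term in video[t] for term in query_terms):
            found.append(Evidence(t, t + 1, video[t]))
    return found


def is_sufficient(evidence, query_terms):
    """Hypothetical sufficiency check: every query term is covered by
    at least one piece of collected evidence."""
    return all(any(term in e.note for e in evidence) for term in query_terms)


def seek_evidence(video, query_terms, window=4, max_rounds=10):
    """Iteratively observe one window at a time and stop as soon as the
    collected evidence is judged sufficient. The next-window policy here
    is a plain sequential scan (an assumption, not a real planner)."""
    evidence = []
    inspected = 0  # number of frames actually looked at
    start = 0
    for _ in range(max_rounds):
        if start >= len(video):
            break
        end = min(start + window, len(video))
        evidence.extend(observe(video, start, end, query_terms))
        inspected += end - start
        if is_sufficient(evidence, query_terms):
            break  # early stop: no need to look at the rest of the video
        start = end
    return evidence, inspected


if __name__ == "__main__":
    toy_video = ["sky", "road", "red car", "road",
                 "dog", "man opens door", "sky", "sky"]
    found, n_inspected = seek_evidence(toy_video, ["car", "door"], window=2)
    for e in found:
        print(e.start, e.note)
    print("frames inspected:", n_inspected, "of", len(toy_video))
```

In this toy run the loop stops after the third window, so the last two frames are never inspected. That early stop is the only point of the sketch: the evidence store stays compact and unneeded content is skipped. A real system would replace the keyword match and the sequential window policy with model-driven components.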

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Generative Adversarial Networks and Image Synthesis