Act2See: Emergent Active Visual Perception for Video Reasoning
Martin Q. Ma, Yuxiao Qu, Aditya Agrawal, Willis Guo, Paul Pu Liang, Ruslan Salakhutdinov, Louis-Philippe Morency

TL;DR
Act2See introduces an active visual perception framework for VLMs that lets them retrieve or synthesize video frames during reasoning, achieving state-of-the-art results on VideoEspresso and ViTIB.
Contribution
A supervised fine-tuning (SFT) approach, trained on verified reasoning traces from a frontier VLM, that teaches VLMs to actively interleave video frames within their reasoning traces.
Findings
Achieves a new state of the art on the VideoEspresso and ViTIB benchmarks.
Outperforms larger models on Video-MME, EgoNormia, and VCR-Bench.
Enables models to actively determine when to retrieve or generate visual evidence.
Abstract
Vision-Language Models (VLMs) typically rely on static initial frames for video reasoning, restricting their ability to incorporate essential dynamic information as the reasoning process evolves. Existing methods that augment Chain-of-Thought (CoT) with additional frame information often exhibit suboptimal CoT quality and lack the crucial ability to synthesize visual information for hypothetical or counterfactual scenarios. We introduce Act-to-See (Act2See), a novel framework that enables active visual perception by empowering VLMs to actively interleave video frames within text CoTs. Act2See is developed via Supervised Fine-Tuning (SFT) on a high-quality dataset of reasoning traces generated by a frontier VLM. These traces integrate active calls to either retrieve existing frames or generate new ones, and are rigorously verified against human-annotated CoTs to ensure quality. This…
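To make the idea of a reasoning trace with interleaved frame calls concrete, here is a minimal Python sketch. It is not the authors' implementation: the `<retrieve>` and `<generate>` tag syntax, the names `parse_trace` and `resolve`, and the one-frame-per-second placeholder video are all assumptions invented for illustration. It only shows how a text CoT containing such calls could be split into text steps and frame actions, with each action resolved to either an existing frame or a synthesized one.

```python
import re
from dataclasses import dataclass
from typing import List, Union

# Hypothetical tag syntax, invented for this sketch only; the paper's
# actual call format is not given in the abstract.
CALL_PATTERN = re.compile(r"<(retrieve|generate)>(.*?)</\1>", re.DOTALL)


@dataclass
class TextStep:
    text: str


@dataclass
class FrameCall:
    action: str    # "retrieve" or "generate"
    argument: str  # timestamp in seconds, or a text description


def parse_trace(trace: str) -> List[Union[TextStep, FrameCall]]:
    """Split a reasoning trace into text steps and frame calls, in order."""
    steps: List[Union[TextStep, FrameCall]] = []
    cursor = 0
    for match in CALL_PATTERN.finditer(trace):
        text = trace[cursor:match.start()].strip()
        if text:
            steps.append(TextStep(text))
        steps.append(FrameCall(match.group(1), match.group(2).strip()))
        cursor = match.end()
    tail = trace[cursor:].strip()
    if tail:
        steps.append(TextStep(tail))
    return steps


def resolve(call: FrameCall, video_frames: List[str]) -> str:
    """Stub resolver (assumption): frames are placeholder strings at 1 fps."""
    if call.action == "retrieve":
        # Clamp the requested second to the available frame range.
        index = max(0, min(int(float(call.argument)), len(video_frames) - 1))
        return video_frames[index]
    # A real system would call an image generator here; this is a stand-in.
    return f"[synthetic frame: {call.argument}]"


if __name__ == "__main__":
    frames = [f"frame_{i}" for i in range(10)]
    trace = (
        "The person picks up a cup. <retrieve>3</retrieve> "
        "The cup looks empty. <generate>the cup falling off the table</generate> "
        "So the answer is B."
    )
    for step in parse_trace(trace):
        if isinstance(step, FrameCall):
            print(step.action, "->", resolve(step, frames))
        else:
            print("text:", step.text)
```

In this sketch a retrieval call indexes into frames the video already has, while a generation call stands in for synthesizing a frame of a hypothetical or counterfactual scene, mirroring the two kinds of active calls the abstract describes.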
