Act2See: Emergent Active Visual Perception for Video Reasoning

Martin Q. Ma; Yuxiao Qu; Aditya Agrawal; Willis Guo; Paul Pu Liang; Ruslan Salakhutdinov; Louis-Philippe Morency

arXiv:2605.01657 · cs.CV · May 5, 2026

TL;DR

Act2See introduces an active visual perception framework for VLMs, enabling dynamic frame retrieval and synthesis during video reasoning, leading to state-of-the-art results on multiple benchmarks.

Contribution

The paper develops a supervised fine-tuning approach that lets VLMs actively interleave video frames within their reasoning traces, improving video understanding.

Findings

01 Achieves new state-of-the-art on VideoEspresso and ViTIB benchmarks.

02 Outperforms larger models on Video-MME, EgoNormia, and VCR-Bench.

03 Enables models to actively determine when to retrieve or generate visual evidence.

Abstract

Vision-Language Models (VLMs) typically rely on static initial frames for video reasoning, restricting their ability to incorporate essential dynamic information as the reasoning process evolves. Existing methods that augment Chain-of-Thought (CoT) with additional frame information often exhibit suboptimal CoT quality and lack the crucial ability to synthesize visual information for hypothetical or counterfactual scenarios. We introduce Act-to-See (Act2See), a novel framework that enables active visual perception by empowering VLMs to actively interleave video frames within text CoTs. Act2See is developed via Supervised Fine-Tuning (SFT) on a high-quality dataset of reasoning traces generated by a frontier VLM. These traces integrate active calls to either retrieve existing frames or generate new ones, and are rigorously verified against human-annotated CoTs to ensure quality. This…
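
To make the idea of interleaved reasoning traces concrete, here is a minimal sketch of how a trace mixing text with retrieve and generate calls might be split into ordered steps. The `<retrieve>` and `<generate>` tag syntax, the `Step` class, and `parse_trace` are hypothetical names chosen for illustration; the abstract does not specify the paper's actual call format.

```python
import re
from dataclasses import dataclass
from typing import List

# Hypothetical tag syntax: the abstract does not give the real call format.
CALL_PATTERN = re.compile(r"<(retrieve|generate)>(.*?)</\1>", re.DOTALL)


@dataclass
class Step:
    kind: str     # "text", "retrieve", or "generate"
    content: str  # reasoning text, or the argument of the call


def parse_trace(trace: str) -> List[Step]:
    """Split a reasoning trace into text segments and visual calls, in order."""
    steps: List[Step] = []
    cursor = 0
    for match in CALL_PATTERN.finditer(trace):
        # Any text between the previous call and this one is a reasoning step.
        text = trace[cursor:match.start()].strip()
        if text:
            steps.append(Step("text", text))
        steps.append(Step(match.group(1), match.group(2).strip()))
        cursor = match.end()
    # Keep whatever reasoning text follows the last call.
    tail = trace[cursor:].strip()
    if tail:
        steps.append(Step("text", tail))
    return steps


# Illustrative trace, invented for this sketch (not taken from the paper's dataset).
example = (
    "The person picks up a cup. "
    "<retrieve>frame at 12s</retrieve> "
    "The cup looks empty. "
    "<generate>the same cup if it were full</generate> "
    "So the answer is B."
)
steps = parse_trace(example)
```

In a full system, each `retrieve` step would presumably be resolved to an existing video frame and each `generate` step to a synthesized image before reasoning continues; that part is outside this sketch.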
