Towards Sparse Video Understanding and Reasoning

Chenwei Xu; Zhen Ye; Shang Wu; Weijian Li; Zihan Wang; Zhuofan Xia; Lie Lu; Pranav Maneriker; Fan Du; Manling Li; Han Liu

arXiv:2602.13602 · cs.CV · February 17, 2026

TL;DR

The paper introduces REVISE, a multi-round agent for video question answering that selectively samples informative frames, maintains a summary state across rounds, and supports reinforcement fine-tuning, leading to more efficient and accurate video reasoning.

Contribution

It proposes a novel sparse video reasoning method with a plug-and-play design and introduces EAGER, a new reward for reinforcement fine-tuning of vision-language models.

Findings

01

Improves accuracy on multiple VQA benchmarks.

02

Reduces frames, rounds, and prompt tokens needed.

03

Enables reinforcement fine-tuning with EAGER reward.

Abstract

We present REVISE (Reasoning with Video Sparsity), a multi-round agent for video question answering (VQA). Instead of uniformly sampling frames, REVISE selects a small set of informative frames, maintains a summary-as-state across rounds, and stops early when confident. It supports proprietary vision-language models (VLMs) in a "plug-and-play" setting and enables reinforcement fine-tuning for open-source models. For fine-tuning, we introduce EAGER (Evidence-Adjusted Gain for Efficient Reasoning), an annotation-free reward with three terms: (1) Confidence gain: after new frames are added, we reward the increase in the log-odds gap between the correct option and the strongest alternative; (2) Summary sufficiency: at answer time we re-ask using only the last committed summary and reward success; (3) Correct-and-early stop: answering correctly within…
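The first two EAGER terms can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the epsilon clamp, and the reading of "log-odds gap" as the difference in log-probability between the correct option and the strongest other option are all assumptions made for illustration. The third term and the way the terms are weighted are not specified in the text above, so they are left out.

```python
import math

EPS = 1e-12  # assumption: clamp to avoid log(0); not taken from the paper


def log_odds_gap(probs, correct):
    """Log-probability of the correct option minus that of the strongest
    alternative. `probs` maps option labels to model probabilities.
    (Hypothetical helper; one possible reading of "log-odds gap".)"""
    best_other = max(p for label, p in probs.items() if label != correct)
    return math.log(max(probs[correct], EPS)) - math.log(max(best_other, EPS))


def confidence_gain(probs_before, probs_after, correct):
    """Term (1): increase in the gap after new frames are added."""
    return log_odds_gap(probs_after, correct) - log_odds_gap(probs_before, correct)


def summary_sufficiency(answer_from_summary, correct):
    """Term (2): 1.0 if re-asking with only the last committed summary
    still yields the correct option, else 0.0 (assumed binary reward)."""
    return 1.0 if answer_from_summary == correct else 0.0


# Illustrative option probabilities before and after one round of new frames.
before = {"A": 0.40, "B": 0.35, "C": 0.25}
after = {"A": 0.70, "B": 0.20, "C": 0.10}

gain = confidence_gain(before, after, correct="A")
sufficient = summary_sufficiency("A", correct="A")
```

In this toy case the gain is positive because the new frames widen the margin between option A and its closest competitor; a round whose frames narrowed that margin would receive a negative value under this reading.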

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning