VideoBrain: Learning Adaptive Frame Sampling for Long Video Understanding
Junbo Zou, Ziheng Huang, Shengjie Zhang, Liwen Zhang, and Weining Shen

TL;DR
VideoBrain introduces an adaptive frame sampling framework for long video understanding, enabling vision-language models to efficiently select informative frames through learned policies, improving accuracy while reducing computational costs.
Contribution
VideoBrain presents an end-to-end adaptive sampling method with dual agents and a behavior-aware reward, improving long video comprehension over prior uniform or single-pass sampling approaches.
Findings
Achieves 3.5% to 9.0% accuracy improvements on benchmarks.
Uses 30-40% fewer frames than baseline methods.
Demonstrates strong generalization across datasets.
Abstract
Long-form video understanding remains challenging for Vision-Language Models (VLMs) due to the inherent tension between computational constraints and the need to capture information distributed across thousands of frames. Existing approaches either sample frames uniformly (risking information loss) or select keyframes in a single pass (with no recovery from poor choices). We propose VideoBrain, an end-to-end framework that enables VLMs to adaptively acquire visual information through learned sampling policies. Our approach features dual complementary agents: a CLIP-based agent for semantic retrieval across the video and a Uniform agent for dense temporal sampling within intervals. Unlike prior agent-based methods that rely on text-only LLMs orchestrating visual tools, our VLM directly perceives frames and reasons about information sufficiency. To prevent models from invoking agents…
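The two sampling primitives named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `uniform_sample` and `semantic_top_k` are hypothetical, plain Python lists stand in for CLIP frame and query embeddings, and the learned policy that decides when to call each agent is not shown.

```python
import math


def uniform_sample(start, end, n):
    """Pick up to n frame indices spread evenly over [start, end], inclusive.

    Stands in for the "Uniform agent": dense temporal sampling in an interval.
    """
    if n <= 0 or end < start:
        return []
    if n == 1:
        return [(start + end) // 2]
    step = (end - start) / (n - 1)
    picked = []
    for i in range(n):
        idx = start + round(i * step)
        # Short intervals can map several steps to one frame; keep each once.
        if idx not in picked:
            picked.append(idx)
    return picked


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_top_k(frame_embeddings, query_embedding, k):
    """Return the k frame indices most similar to the query, in temporal order.

    Stands in for the "CLIP-based agent": in practice the embeddings would come
    from an image-text encoder; here they are assumed to be precomputed.
    """
    ranked = sorted(
        range(len(frame_embeddings)),
        key=lambda i: cosine(frame_embeddings[i], query_embedding),
        reverse=True,
    )
    return sorted(ranked[:k])


if __name__ == "__main__":
    # Toy 2-D "embeddings" for four frames and one query (assumed values).
    frames = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
    query = [1.0, 0.0]
    print(semantic_top_k(frames, query, k=2))
    print(uniform_sample(0, 100, 5))
```

In this sketch the semantic call searches the whole video for frames matching a query, while the uniform call densely covers one interval, which mirrors the complementary roles the abstract assigns to the two agents.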
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis
