Adaptive Greedy Frame Selection for Long Video Understanding
Yuning Huang, Xiaoyu Ji, Joseph Huang, Yichi Zhang, Fengqing Zhu

TL;DR
This paper introduces a question-adaptive greedy frame selection method for long-video understanding that balances relevance and coverage, improving accuracy over naive sampling methods.
Contribution
It proposes a novel greedy selection algorithm that jointly optimizes relevance and coverage, with adaptive strategies for different question types, backed by theoretical guarantees.
Findings
Consistent accuracy improvements over uniform sampling.
Significant gains under tight frame budgets.
Effective question-type routing enhances performance.
Abstract
Large vision-language models (VLMs) are increasingly applied to long-video question answering, yet inference is often bottlenecked by the number of input frames and resulting visual tokens. Naive sparse sampling can miss decisive moments, while purely relevance-driven selection frequently collapses onto near-duplicate frames and sacrifices coverage of temporally distant evidence. We propose a question-adaptive greedy frame selection method that jointly optimizes query relevance and semantic representativeness under a fixed frame budget. Our approach constructs a 1 FPS candidate pool (capped at 1000) with exact timestamp alignment, embeds candidates in two complementary spaces (SigLIP for question relevance and DINOv2 for semantic similarity), and selects frames by greedily maximizing a weighted sum of a modular relevance term and a facility-location coverage term. This objective is…
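The greedy step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `greedy_select`, the weight `lam`, and the assumption that relevance scores and a non-negative pairwise similarity matrix are already computed (for example from SigLIP and DINOv2 embeddings) are all hypothetical choices made for illustration. Question-type routing and candidate-pool construction are not modeled.

```python
from typing import List, Sequence


def greedy_select(relevance: Sequence[float],
                  similarity: Sequence[Sequence[float]],
                  budget: int,
                  lam: float = 1.0) -> List[int]:
    """Greedily pick up to `budget` frame indices that maximize
    F(S) = sum_{i in S} relevance[i]
           + lam * sum_j max_{i in S} similarity[j][i].

    Assumes similarity values are non-negative (hypothetical convention),
    so an empty selection covers every candidate with value 0.
    """
    n = len(relevance)
    selected: List[int] = []
    chosen = [False] * n
    # best_cover[j]: highest similarity between candidate j and any selected frame
    best_cover = [0.0] * n

    for _ in range(min(budget, n)):
        best_idx, best_gain = -1, float("-inf")
        for i in range(n):
            if chosen[i]:
                continue
            # Marginal facility-location gain: how much frame i improves
            # the coverage of every candidate j.
            cover_gain = sum(
                max(similarity[j][i] - best_cover[j], 0.0) for j in range(n)
            )
            gain = relevance[i] + lam * cover_gain
            if gain > best_gain:
                best_idx, best_gain = i, gain
        selected.append(best_idx)
        chosen[best_idx] = True
        for j in range(n):
            best_cover[j] = max(best_cover[j], similarity[j][best_idx])

    # Return indices in temporal order, assuming candidates are time-sorted.
    return sorted(selected)
```

On a toy input where the two most relevant frames are near-duplicates, setting `lam=0.0` reduces the sketch to pure relevance ranking and selects both duplicates, while a positive `lam` trades the second duplicate for a frame that covers otherwise unrepresented content.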
