HFS: Holistic Query-Aware Frame Selection for Efficient Video Reasoning
Yiqing Yang, Kin-Man Lam

TL;DR
This paper introduces an end-to-end trainable, task-adaptive frame selection framework for video reasoning that dynamically optimizes frame relevance, coverage, and redundancy, outperforming existing methods across multiple benchmarks.
Contribution
It proposes a novel holistic, query-aware frame selection method using a Chain-of-Thought guided language model and differentiable set-level optimization, eliminating reliance on static pseudo labels.
Findings
Significantly outperforms existing methods on multiple benchmarks.
Effectively balances relevance, coverage, and redundancy in frame selection.
Enables dynamic, task-specific frame selection through end-to-end training.
Abstract
Key frame selection in video understanding presents significant challenges. Traditional top-K selection methods, which score frames independently, often fail to optimize the selection as a whole. This independent scoring frequently results in selecting frames that are temporally clustered and visually redundant. Additionally, training lightweight selectors using pseudo labels generated offline by Multimodal Large Language Models (MLLMs) prevents the supervisory signal from dynamically adapting to task objectives. To address these limitations, we propose an end-to-end trainable, task-adaptive framework for frame selection. A Chain-of-Thought approach guides a Small Language Model (SLM) to generate task-specific implicit query vectors, which are combined with multimodal features to enable dynamic frame scoring. We further define a continuous set-level objective function that incorporates…
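The contrast the abstract draws between independent top-K scoring and set-level selection can be illustrated with a toy example. The sketch below is not the paper's method: it swaps the differentiable set-level objective for a simple greedy, non-differentiable rule that penalizes similarity to frames already chosen. The function names, the cosine-similarity redundancy penalty, the `redundancy_weight` parameter, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def top_k_select(relevance, k):
    """Baseline: pick the k highest-scoring frames, each scored independently."""
    order = np.argsort(-relevance, kind="stable")
    return sorted(order[:k].tolist())

def greedy_set_select(relevance, features, k, redundancy_weight=1.0):
    """Hypothetical set-aware selector (illustrative stand-in, not the paper's objective).

    At each step, choose the frame whose relevance, minus its maximum cosine
    similarity to any already-selected frame, is largest.
    """
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    similarity = unit @ unit.T  # pairwise cosine similarity between frames
    selected = []
    for _ in range(k):
        gain = relevance.astype(float)  # astype returns a fresh copy
        if selected:
            # Redundancy penalty: closeness to the nearest chosen frame.
            gain = gain - redundancy_weight * similarity[:, selected].max(axis=1)
            gain[selected] = -np.inf  # never pick the same frame twice
        selected.append(int(np.argmax(gain)))
    return sorted(selected)

# Toy video: frames 2-4 are near-duplicates with the highest relevance scores.
relevance = np.array([0.2, 0.3, 0.9, 0.95, 0.9, 0.3, 0.6, 0.5])
features = np.array([
    [1, 0, 0], [1, 0, 0],             # scene A
    [0, 1, 0], [0, 1, 0], [0, 1, 0],  # scene B (redundant cluster)
    [0, 0, 1], [0, 0, 1],             # scene C
    [1, 0, 0],                        # scene A again
], dtype=float)

top_k_frames = top_k_select(relevance, k=3)
set_frames = greedy_set_select(relevance, features, k=3)
print(top_k_frames)  # [2, 3, 4]
print(set_frames)    # [3, 6, 7]
```

On this toy input, independent top-K returns three temporally adjacent frames from the same scene, while the set-aware rule returns one frame from each scene. This is the clustering and redundancy problem the abstract describes, shown under the stated assumptions.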
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition
