HFS: Holistic Query-Aware Frame Selection for Efficient Video Reasoning

Yiqing Yang; Kin-Man Lam

arXiv:2512.11534·cs.CV·December 15, 2025

TL;DR

This paper introduces an end-to-end trainable, task-adaptive frame selection framework for video reasoning that dynamically optimizes frame relevance, coverage, and redundancy, outperforming existing methods across multiple benchmarks.

Contribution

It proposes a novel holistic, query-aware frame selection method using a Chain-of-Thought guided language model and differentiable set-level optimization, eliminating reliance on static pseudo labels.

Findings

01. Significantly outperforms existing methods on multiple benchmarks.

02. Effectively balances relevance, coverage, and redundancy in frame selection.

03. Enables dynamic, task-specific frame selection through end-to-end training.

Abstract

Key frame selection in video understanding presents significant challenges. Traditional top-K selection methods, which score frames independently, often fail to optimize the selection as a whole. This independent scoring frequently results in selecting frames that are temporally clustered and visually redundant. Additionally, training lightweight selectors using pseudo labels generated offline by Multimodal Large Language Models (MLLMs) prevents the supervisory signal from dynamically adapting to task objectives. To address these limitations, we propose an end-to-end trainable, task-adaptive framework for frame selection. A Chain-of-Thought approach guides a Small Language Model (SLM) to generate task-specific implicit query vectors, which are combined with multimodal features to enable dynamic frame scoring. We further define a continuous set-level objective function that incorporates…
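To make the contrast between independent top-K scoring and set-level selection concrete, here is a minimal toy sketch. It is not the paper's method: the abstract is cut off before the objective is spelled out, so the score below (summed relevance, plus temporal coverage, minus mean pairwise feature similarity), the weights `w_cov` and `w_red`, the greedy search, and all function names are assumptions chosen for illustration. In particular, the paper describes a continuous, differentiable objective trained end to end, whereas this sketch uses a simple discrete greedy search.

```python
import numpy as np

def top_k(relevance, k):
    """Baseline: pick the k frames with the highest independent scores."""
    return sorted(np.argsort(-relevance, kind="stable")[:k].tolist())

def set_score(selected, relevance, features, times, w_cov=1.0, w_red=1.0):
    """Hypothetical set-level score: relevance + coverage - redundancy."""
    sel = list(selected)
    rel = relevance[sel].sum()
    # Coverage: how close, on average, every frame is to its nearest selected frame in time.
    span = times.max() - times.min() + 1e-8
    nearest = np.abs(times[:, None] - times[sel][None, :]).min(axis=1)
    cov = 1.0 - (nearest / span).mean()
    # Redundancy: mean pairwise cosine similarity among the selected frames.
    f = features[sel]
    f = f / (np.linalg.norm(f, axis=1, keepdims=True) + 1e-8)
    sim = f @ f.T
    n = len(sel)
    red = (sim.sum() - np.trace(sim)) / (n * (n - 1)) if n > 1 else 0.0
    return rel + w_cov * cov - w_red * red

def greedy_select(relevance, features, times, k, **weights):
    """Greedily add the frame that most improves the set-level score."""
    selected = []
    for _ in range(k):
        candidates = [i for i in range(len(relevance)) if i not in selected]
        best = max(
            candidates,
            key=lambda i: set_score(selected + [i], relevance, features, times, **weights),
        )
        selected.append(best)
    return sorted(selected)

# Toy video: frames 0-2 are near-duplicates with high relevance; frame 6 is
# visually different and moderately relevant. All values are made up.
times = np.arange(8, dtype=float)
relevance = np.array([0.90, 0.88, 0.86, 0.10, 0.10, 0.10, 0.70, 0.10])
features = np.array([[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 4)

print(top_k(relevance, 2))                           # [0, 1]
print(greedy_select(relevance, features, times, 2))  # [2, 6]
```

On this toy input, top-K returns two adjacent, visually identical frames, while the set-level score trades a small loss in summed relevance for a pair that spans the clip and is not redundant.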

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition