HORNet: Task-Guided Frame Selection for Video Question Answering with Vision-Language Models
Xiangyu Bai, Bishoy Galoaa, Sarah Ostadabbas

TL;DR
HORNet is a lightweight, trainable frame selection policy for video question answering. It cuts input frames by up to 99% and VLM processing time by up to 93% while improving answer quality on short-form benchmarks. The paper formalizes this setting as the Select Any Frames (SAF) task.
Contribution
Introduces HORNet, a frame selection policy trained with GRPO that improves both VQA answer quality and efficiency; formalizes the Select Any Frames (SAF) task; and demonstrates cross-model transferability.
Findings
HORNet reduces input frames by up to 99% and VLM processing time by up to 93%.
HORNet improves answer quality on short-form benchmarks (+1.7% F1 on MSVD-QA) and on temporal reasoning tasks (+7.3 points over uniform sampling on NExT-QA).
GRPO-trained selection generalizes better out-of-distribution than supervised and PPO methods.
Abstract
Video question answering (VQA) with vision-language models (VLMs) depends critically on which frames are selected from the input video, yet most systems rely on uniform or heuristic sampling that cannot be optimized for downstream answering quality. We introduce HORNet, a lightweight frame selection policy trained with Group Relative Policy Optimization (GRPO) to learn which frames a frozen VLM needs to answer questions correctly. With fewer than 1M trainable parameters, HORNet reduces input frames by up to 99% and VLM processing time by up to 93%, while improving answer quality on short-form benchmarks (+1.7% F1 on MSVD-QA) and achieving strong performance on temporal reasoning tasks (+7.3 points over uniform sampling on NExT-QA). We formalize this as Select Any Frames (SAF), a task that decouples visual input curation from VLM reasoning, and show that GRPO-trained…
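The abstract describes a selector that samples frames and is rewarded by how well a frozen VLM answers. The following is a minimal sketch of that loop's two generic ingredients, not the authors' implementation: sampling k distinct frames from per-frame scores (here via the Gumbel top-k trick, an assumed choice) and standardizing rewards within a group of samples, which is the group-relative advantage used by GRPO. The names `sample_frames`, `group_relative_advantages`, and `toy_reward` are hypothetical, and `toy_reward` is only a stand-in for the frozen VLM's answer score.

```python
import math
import random
from statistics import mean, pstdev


def sample_frames(scores, k, rng):
    """Sample k distinct frame indices with the Gumbel top-k trick.

    `scores` are unnormalized per-frame logits from a hypothetical selector.
    """
    noisy = []
    for idx, score in enumerate(scores):
        # Clamp the uniform draw so neither logarithm receives 0.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((score + gumbel, idx))
    top = sorted(noisy, reverse=True)[:k]
    return sorted(idx for _, idx in top)


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def toy_reward(selected, relevant):
    """Assumed stand-in for the frozen VLM: fraction of relevant frames hit."""
    return len(set(selected) & set(relevant)) / len(relevant)


rng = random.Random(0)
scores = [0.0] * 32      # untrained selector: uniform logits over 32 frames
relevant = [5, 6, 7]     # hypothetical frames that contain the answer

# One group: several frame subsets sampled for the same (video, question).
group = [sample_frames(scores, k=4, rng=rng) for _ in range(8)]
rewards = [toy_reward(sel, relevant) for sel in group]
advantages = group_relative_advantages(rewards)
```

In a training loop, each advantage would weight the log-probability of its sampled subset in the policy-gradient loss, so subsets that score above the group average are made more likely. No learned value function is needed, which suits a selector with very few trainable parameters.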
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
