HORNet: Task-Guided Frame Selection for Video Question Answering with Vision-Language Models

Xiangyu Bai; Bishoy Galoaa; Sarah Ostadabbas

arXiv:2603.18850 · cs.CV · March 20, 2026

TL;DR

HORNet is a lightweight, trainable frame selection policy for video question answering. It cuts input frames by up to 99% and VLM processing time by up to 93% while improving answer quality on short-form benchmarks. The paper formalizes the problem as the Select Any Frames (SAF) task.

Contribution

Introduces HORNet, a frame selection policy trained with Group Relative Policy Optimization (GRPO) that improves both VQA answer quality and efficiency; formalizes the Select Any Frames (SAF) task; and demonstrates cross-model transferability.

Findings

01  HORNet reduces input frames by up to 99% and VLM processing time by up to 93%.

02  HORNet improves answer quality on short-form benchmarks (+1.7% F1 on MSVD-QA) and on temporal reasoning tasks (+7.3 points over uniform sampling on NExT-QA).

03  GRPO-trained selection generalizes better out-of-distribution than supervised and PPO methods.

Abstract

Video question answering (VQA) with vision-language models (VLMs) depends critically on which frames are selected from the input video, yet most systems rely on uniform or heuristic sampling that cannot be optimized for downstream answering quality. We introduce HORNet, a lightweight frame selection policy trained with Group Relative Policy Optimization (GRPO) to learn which frames a frozen VLM needs to answer questions correctly. With fewer than 1M trainable parameters, HORNet reduces input frames by up to 99% and VLM processing time by up to 93%, while improving answer quality on short-form benchmarks (+1.7% F1 on MSVD-QA) and achieving strong performance on temporal reasoning tasks (+7.3 points over uniform sampling on NExT-QA). We formalize this as Select Any Frames (SAF), a task that decouples visual input curation from VLM reasoning, and show that GRPO-trained…
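
The abstract names GRPO as the training signal for the frame selector but does not describe the policy architecture, the reward, or the sampling scheme. As a rough illustration only, the sketch below shows the general shape of a group-relative update for frame selection: sample several candidate frame subsets for one question, score each, and normalize the rewards within the group. Everything here is an assumption made for illustration: the function names, the Gumbel top-k sampler, the per-frame scores, and the placeholder rewards are hypothetical and are not taken from the paper.

```python
import math
import random


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own group's mean and std."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


def sample_frames(scores, k, rng):
    """Draw k distinct frame indices with the Gumbel top-k trick.

    This samples without replacement in proportion to softmax(scores).
    It is an assumed sampler, not necessarily the one used by HORNet.
    """
    noisy = []
    for idx, score in enumerate(scores):
        u = max(rng.random(), 1e-12)  # guard against log(0)
        gumbel = -math.log(-math.log(u))
        noisy.append((score + gumbel, idx))
    top = sorted(noisy, reverse=True)[:k]
    return sorted(idx for _, idx in top)


rng = random.Random(0)

# Hypothetical per-frame scores from a small selector network (8 frames).
scores = [0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 1.0]

# One group: four candidate 2-frame subsets for the same question.
group = [sample_frames(scores, k=2, rng=rng) for _ in range(4)]

# Placeholder rewards; in practice these would come from scoring the
# frozen VLM's answer for each subset (assumed, not specified here).
rewards = [1.0, 0.0, 1.0, 0.0]
advantages = group_relative_advantages(rewards)

for frames, adv in zip(group, advantages):
    print(frames, round(adv, 3))
```

Subsets whose answers score above the group average receive positive advantages and would have their selection probability increased; the frozen VLM itself is never updated in this setup, which matches the abstract's description of decoupling frame curation from VLM reasoning.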

Code & Models

Models

🤗 bishoygaloaa/HORnet (model)

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling