Q-Frame: Query-aware Frame Selection and Multi-Resolution Adaptation for Video-LLMs
Shaojie Zhang, Jiahui Yang, Jianqin Yin, Zhenbo Luo, Jian Luan

TL;DR
Q-Frame introduces an adaptive, query-aware frame selection method for Video-LLMs that enhances video comprehension by efficiently capturing crucial spatiotemporal information without increasing computational costs.
Contribution
The paper presents a training-free, plug-and-play approach for adaptive frame selection and multi-resolution scaling tailored to specific queries in video understanding.
Findings
Q-Frame outperforms existing methods on benchmark datasets.
It enables processing more frames without additional computational costs.
It demonstrates versatility across various video understanding tasks.
Abstract
Multimodal Large Language Models (MLLMs) have demonstrated significant success in visual understanding tasks. However, challenges persist in adapting these models for video comprehension due to the large volume of data and temporal complexity. Existing Video-LLMs using uniform frame sampling often struggle to capture the query-related crucial spatiotemporal clues of videos effectively. In this paper, we introduce Q-Frame, a novel approach for adaptive frame selection and multi-resolution scaling tailored to the video's content and the specific query. Q-Frame employs a training-free, plug-and-play strategy generated by a text-image matching network like CLIP, utilizing the Gumbel-Max trick for efficient frame selection. Q-Frame allows Video-LLMs to process more frames without exceeding computational limits, thereby preserving critical temporal and spatial information. We demonstrate…
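The abstract names two ingredients: query-frame matching scores from a CLIP-like network, and the Gumbel-Max trick for frame selection. The sketch below illustrates that general idea and is not the authors' implementation. It assumes the matching scores are already computed and passed in as plain floats, uses the top-k variant of the Gumbel-Max trick to sample k distinct frames, and then maps the ranked frames to resolution tiers. The function names `gumbel_top_k` and `assign_resolutions`, the temperature `tau`, and the tier values are all hypothetical.

```python
import math
import random


def gumbel_top_k(scores, k, tau=1.0, rng=None):
    """Sample k distinct frame indices, favoring frames with higher scores.

    scores: query-frame matching scores (assumed to come from a CLIP-like model).
    tau:    hypothetical temperature; smaller values make selection greedier.
    Returns the indices ordered from highest to lowest perturbed score.
    """
    rng = rng or random.Random()
    perturbed = []
    for idx, score in enumerate(scores):
        # random() lies in [0, 1); clamp away from 0 so log(u) stays finite.
        u = max(rng.random(), 1e-12)
        gumbel = -math.log(-math.log(u))  # standard Gumbel(0, 1) noise
        perturbed.append((score / tau + gumbel, idx))
    perturbed.sort(reverse=True)
    return [idx for _, idx in perturbed[:k]]


def assign_resolutions(ranked, tiers):
    """Map ranked frame indices to resolutions.

    tiers: list of (count, resolution) pairs, highest resolution first.
    This tiering scheme is an illustrative assumption, not taken from the paper.
    """
    plan = {}
    pos = 0
    for count, resolution in tiers:
        for idx in ranked[pos:pos + count]:
            plan[idx] = resolution
        pos += count
    return plan


if __name__ == "__main__":
    # Made-up scores for an 8-frame clip, for illustration only.
    scores = [0.12, 0.55, 0.31, 0.80, 0.25, 0.67, 0.09, 0.40]
    ranked = gumbel_top_k(scores, k=5, tau=0.5, rng=random.Random(0))
    print(assign_resolutions(ranked, [(1, 448), (2, 224), (2, 112)]))
```

Giving lower-ranked frames a smaller resolution is one way a fixed compute budget could cover more frames, which is the trade-off the abstract describes. The concrete tier sizes here are placeholders.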
Taxonomy
Topics: Image and Video Quality Assessment · Video Coding and Compression Technologies · Advanced Vision and Imaging
Methods: Contrastive Language-Image Pre-training
