Q-Frame: Query-aware Frame Selection and Multi-Resolution Adaptation for Video-LLMs

Shaojie Zhang; Jiahui Yang; Jianqin Yin; Zhenbo Luo; Jian Luan

arXiv:2506.22139 · cs.CV · July 23, 2025

TL;DR

Q-Frame introduces an adaptive, query-aware frame selection method for Video-LLMs that enhances video comprehension by efficiently capturing crucial spatiotemporal information without increasing computational costs.

Contribution

The paper presents a training-free, plug-and-play approach for adaptive frame selection and multi-resolution scaling tailored to specific queries in video understanding.

Findings

01

Q-Frame outperforms existing methods on benchmark datasets.

02

It enables processing more frames without additional computational costs.

03

It demonstrates versatility across various video understanding tasks.

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated significant success in visual understanding tasks. However, challenges persist in adapting these models for video comprehension due to the large volume of data and temporal complexity. Existing Video-LLMs using uniform frame sampling often struggle to capture the query-related crucial spatiotemporal clues of videos effectively. In this paper, we introduce Q-Frame, a novel approach for adaptive frame selection and multi-resolution scaling tailored to the video's content and the specific query. Q-Frame employs a training-free, plug-and-play strategy generated by a text-image matching network like CLIP, utilizing the Gumbel-Max trick for efficient frame selection. Q-Frame allows Video-LLMs to process more frames without exceeding computational limits, thereby preserving critical temporal and spatial information. We demonstrate…
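The selection step named in the abstract can be illustrated with a minimal sketch. It assumes each frame already has a query-frame similarity score (for example from a CLIP-style text-image matcher) and applies the Gumbel-Max trick in its top-k form: add independent Gumbel noise to the scaled scores and keep the k largest. The function names, the `temperature` parameter, the example scores, and the three resolution tiers are all hypothetical. The abstract does not specify these details, so this is not the authors' implementation.

```python
import math
import random


def gumbel_top_k(scores, k, temperature=1.0, rng=None):
    """Return k distinct frame indices, highest perturbed score first.

    Adding independent Gumbel(0, 1) noise to scores / temperature and taking
    the argmax draws one index from softmax(scores / temperature); keeping
    the k largest perturbed scores draws k indices without replacement.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if not 0 <= k <= len(scores):
        raise ValueError("k must be between 0 and len(scores)")
    rng = rng or random.Random()
    perturbed = []
    for index, score in enumerate(scores):
        # Clamp the uniform draw away from 0 and 1 so both logs stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel_noise = -math.log(-math.log(u))
        perturbed.append((score / temperature + gumbel_noise, index))
    perturbed.sort(reverse=True)
    return [index for _, index in perturbed[:k]]


def assign_resolutions(ranked_indices, tiers):
    """Map ranked frame indices to resolution labels.

    tiers is a list of (count, label) pairs consumed in rank order, so the
    best-ranked frames receive the first (assumed highest) resolution.
    """
    plan = {}
    position = 0
    for count, label in tiers:
        for index in ranked_indices[position:position + count]:
            plan[index] = label
        position += count
    return plan


if __name__ == "__main__":
    # Hypothetical similarity scores for eight candidate frames.
    similarities = [0.21, 0.35, 0.18, 0.62, 0.57, 0.20, 0.33, 0.48]
    ranked = gumbel_top_k(similarities, k=6, temperature=0.1,
                          rng=random.Random(0))
    # Hypothetical tiers: 1 high, 2 medium, 3 low resolution frames.
    plan = assign_resolutions(ranked, [(1, "high"), (2, "medium"), (3, "low")])
    for index in sorted(plan):
        print(index, plan[index])
```

A lower `temperature` makes the selection closer to a plain top-k over the similarity scores, while a higher one spreads the picks more evenly across frames. Spending fewer visual tokens on lower-ranked frames is one way a fixed budget could cover more frames, which is the trade-off the abstract describes.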

Code & Models

Datasets

QuanzhuNiu/MeViS-Qframe
dataset · 29 downloads

Taxonomy

Topics: Image and Video Quality Assessment · Video Coding and Compression Technologies · Advanced Vision and Imaging

Methods: Contrastive Language-Image Pre-training