Where to Focus: Query-Modulated Multimodal Keyframe Selection for Long Video Understanding

Shaoguang Wang; Weiyu Guo; Ziyang Chen; Xuming Hu; Hui Xiong

arXiv:2604.17422·cs.CV·April 21, 2026

TL;DR

The paper introduces Q-Gate, a training-free, query-modulated framework for selecting keyframes in long videos. It improves multimodal reasoning by routing across modalities according to query intent.

Contribution

Q-Gate is a novel, training-free approach that dynamically allocates attention to different expert streams for keyframe selection based on query context, enhancing video understanding.

Findings

01. Q-Gate outperforms state-of-the-art methods on LongVideoBench and Video-MME.

02. It effectively suppresses modality-specific noise in multimodal video reasoning.

03. The approach is highly interpretable and adaptable across multiple MLLM backbones.

Abstract

Long video understanding remains a formidable challenge for Multimodal Large Language Models (MLLMs) due to the prohibitive computational cost of processing dense frame sequences. Prevailing solutions, which select a keyframe subset, typically rely on either a single visual-centric metric (e.g., CLIP similarity) or a static fusion of heuristic scores. This "one-size-fits-all" paradigm frequently fails: visual-only metrics are ineffective for plot-driven narrative queries, while indiscriminately incorporating textual scores introduces severe "modal noise" for purely visual tasks. To break this bottleneck, we propose Q-Gate, a plug-and-play and training-free framework that treats keyframe selection as a dynamic modality routing problem. We decouple the retrieval process into three lightweight expert streams: Visual Grounding for local details, Global Matching for scene semantics, and…
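
The routing idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the min-max normalization, the weighted-sum fusion, and the top-k rule are all assumptions made for illustration, and the gate weights are taken as given because the excerpt does not say how Q-Gate derives them from the query. The third expert stream is truncated in the abstract, so the example uses only the two streams it names.

```python
from typing import Dict, List


def normalize(scores: List[float]) -> List[float]:
    """Min-max scale one stream's per-frame scores into [0, 1] (assumed step)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def select_keyframes(
    stream_scores: Dict[str, List[float]],
    gate_weights: Dict[str, float],
    k: int,
) -> List[int]:
    """Fuse per-frame scores with query-dependent weights, keep the top-k frames.

    stream_scores: hypothetical per-frame relevance scores, one list per stream.
    gate_weights:  hypothetical per-stream weights chosen for the current query.
    Returns the selected frame indices in temporal order.
    """
    total = sum(gate_weights.values())
    if total <= 0:
        raise ValueError("gate weights must sum to a positive value")
    n_frames = len(next(iter(stream_scores.values())))
    fused = [0.0] * n_frames
    for name, scores in stream_scores.items():
        if len(scores) != n_frames:
            raise ValueError(f"stream {name!r} has a different frame count")
        # A stream with zero weight contributes nothing, which is how a
        # query-specific gate can suppress an unhelpful modality.
        weight = gate_weights.get(name, 0.0) / total
        for i, s in enumerate(normalize(scores)):
            fused[i] += weight * s
    top = sorted(range(n_frames), key=lambda i: fused[i], reverse=True)[:k]
    return sorted(top)


# Toy scores for a four-frame video (made-up numbers, not from the paper).
scores = {
    "visual_grounding": [0.1, 0.9, 0.2, 0.4],
    "global_matching": [0.8, 0.1, 0.7, 0.2],
}
detail_query = {"visual_grounding": 1.0, "global_matching": 0.0}
scene_query = {"visual_grounding": 0.0, "global_matching": 1.0}

print(select_keyframes(scores, detail_query, k=2))
print(select_keyframes(scores, scene_query, k=2))
```

With the same frame scores, the two weight settings pick different frames, which is the behaviour a static fusion with fixed weights cannot provide.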
