Where to Focus: Query-Modulated Multimodal Keyframe Selection for Long Video Understanding
Shaoguang Wang, Weiyu Guo, Ziyang Chen, Xuming Hu, Hui Xiong

TL;DR
The paper introduces Q-Gate, a training-free, query-modulated framework for selecting keyframes in long videos. It improves multimodal reasoning by routing between modalities according to the intent of the query.
Contribution
Q-Gate is a training-free, plug-and-play approach that treats keyframe selection as a dynamic modality routing problem: it weights lightweight expert streams according to the query rather than fusing their scores statically.
Findings
Q-Gate outperforms state-of-the-art methods on LongVideoBench and Video-MME.
It effectively suppresses modality-specific noise in multimodal video reasoning.
The approach is highly interpretable and adaptable across multiple MLLM backbones.
Abstract
Long video understanding remains a formidable challenge for Multimodal Large Language Models (MLLMs) due to the prohibitive computational cost of processing dense frame sequences. Prevailing solutions, which select a keyframe subset, typically rely on either a single visual-centric metric (e.g., CLIP similarity) or a static fusion of heuristic scores. This "one-size-fits-all" paradigm frequently fails: visual-only metrics are ineffective for plot-driven narrative queries, while indiscriminately incorporating textual scores introduces severe "modal noise" for purely visual tasks. To break this bottleneck, we propose Q-Gate, a plug-and-play and training-free framework that treats keyframe selection as a dynamic modality routing problem. We decouple the retrieval process into three lightweight expert streams: Visual Grounding for local details, Global Matching for scene semantics, and…
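The routing idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the min-max normalization, and the weighted-sum fusion are all assumptions made for illustration. The gate weights are passed in by hand because this excerpt does not say how Q-Gate derives them from the query. Only the two expert streams named in full above appear, since the third is cut off in the excerpt.

```python
from typing import Dict, List


def min_max(scores: List[float]) -> List[float]:
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def select_keyframes(
    expert_scores: Dict[str, List[float]],
    gate_weights: Dict[str, float],
    k: int,
) -> List[int]:
    """Return indices of the k frames with the highest gated score, in temporal order.

    expert_scores: per-frame relevance scores from each expert stream (hypothetical inputs).
    gate_weights:  per-stream weights, assumed to come from some query-dependent gate.
    """
    total = sum(gate_weights.values())
    if total <= 0:
        raise ValueError("at least one gate weight must be positive")

    n_frames = len(next(iter(expert_scores.values())))
    fused = [0.0] * n_frames
    for name, scores in expert_scores.items():
        # Streams without a gate weight contribute nothing.
        w = gate_weights.get(name, 0.0) / total
        for i, s in enumerate(min_max(scores)):
            fused[i] += w * s

    top = sorted(range(n_frames), key=lambda i: fused[i], reverse=True)[:k]
    return sorted(top)


# Toy scores for a four-frame video (made-up numbers).
scores = {
    "visual_grounding": [0.1, 0.9, 0.2, 0.4],
    "global_matching": [0.8, 0.1, 0.7, 0.2],
}

# A detail-oriented query routes all weight to the visual grounding stream.
detail_frames = select_keyframes(
    scores, {"visual_grounding": 1.0, "global_matching": 0.0}, k=2
)

# A scene-level query routes all weight to the global matching stream.
scene_frames = select_keyframes(
    scores, {"visual_grounding": 0.0, "global_matching": 1.0}, k=2
)
```

With the same per-frame scores, changing only the gate weights changes which frames are kept. That is the contrast the abstract draws with static fusion, where the weights would be fixed for every query.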
