Query-Conditioned Evidential Keyframe Sampling for MLLM-Based Long-Form Video Understanding
Yiheng Wang, Lichen Zhu, Yueqian Lin, Yudong Liu, Jingyang Zhang, Hai "Helen" Li, Yiran Chen

TL;DR
This paper introduces an evidence-driven keyframe sampling method for long-form video understanding with MLLMs. Frame selection is grounded in information bottleneck theory, with the aim of improving both efficiency and accuracy.
Contribution
It proposes a novel, principled keyframe sampling framework grounded in information bottleneck theory, with a query-conditioned evidence scoring network for efficient selection.
Findings
Outperforms prior sampling strategies under strict token budgets.
Improves training efficiency for long-form video understanding.
Achieves higher benchmark accuracy than existing methods.
Abstract
Multimodal Large Language Models (MLLMs) have shown strong performance on video question answering, but their application to long-form videos is constrained by limited context length and computational cost, making keyframe sampling essential. Existing approaches typically rely on semantic relevance or reinforcement learning, which either fail to capture evidential clues or suffer from inefficient combinatorial optimization. In this work, we propose an evidence-driven keyframe sampling framework grounded in information bottleneck theory. We formulate keyframe selection as maximizing the conditional mutual information between selected frames and the query, providing a principled objective that reflects each frame's contribution to answering the question. To make this objective tractable, we exploit its structure to derive a decomposed optimization that reduces subset selection to…
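As a rough illustration of what keyframe sampling under a fixed frame budget can look like once each frame has a query-conditioned evidence score, here is a minimal Python sketch. It is not the paper's algorithm: the abstract text available here is cut off before it states what subset selection reduces to, so the sketch simply assumes independent per-frame scores and keeps the highest-scoring frames. The function name `select_keyframes`, the `budget` parameter, and the example scores are all hypothetical.

```python
from typing import List, Sequence


def select_keyframes(scores: Sequence[float], budget: int) -> List[int]:
    """Return indices of the `budget` highest-scoring frames, in temporal order.

    Assumption (not from the paper): each frame has an independent scalar
    evidence score, e.g. produced by some query-conditioned scoring network.
    """
    if budget <= 0:
        return []
    # Rank frame indices by score, highest first; ties go to the earlier frame.
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    # Keep the top `budget` frames, then restore temporal order so the
    # downstream model sees the frames in their original sequence.
    return sorted(ranked[:budget])


# Hypothetical per-frame evidence scores for a 6-frame clip.
scores = [0.10, 0.80, 0.30, 0.90, 0.05, 0.60]
selected = select_keyframes(scores, budget=3)
print(selected)  # [1, 3, 5]
```

Returning the indices in temporal order is a design choice of this sketch; it keeps the selected frames in their original sequence before they are passed to the model.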
