FOCUS: Efficient Keyframe Selection for Long Video Understanding
Zirui Zhu, Hailun Xu, Yang Luo, Yong Liu, Kanchan Sarkar, Zhenheng Yang, Yang You

TL;DR
FOCUS is a training-free, model-agnostic keyframe selection method that efficiently identifies informative frames in long videos, significantly improving accuracy in long-video question answering while processing less than 2% of frames.
Contribution
It introduces a novel combinatorial pure-exploration approach for keyframe selection, enabling scalable long-video understanding with theoretical guarantees.
Findings
Achieves an 11.9% accuracy gain on LongVideoBench for videos longer than 20 minutes.
Processes less than 2% of video frames while maintaining high accuracy.
Provides a simple, general solution for scalable long-video understanding with MLLMs.
Abstract
Multimodal large language models (MLLMs) represent images and video frames as visual tokens. Scaling from single images to hour-long videos, however, inflates the token budget far beyond practical limits. Popular pipelines therefore either uniformly subsample or apply keyframe selection with retrieval-style scoring using smaller vision-language models. However, these keyframe selection methods still rely on pre-filtering before selection to reduce the inference cost and can miss the most informative moments. We propose FOCUS, Frame-Optimistic Confidence Upper-bound Selection, a training-free, model-agnostic keyframe selection module that selects query-relevant frames under a strict token budget. FOCUS formulates keyframe selection as a combinatorial pure-exploration (CPE) problem in multi-armed bandits: it treats short temporal clips as arms, and uses empirical means and Bernstein…
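The bandit framing in the abstract (clips as arms, empirical means plus a Bernstein-type confidence term) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `score_frame` callable, the UCB-V style constants in the radius, and the purely sequential pull loop are all assumptions made for illustration.

```python
import math
import random


def bernstein_radius(values, log_term, value_range=1.0):
    """Empirical-Bernstein (UCB-V style) confidence radius for one arm.

    The constants 2 and 3 follow the common UCB-V form and are an assumption
    here, not taken from the paper.
    """
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return math.sqrt(2.0 * var * log_term / n) + 3.0 * value_range * log_term / n


def select_clips(score_frame, num_frames, clip_len, pulls, top_k, seed=0):
    """Explore clips (arms) optimistically, then keep the top_k by empirical mean.

    score_frame is a hypothetical callable mapping a frame index to a relevance
    score in [0, 1], e.g. a small vision-language model's query-frame score.
    """
    rng = random.Random(seed)
    # Partition the video into non-overlapping fixed-length clips.
    clips = [range(s, min(s + clip_len, num_frames))
             for s in range(0, num_frames, clip_len)]
    samples = [[] for _ in clips]

    # Initialise: score two random frames per clip so a variance estimate exists.
    for i, clip in enumerate(clips):
        for _ in range(2):
            samples[i].append(score_frame(rng.choice(clip)))

    def upper_bound(i, t):
        vals = samples[i]
        return sum(vals) / len(vals) + bernstein_radius(vals, math.log(t + 1))

    # Spend the remaining scoring budget on the most optimistic clip each step.
    for t in range(2 * len(clips), pulls):
        best = max(range(len(clips)), key=lambda i: upper_bound(i, t))
        samples[best].append(score_frame(rng.choice(clips[best])))

    # Final recommendation: rank clips by empirical mean relevance.
    ranked = sorted(range(len(clips)),
                    key=lambda i: sum(samples[i]) / len(samples[i]),
                    reverse=True)
    return sorted(ranked[:top_k])
```

Here `pulls` is the total number of frames scored, so it plays the role of the scoring budget; keyframes would then be drawn from the returned clip indices.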
Peer Reviews
Decision: ICLR 2026 Poster
1. The paper is original in formulating keyframe selection for long-video understanding as a combinatorial pure-exploration multi-armed bandit problem. This is a novel and reasonable perspective that provides new theoretical and algorithmic insights for researchers working on efficient video representation and token budgeting.
2. The proposed two-stage coarse-to-fine procedure effectively addresses the non-parallelizable nature of sequential arm-pulling and updating, providing a practical soluti
1. In Section 2.2, the paper assumes that “frame-level utility within the same arm share the same distribution.” It is unclear how this assumption is ensured in practice, especially regarding how the M non-overlapping fixed-length clips are partitioned. For instance, when the video contains shot changes or scene transitions, it is not clear how these are handled or whether the authors explored alternative segmentation strategies.
2. The experiments are limited to LongVideoBench and Video-MME, bo
- The method seems quite effective at selecting frames.
- It seems to work well across different MLLMs.
- It achieves better accuracy than the AKS method while improving efficiency.
- The authors present the method as model-agnostic; however, they appear to leverage BLIP for frame relevance scoring to compute their latent frame-level utility. Even if only 1% of the frames are processed through BLIP, the method still relies on a model, making the claim of model-agnosticism questionable. This point should have been better explained in the paper.
- Lack of comparison with training-based methods.
- More benchmarks could have been added, such as MLVU, NextQA, and MVBench.
- Typos, abstract "within
- **Conceptually elegant and efficient.** The formulation connects keyframe selection with variance-adaptive exploration in MABs, offering a lightweight theoretical lens for efficient inference. The two-stage batched procedure is practical and easily parallelizable, reducing GPU cost by 40–60% compared to AKS.
- **Training-free and modular.** The pipeline is plug-and-play, requires no fine-tuning, and integrates smoothly into existing LVLM inference workflows.
- **Empirically consistent.** Eva
- **Incremental novelty.** The “combinatorial pure-exploration bandit” framing is conceptually sound but reuses standard UCB-V principles with minimal adaptation to video reasoning. Similar adaptive sampling ideas have appeared in AKS, Q-Frame, T* and Frame-Voyager.
- **Weak theoretical substance & Limited methodological depth.** The regret bound assumes i.i.d. frame rewards and bounded noise, which do not hold in temporally correlated videos. The theoretical claim does not extend to frame-leve
Taxonomy
Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Domain Adaptation and Few-Shot Learning
