Threading Keyframe with Narratives: MLLMs as Strong Long Video Comprehenders
Bo Fang, Wenhao Wu, Qiangqiang Wu, Yuxin Song, Antoni B. Chan

TL;DR
This paper introduces Nar-KFC, a modular approach that enhances long video understanding by selecting keyframes through optimization and supplementing them with generated narratives, improving MLLM performance efficiently.
Contribution
It proposes a novel keyframe selection and narrative insertion method, addressing computational challenges and temporal discontinuity in long video comprehension with MLLMs.
Findings
Significant performance improvements on multiple benchmarks.
Efficient keyframe selection via greedy search.
Enhanced coherence with narrative insertion.
Abstract
Employing Multimodal Large Language Models (MLLMs) for long video understanding remains a challenging problem due to the dilemma between the substantial number of video frames (i.e., visual tokens) versus the limited context length of language models. Traditional uniform sampling often leads to selection of irrelevant content, while post-training MLLMs on thousands of frames imposes a substantial computational burden. In this paper, we propose threading keyframes with narratives (Nar-KFC), a plug-and-play module to facilitate effective and efficient long video perception. Nar-KFC generally involves two collaborative steps. First, we formulate the keyframe selection process as an integer quadratic programming problem, jointly optimizing query-relevance and frame-diversity. To avoid its computational complexity, a customized greedy search strategy is designed as an efficient alternative.…
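The abstract names the ingredients of the selection step (query relevance, frame diversity, and a greedy alternative to solving the integer quadratic program) but not the exact objective. The sketch below is one plausible reading, not the authors' algorithm. It assumes the objective is the sum of the relevance scores of the selected frames minus `lam` times the sum of their pairwise similarities, subject to selecting exactly `k` frames. The function name `greedy_keyframes`, the weight `lam`, and the seeding rule are all hypothetical.

```python
import numpy as np

def greedy_keyframes(relevance, similarity, k, lam=1.0):
    """Greedily pick k frame indices (illustrative sketch, not the paper's method).

    relevance:  (n,) query-frame relevance scores, higher = more relevant.
    similarity: (n, n) symmetric frame-frame similarity, higher = more alike.
    lam:        hypothetical weight trading relevance against redundancy.
    """
    relevance = np.asarray(relevance, dtype=float)
    similarity = np.asarray(similarity, dtype=float)
    n = relevance.shape[0]
    if n == 0 or k <= 0:
        return []
    k = min(k, n)

    # Assumed seeding rule: start from the single most relevant frame.
    selected = [int(np.argmax(relevance))]
    # Running sum of similarity between every frame and the selected set.
    sim_to_selected = similarity[selected[0]].copy()

    while len(selected) < k:
        # Marginal gain of adding a frame: its relevance minus its
        # redundancy with the frames already chosen.
        gain = relevance - lam * sim_to_selected
        gain[selected] = -np.inf  # never pick the same frame twice
        best = int(np.argmax(gain))
        selected.append(best)
        sim_to_selected += similarity[best]

    return sorted(selected)  # return indices in temporal order


# Toy example: frames 0 and 1 are near-duplicates, frame 3 is relevant and distinct.
rel = [0.9, 0.85, 0.2, 0.8]
sim = [
    [1.00, 0.95, 0.10, 0.10],
    [0.95, 1.00, 0.10, 0.10],
    [0.10, 0.10, 1.00, 0.20],
    [0.10, 0.10, 0.20, 1.00],
]
print(greedy_keyframes(rel, sim, k=2))  # [0, 3]
```

In the toy example, frame 1 is skipped despite its high relevance because it is nearly identical to frame 0. This is the behavior a joint relevance-diversity objective is meant to produce. With `lam=0` the procedure reduces to plain top-k selection by relevance.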
Peer Reviews
Decision·ICLR 2026 Poster
1. Well-written and easy to follow.
2. Keyframe selection is formulated as an integer quadratic programming problem with a customized greedy search, providing clear theoretical grounding rather than relying on heuristic rules.
3. The plug-and-play design is flexible and compatible with various MLLMs, and it is training-free, which reduces computation cost and overfitting risk.
4. Demonstrates consistent performance gains across different models (e.g., InternVL2, Qwen2-VL) and model sizes, showing its g
My main concerns are with the experiments:
1. The number of benchmarks is limited. As a training-free module, Nar-KFC needs more diverse benchmarks to convincingly demonstrate its generality and robustness.
2. The base models are outdated: the strongest one, LLaVA-Video, is already a year old. Including 1–2 more recent VLMs (e.g., Qwen2.5-VL, Qwen3-VL, Intern3-VL) would strengthen the claims.
3. Some baselines in Table 1 seem questionable; for instance, LLaVA-Video’s VideoMME performance sho
1. This paper presents Nar-KFC, a hybrid representation method that interleaves visual keyframes with textual narratives, offering a novel perspective on video compression and long-video understanding. The authors formulate keyframe selection as an Integer Quadratic Programming (IQP) problem, systematically optimizing both query relevance and frame diversity, making it a more principled alternative to heuristic approaches.
2. The proposed method holds strong practical value: it is a plug-and-play
1. A significant potential risk of this method is that the lightweight captioner used for narrative generation may introduce errors or hallucinations. More importantly, this captioner operates in a "query-agnostic" manner; it merely describes the content of non-keyframe segments, and this content may be entirely irrelevant to the user's specific query. This results in the Nar-KFC method actively injecting a substantial volume of irrelevant noise into the MLLM's context. If a given narrative hap
* The paper is clearly written and well-structured.
* Nar-KFC demonstrates consistent performance gains across different MLLMs, effectively validating its utility.
* Exploring improved frame sampling strategies for long video understanding is a valuable research direction.
* Unlike token compression approaches, Nar-KFC relies primarily on frame selection for information compression. The optimization criteria, i.e. frame-level diversity and query-frame relevance, are computed at the global frame level. This raises concerns about scenarios involving subtle or localized changes (e.g., small objects evolving over time), where high overall frame similarity might cause truly informative frames to be inadvertently discarded, as such nuances may not be captured by coarse
Taxonomy
Topics: Natural Language Processing Techniques
