KTV: Keyframes and Key Tokens Selection for Efficient Training-Free Video LLMs
Baiyang Song, Jun Peng, Yuxin Zhang, Guangyao Chen, Feidiao Yang, Jianyuan Guo

TL;DR
KTV is a two-stage, training-free method for efficient video understanding: it first selects keyframes and then key visual tokens, reducing redundancy and computational load while improving accuracy on video question-answering tasks.
Contribution
The paper proposes KTV, a novel framework that combines question-agnostic keyframe clustering with token importance-based pruning to enhance training-free video comprehension.
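The two stages named above can be illustrated with a minimal sketch. This is not the authors' implementation: the summary does not say which clustering algorithm, distance metric, or importance score KTV uses. The code below assumes plain k-means over frame features with the frame nearest each centroid taken as the keyframe, and it assumes per-token importance scores are already given. The names `select_keyframes` and `prune_tokens` are hypothetical.

```python
import numpy as np


def select_keyframes(frame_features: np.ndarray, num_keyframes: int,
                     num_iters: int = 20, seed: int = 0) -> list[int]:
    """Question-agnostic selection (assumed k-means): cluster frame features
    and return, per non-empty cluster, the frame closest to its centroid."""
    rng = np.random.default_rng(seed)
    n = frame_features.shape[0]
    k = min(num_keyframes, n)
    # Initialise centroids from k distinct frames.
    centroids = frame_features[rng.choice(n, size=k, replace=False)].astype(float)
    for _ in range(num_iters):
        # Squared Euclidean distance from every frame to every centroid: (n, k).
        dists = ((frame_features[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        for c in range(k):
            members = frame_features[labels == c]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[c] = members.mean(axis=0)
    # Final assignment against the converged centroids.
    dists = ((frame_features[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    labels = dists.argmin(axis=1)
    keyframes = []
    for c in range(k):
        member_idx = np.flatnonzero(labels == c)
        if member_idx.size:
            keyframes.append(int(member_idx[dists[member_idx, c].argmin()]))
    # Return in temporal order; may hold fewer than k indices if clusters emptied.
    return sorted(keyframes)


def prune_tokens(tokens: np.ndarray, importance: np.ndarray, keep: int) -> np.ndarray:
    """Keep the `keep` highest-scoring tokens, preserving their original order.
    How `importance` is computed is left open here (an assumption)."""
    keep = min(keep, tokens.shape[0])
    top = np.argsort(importance)[-keep:]
    return tokens[np.sort(top)]
```

In this sketch the keyframe step never sees the question, which matches the question-agnostic description, while the pruning step simply keeps the top-scoring tokens of each selected frame.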
Findings
Outperforms state-of-the-art training-free methods on Multiple-Choice VideoQA.
Uses only 504 visual tokens for a 60-minute video, significantly reducing computational cost.
Achieves 44.8% accuracy on the MLVU-Test benchmark.
Abstract
Training-free video understanding leverages the strong image comprehension capabilities of pre-trained vision language models (VLMs) by treating a video as a sequence of static frames, thus obviating the need for costly video-specific training. However, this paradigm often suffers from severe visual redundancy and high computational overhead, especially when processing long videos. Crucially, existing keyframe selection strategies, especially those based on CLIP similarity, are prone to biases and may inadvertently overlook critical frames, resulting in suboptimal video comprehension. To address these significant challenges, we propose KTV, a novel two-stage framework for efficient and effective training-free video understanding. In the first stage, KTV performs question-agnostic keyframe selection by clustering frame-level visual features, yielding a compact, diverse, and…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis
