VSI: Visual Subtitle Integration for Keyframe Selection to enhance Long Video Understanding
Jianxiang He, Meisheng Hong, Jungang Li, Weiyu Guo, Xuming Hu, Hui Xiong

TL;DR
VSI is a multimodal keyframe retrieval framework that combines visual and subtitle information to improve long video understanding and keyframe selection, outperforming existing methods in accuracy and generalization.
Contribution
The paper introduces VSI, a novel multimodal approach integrating visual and subtitle data for more accurate and adaptable keyframe retrieval in long videos.
Findings
VSI achieves state-of-the-art accuracy in keyframe retrieval.
VSI performs strongly on text-related tasks, which keyframe search based on the visual modality alone handles poorly.
VSI demonstrates strong generalization across various tasks.
Abstract
Multimodal large language models (MLLMs) demonstrate exceptional performance in vision-language tasks, yet their processing of long videos is constrained by input context length and high computational costs. Sparse frame sampling thus becomes a necessary preprocessing step, with sampled frame quality directly impacting downstream performance. Existing keyframe search algorithms achieve a balance between efficiency and sampled frame quality but rely heavily on the visual modality alone. This makes them difficult to adapt to text-related tasks and often leads to retrieval results that deviate from the core semantic content. To address this, we propose VISUAL-SUBTITLE INTEGRATION (VSI), a multimodal keyframe retrieval framework. It employs a dual-branch collaborative retrieval approach combining Video Search and Subtitle Match to fuse complementary visual and textual information for precise…
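The abstract is cut off before it explains how the two branches are combined, so the following is only an illustrative sketch and not the authors' method. It assumes that each branch (Video Search and Subtitle Match) has already produced one relevance score per candidate frame, and it fuses them with a simple weighted sum after min-max normalization. The function names `min_max_normalize` and `fuse_and_select`, the weight `alpha`, and the choice of a weighted sum are all hypothetical.

```python
from typing import List, Sequence


def min_max_normalize(scores: Sequence[float]) -> List[float]:
    """Rescale scores to [0, 1] so the two branches are comparable."""
    if not scores:
        return []
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All scores are equal: no frame is preferred by this branch.
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def fuse_and_select(
    visual_scores: Sequence[float],
    subtitle_scores: Sequence[float],
    k: int,
    alpha: float = 0.5,
) -> List[int]:
    """Return the indices of the k frames with the highest fused score.

    ASSUMPTION: fusion is a weighted sum, alpha * visual + (1 - alpha) * subtitle.
    The paper's actual fusion rule is not given in the truncated abstract.
    """
    if len(visual_scores) != len(subtitle_scores):
        raise ValueError("both branches must score the same set of frames")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")

    v = min_max_normalize(visual_scores)
    t = min_max_normalize(subtitle_scores)
    fused = [alpha * a + (1.0 - alpha) * b for a, b in zip(v, t)]

    # Rank frames by fused score, keep the top k, and return them in
    # temporal order so they can be passed to a downstream model as a clip.
    ranked = sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    # Hypothetical per-frame scores for a four-frame video.
    visual = [0.1, 0.9, 0.2, 0.3]
    subtitle = [0.0, 0.1, 0.8, 0.2]
    print(fuse_and_select(visual, subtitle, k=2))
```

Setting `alpha=1.0` reduces the sketch to visual-only retrieval, the setting the abstract describes as hard to adapt to text-related tasks, while `alpha=0.0` ranks frames by subtitle relevance alone.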
