Event-Anchored Frame Selection for Effective Long-Video Understanding
Wang Chen, Yongdong Luo, Yuhui Zeng, Luojun Lin, Tianyu Xie, Fei Chao, Rongrong Ji, Xiawu Zheng

TL;DR
This paper introduces Event-Anchored Frame Selection (EFS), a hierarchical method that improves long-video understanding by selecting keyframes anchored to semantic events, which raises the accuracy of vision-language models.
Contribution
EFS is a novel, training-free, hierarchical frame selection pipeline that leverages self-supervised embeddings to improve long-video comprehension in vision-language models.
Findings
EFS improves accuracy by up to 8.8% on video understanding benchmarks.
EFS effectively captures semantic events for better frame selection.
EFS is training-free and plugs into existing LVLMs, improving their performance.
Abstract
Massive frame redundancy and limited context window make efficient frame selection crucial for long-video understanding with large vision-language models (LVLMs). Prevailing approaches, however, adopt a flat sampling paradigm which treats the video as an unstructured collection of frames. In this paper, we introduce Event-Anchored Frame Selection (EFS), a hierarchical, event-aware pipeline. Leveraging self-supervised DINO embeddings, EFS first partitions the video stream into visually homogeneous temporal segments, which serve as proxies for semantic events. Within each event, it then selects the most query-relevant frame as an anchor. These anchors act as structural priors that guide a global refinement stage using an adaptive Maximal Marginal Relevance (MMR) scheme. This pipeline ensures the final keyframe set jointly optimizes for event coverage, query relevance, and visual…
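The three stages described in the abstract (event partitioning, per-event anchor selection, and MMR-based global refinement) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the per-frame embeddings (DINO in the paper) and the per-frame query-relevance scores are already computed and passed in as arrays. The function names, the boundary rule (a drop in cosine similarity between consecutive frames), the `boundary_thresh` and `lam` values, and the use of plain fixed-weight MMR in place of the paper's adaptive scheme are all assumptions made for illustration.

```python
import numpy as np


def _normalize(x):
    # L2-normalize rows so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)


def segment_events(frame_emb, boundary_thresh=0.8):
    """Split frames into contiguous segments (event proxies).

    Assumption: a boundary is placed wherever the cosine similarity
    between consecutive frames falls below `boundary_thresh`.
    Returns a list of half-open (start, end) index ranges.
    """
    e = _normalize(frame_emb)
    sims = np.sum(e[:-1] * e[1:], axis=1)
    cuts = np.where(sims < boundary_thresh)[0] + 1
    bounds = [0, *cuts.tolist(), len(e)]
    return [(bounds[i], bounds[i + 1]) for i in range(len(bounds) - 1)]


def select_anchors(segments, relevance):
    """Pick the most query-relevant frame index inside each segment."""
    return [s + int(np.argmax(relevance[s:t])) for s, t in segments]


def mmr_refine(frame_emb, relevance, anchors, budget, lam=0.7):
    """Fill the frame budget with standard MMR, seeded by the anchors.

    Assumption: fixed trade-off `lam`; the paper's adaptive weighting
    is not reproduced here. If there are more anchors than the budget
    allows, only the most relevant anchors are kept.
    """
    e = _normalize(frame_emb)
    selected = sorted(anchors, key=lambda i: -relevance[i])[:budget]
    remaining = set(range(len(e))) - set(selected)
    while len(selected) < budget and remaining:
        cand = np.array(sorted(remaining))
        # Redundancy: highest similarity to any already-selected frame.
        max_sim = (e[cand] @ e[selected].T).max(axis=1)
        scores = lam * relevance[cand] - (1 - lam) * max_sim
        best = int(cand[np.argmax(scores)])
        selected.append(best)
        remaining.remove(best)
    return sorted(selected)
```

Seeding the MMR loop with the anchors is what makes the selection hierarchical in this sketch: every detected segment contributes at least one frame before the remaining budget is spent on frames that balance relevance against redundancy.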
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning
