E-VRAG: Enhancing Long Video Understanding with Resource-Efficient Retrieval Augmented Generation
Zeyu Xu, Junkang Zhang, Qiang Wang, Yi Liu

TL;DR
E-VRAG is a resource-efficient retrieval-augmented generation framework for long video understanding. It lowers computational cost and improves accuracy through hierarchical filtering, lightweight scoring, and multi-view question answering.
Contribution
The paper introduces E-VRAG, a novel framework combining hierarchical frame filtering, lightweight VLM scoring, and a global score-based retrieval strategy for efficient long video understanding.
Findings
Achieves 70% reduction in computational cost
Outperforms baseline methods in accuracy
Operates without additional training
Abstract
Vision-Language Models (VLMs) have enabled substantial progress in video understanding by leveraging cross-modal reasoning capabilities. However, their effectiveness is limited by the restricted context window and the high computational cost required to process long videos with thousands of frames. Retrieval-augmented generation (RAG) addresses this challenge by selecting only the most relevant frames as input, thereby reducing the computational burden. Nevertheless, existing video RAG methods struggle to balance retrieval efficiency and accuracy, particularly when handling diverse and complex video content. To address these limitations, we propose E-VRAG, a novel and efficient video RAG framework for video understanding. We first apply a frame pre-filtering method based on hierarchical query decomposition to eliminate irrelevant frames, reducing computational costs at the data level.…
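The general idea in the abstract, discarding irrelevant frames cheaply and then passing only the most relevant ones to the VLM, can be illustrated with a minimal sketch. This is not the authors' implementation: the function `retrieve_frames`, its parameters (`cheap_score`, `fine_score`, `threshold`, `k`), and the toy scores are all hypothetical, and the two scoring callables stand in for whatever pre-filter and relevance scorer a real system would use.

```python
from typing import Callable, List, Sequence


def retrieve_frames(
    frame_ids: Sequence[int],
    cheap_score: Callable[[int], float],  # hypothetical low-cost relevance estimate
    fine_score: Callable[[int], float],   # hypothetical costlier relevance score
    threshold: float,
    k: int,
) -> List[int]:
    """Two-stage frame selection (illustrative only, not E-VRAG itself).

    Stage 1 drops frames whose cheap score is below `threshold`.
    Stage 2 ranks the survivors by the fine score and keeps the top `k`,
    returned in temporal order so the model sees frames chronologically.
    """
    # Stage 1: pre-filter, so the costlier scorer runs on fewer frames.
    survivors = [i for i in frame_ids if cheap_score(i) >= threshold]

    # Stage 2: keep the k highest-scoring survivors.
    top_k = sorted(survivors, key=fine_score, reverse=True)[:k]

    # Restore temporal order before handing the frames to the model.
    return sorted(top_k)


if __name__ == "__main__":
    # Toy scores for a five-frame "video" (made-up numbers).
    cheap = {0: 0.10, 1: 0.90, 2: 0.40, 3: 0.80, 4: 0.05}
    fine = {1: 0.2, 2: 0.7, 3: 0.9}  # only needed for frames that survive stage 1

    selected = retrieve_frames(range(5), cheap.__getitem__, fine.__getitem__,
                               threshold=0.3, k=2)
    print(selected)  # [2, 3]
```

In this toy run the costlier scorer is called on three of the five frames rather than all of them, which is where a pre-filter of this kind saves computation.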
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition
