E-VRAG: Enhancing Long Video Understanding with Resource-Efficient Retrieval Augmented Generation

Zeyu Xu; Junkang Zhang; Qiang Wang; Yi Liu

arXiv:2508.01546·cs.CV·August 5, 2025

Open Access

TL;DR

E-VRAG is a resource-efficient retrieval-augmented generation framework for long video understanding. It lowers computational cost and improves accuracy through hierarchical filtering, lightweight scoring, and multi-view question answering.

Contribution

The paper introduces E-VRAG, a novel framework combining hierarchical frame filtering, lightweight VLM scoring, and a global score-based retrieval strategy for efficient long video understanding.

Findings

01 Achieves a 70% reduction in computational cost

02 Outperforms baseline methods in accuracy

03 Operates without additional training

Abstract

Vision-Language Models (VLMs) have enabled substantial progress in video understanding by leveraging cross-modal reasoning capabilities. However, their effectiveness is limited by the restricted context window and the high computational cost required to process long videos with thousands of frames. Retrieval-augmented generation (RAG) addresses this challenge by selecting only the most relevant frames as input, thereby reducing the computational burden. Nevertheless, existing video RAG methods struggle to balance retrieval efficiency and accuracy, particularly when handling diverse and complex video content. To address these limitations, we propose E-VRAG, a novel and efficient video RAG framework for video understanding. We first apply a frame pre-filtering method based on hierarchical query decomposition to eliminate irrelevant frames, reducing computational costs at the data level.…
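The general idea of retrieval for video, selecting only the frames most relevant to a query before passing them to a VLM, can be illustrated with a minimal sketch. This is not E-VRAG's pipeline: the function names `cosine` and `select_frames`, the toy two-dimensional embeddings, and the use of cosine similarity as the relevance score are all assumptions made for illustration. In practice the embeddings would come from some vision-language encoder.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def select_frames(query_vec, frame_vecs, k):
    """Return the indices of the k frames most similar to the query.

    Hypothetical helper: scores every frame against the query, keeps the
    top k, and returns them in temporal (index) order so the downstream
    model sees the frames in their original sequence.
    """
    scores = [cosine(query_vec, f) for f in frame_vecs]
    ranked = sorted(range(len(frame_vecs)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

# Toy embeddings standing in for encoder outputs (assumed values).
frames = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
query = [1.0, 0.0]
selected = select_frames(query, frames, k=2)
print(selected)
```

Returning the indices in temporal order is a design choice of this sketch; a ranking-ordered list would work equally well if the downstream model does not depend on frame order.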

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition