VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos
Xubin Ren, Lingrui Xu, Long Xia, Shuaiqiang Wang, Dawei Yin, Chao Huang

TL;DR
VideoRAG introduces a novel retrieval-augmented generation framework tailored for extremely long videos, combining graph-based knowledge grounding and multi-modal encoding to enhance understanding and processing of multi-hour video content.
Contribution
It is the first framework to enable retrieval-augmented processing of long videos using a dual-channel architecture for semantic and visual integration.
Findings
Outperforms existing methods on the LongerVideos benchmark
Successfully processes over 160 videos totaling 134+ hours
Demonstrates effective multi-modal and cross-video knowledge integration
Abstract
Retrieval-Augmented Generation (RAG) has demonstrated remarkable success in enhancing Large Language Models (LLMs) through external knowledge integration, yet its application has primarily focused on textual content, leaving the rich domain of multi-modal video knowledge predominantly unexplored. This paper introduces VideoRAG, the first retrieval-augmented generation framework specifically designed for processing and understanding extremely long-context videos. Our core innovation lies in its dual-channel architecture that seamlessly integrates (i) graph-based textual knowledge grounding for capturing cross-video semantic relationships, and (ii) multi-modal context encoding for efficiently preserving visual features. This novel design empowers VideoRAG to process unlimited-length videos by constructing precise knowledge graphs that span multiple videos while maintaining semantic…
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Human Pose and Action Recognition
Methods: Attention Is All You Need · Weight Decay · Linear Layer · Byte Pair Encoding · WordPiece · Layer Normalization · Residual Connection · Dense Connections · Attention Dropout
