VideoRAG: Retrieval-Augmented Generation with Extreme Long-Context Videos

Xubin Ren; Lingrui Xu; Long Xia; Shuaiqiang Wang; Dawei Yin; Chao Huang

arXiv:2502.01549·cs.IR·February 4, 2025·2 cites

TL;DR

VideoRAG introduces a novel retrieval-augmented generation framework tailored for extremely long videos, combining graph-based knowledge grounding and multi-modal encoding to enhance understanding and processing of multi-hour video content.

Contribution

It is the first framework to enable retrieval-augmented processing of long videos using a dual-channel architecture for semantic and visual integration.

Findings

01 Outperforms existing methods on the LongerVideos benchmark

02 Successfully processes over 160 videos totaling 134+ hours

03 Demonstrates effective multi-modal and cross-video knowledge integration

Abstract

Retrieval-Augmented Generation (RAG) has demonstrated remarkable success in enhancing Large Language Models (LLMs) through external knowledge integration, yet its application has primarily focused on textual content, leaving the rich domain of multi-modal video knowledge predominantly unexplored. This paper introduces VideoRAG, the first retrieval-augmented generation framework specifically designed for processing and understanding extremely long-context videos. Our core innovation lies in its dual-channel architecture that seamlessly integrates (i) graph-based textual knowledge grounding for capturing cross-video semantic relationships, and (ii) multi-modal context encoding for efficiently preserving visual features. This novel design empowers VideoRAG to process unlimited-length videos by constructing precise knowledge graphs that span multiple videos while maintaining semantic…
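
The dual-channel idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the clip records, the entity sets, the two-dimensional "visual" embeddings, and the function names `build_entity_graph` and `retrieve` are all hypothetical, and the additive scoring rule is an assumption made only for illustration. The sketch links clips from different videos through shared entities (the textual channel) and scores them against a query embedding by cosine similarity (the visual channel).

```python
from collections import defaultdict
from math import sqrt

# Hypothetical toy data: each clip carries extracted entities (textual channel)
# and a small embedding vector standing in for visual features (visual channel).
CLIPS = [
    {"id": "v1:c0", "entities": {"transformer", "attention"}, "embedding": [1.0, 0.0]},
    {"id": "v2:c3", "entities": {"attention", "rag"}, "embedding": [0.0, 1.0]},
    {"id": "v3:c1", "entities": {"cooking"}, "embedding": [0.9, 0.1]},
]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_entity_graph(clips):
    """Map each entity to the clips that mention it.

    Clips from different videos that share an entity become linked
    through that entity, which is a minimal stand-in for a knowledge graph.
    """
    graph = defaultdict(set)
    for clip in clips:
        for entity in clip["entities"]:
            graph[entity].add(clip["id"])
    return graph


def retrieve(query_entities, query_embedding, clips, graph, top_k=2):
    """Rank clips by an assumed additive score: entity match plus visual similarity."""
    text_hits = set()
    for entity in query_entities:
        text_hits |= graph.get(entity, set())

    scored = []
    for clip in clips:
        text_score = 1.0 if clip["id"] in text_hits else 0.0
        visual_score = cosine(query_embedding, clip["embedding"])
        scored.append((text_score + visual_score, clip["id"]))

    scored.sort(reverse=True)
    return [clip_id for _, clip_id in scored[:top_k]]


if __name__ == "__main__":
    graph = build_entity_graph(CLIPS)
    print(retrieve({"attention"}, [1.0, 0.0], CLIPS, graph))
```

In this toy setup, a query about "attention" reaches clips in two different videos through the shared entity, while the embedding similarity breaks ties between them. A real system would replace the hand-written entities and vectors with model outputs.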

Code & Models

Repositories

hkuds/videorag
PyTorch · Official

Datasets

atad-tokyo/GST_EGOSERVE
dataset · 22k dl

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Human Pose and Action Recognition

Methods: Attention Is All You Need · Weight Decay · Linear Layer · Byte Pair Encoding · WordPiece · Layer Normalization · Residual Connection · Dense Connections · Attention Dropout