VideoRAG: Retrieval-Augmented Generation over Video Corpus

Soyeong Jeong; Kangsan Kim; Jinheon Baek; Sung Ju Hwang

arXiv:2501.05874 · cs.CV · May 30, 2025

TL;DR

VideoRAG introduces a retrieval-augmented framework for videos that dynamically retrieves relevant videos and utilizes both visual and textual data, enhancing multimodal response generation with large video language models.

Contribution

It presents a novel retrieval-augmented approach for videos, incorporating dynamic retrieval, multimodal processing, and frame selection strategies using large video language models.

Findings

1. Outperforms relevant baselines in response quality
2. Effective retrieval of relevant videos based on queries
3. Improved multimodal understanding with frame selection and text extraction

Abstract

Retrieval-Augmented Generation (RAG) is a powerful strategy for improving the factual accuracy of models by retrieving external knowledge relevant to queries and incorporating it into the generation process. However, existing approaches primarily focus on text, with some recent advancements considering images, and they largely overlook videos, a rich source of multimodal knowledge capable of representing contextual details more effectively than any other modality. While very recent studies explore the use of videos in response generation, they either predefine query-associated videos without retrieval or convert videos into textual descriptions, losing multimodal richness. To address these limitations, we introduce VideoRAG, a framework that not only dynamically retrieves videos based on their relevance to queries but also utilizes both visual and textual information. The operation of VideoRAG is…
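The retrieval step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `VideoEntry` type, the toy two-dimensional embeddings, the `retrieve` helper, and the `alpha` weight that blends textual and visual similarity are all hypothetical, chosen only to show how a query might be ranked against a video corpus using both modalities.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class VideoEntry:
    """Hypothetical corpus record: one embedding per modality."""
    video_id: str
    visual_emb: List[float]
    text_emb: List[float]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_emb: List[float], corpus: List[VideoEntry],
             k: int = 1, alpha: float = 0.5) -> List[VideoEntry]:
    """Rank videos by a weighted mix of textual and visual similarity.

    alpha is an assumed blending weight (1.0 = text only, 0.0 = visual only).
    """
    def score(entry: VideoEntry) -> float:
        return (alpha * cosine(query_emb, entry.text_emb)
                + (1 - alpha) * cosine(query_emb, entry.visual_emb))

    return sorted(corpus, key=score, reverse=True)[:k]


# Toy corpus with made-up embeddings, for illustration only.
corpus = [
    VideoEntry("cooking_demo", visual_emb=[1.0, 0.0], text_emb=[0.9, 0.1]),
    VideoEntry("bike_repair", visual_emb=[0.0, 1.0], text_emb=[0.1, 0.9]),
]
top = retrieve([0.0, 1.0], corpus, k=1)
print(top[0].video_id)
```

In a full pipeline, the retrieved videos (frames plus any associated text) would then be passed, together with the query, to a large video language model for response generation; that step is omitted here because it depends on the specific model used.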

Code & Models

Repositories

starsuzi/videorag (PyTorch, official)

Videos

VideoRAG: Retrieval-Augmented Generation over Video Corpus · underline

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Topic Modeling

Methods: Attention Is All You Need · Focus · Layer Normalization · Dense Connections · Linear Warmup With Linear Decay · WordPiece · Attention Dropout · Adam · Residual Connection