SceneRAG: Scene-level Retrieval-Augmented Generation for Video Understanding

Nianbo Zeng; Haowen Hou; Fei Richard Yu; Si Shi; Ying Tiffany He

arXiv:2506.07600 · cs.CV · June 10, 2025

TL;DR

SceneRAG introduces a scene-level retrieval-augmented generation framework for video understanding, effectively capturing long-range dependencies by segmenting videos into coherent scenes and fusing visual and textual data.

Contribution

The paper proposes SceneRAG, a novel approach that segments videos into narrative-consistent scenes and integrates multi-modal information for improved long-form video understanding.

Findings

01 Outperforms prior methods on the LongerVideos benchmark

02 Achieves up to 72.5% win rate on generation tasks

03 Effectively captures long-range dependencies in videos

Abstract

Despite recent advances in retrieval-augmented generation (RAG) for video understanding, effectively understanding long-form video content remains underexplored due to the vast scale and high complexity of video data. Current RAG approaches typically segment videos into fixed-length chunks, which often disrupts the continuity of contextual information and fails to capture authentic scene boundaries. Inspired by the human ability to naturally organize continuous experiences into coherent scenes, we present SceneRAG, a unified framework that leverages large language models to segment videos into narrative-consistent scenes by processing ASR transcripts alongside temporal metadata. SceneRAG further sharpens these initial boundaries through lightweight heuristics and iterative correction. For each scene, the framework fuses information from both visual and textual modalities to extract…
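To make the contrast in the abstract concrete, here is a minimal sketch of fixed-length chunking versus boundary-based scene grouping over timestamped ASR segments. It is not the authors' implementation. The `Segment` type, the function names, the proposed boundary times (standing in for what an LLM might return), and the "snap to the nearest segment start, then drop scenes shorter than `min_len`" rule are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One ASR transcript segment with its time span in seconds (hypothetical type)."""
    start: float
    end: float
    text: str


def fixed_chunks(segments, window):
    """Baseline: bucket segments into fixed-length time windows by start time."""
    buckets = {}
    for seg in segments:
        buckets.setdefault(int(seg.start // window), []).append(seg)
    return [buckets[k] for k in sorted(buckets)]


def snap_boundaries(segments, proposed, min_len):
    """Assumed heuristic: snap each proposed boundary to the nearest segment
    start, then drop any boundary that would create a scene shorter than min_len."""
    starts = [s.start for s in segments]
    snapped = sorted({min(starts, key=lambda t: abs(t - b)) for b in proposed})
    kept = [starts[0]]  # the first scene always begins with the first segment
    for b in snapped:
        if b - kept[-1] >= min_len:
            kept.append(b)
    # If the final scene is too short, merge it into the previous one.
    if len(kept) > 1 and segments[-1].end - kept[-1] < min_len:
        kept.pop()
    return kept


def group_scenes(segments, boundaries):
    """Assign each segment to the last boundary at or before its start time."""
    scenes = [[] for _ in boundaries]
    for seg in segments:
        idx = max(i for i, b in enumerate(boundaries) if b <= seg.start)
        scenes[idx].append(seg)
    return scenes


# Toy transcript (invented data, not taken from the paper).
segments = [
    Segment(0.0, 8.0, "Welcome to the course."),
    Segment(8.0, 27.0, "Today we cover gradient descent."),
    Segment(27.0, 41.0, "First, recall what a derivative is."),
    Segment(41.0, 58.0, "Now let us switch to the coding demo."),
    Segment(58.0, 75.0, "Open the notebook and import numpy."),
    Segment(75.0, 90.0, "That wraps up the demo."),
]

# Hypothetical boundary times, e.g. as an LLM might propose them.
proposed = [39.5, 44.0, 88.0]

chunks = fixed_chunks(segments, window=30)
boundaries = snap_boundaries(segments, proposed, min_len=20)
scenes = group_scenes(segments, boundaries)
```

On this toy transcript the 30-second windows split the coding demo across two chunks, while the snapped boundaries keep the lecture part and the demo part together as two scenes.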

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning

Methods: Linear Layer · Byte Pair Encoding · Attention Is All You Need · WordPiece · Weight Decay · Multi-Head Attention · Attention Dropout · Dropout · Dense Connections