SceneRAG: Scene-level Retrieval-Augmented Generation for Video Understanding
Nianbo Zeng, Haowen Hou, Fei Richard Yu, Si Shi, Ying Tiffany He

TL;DR
SceneRAG introduces a scene-level retrieval-augmented generation framework for video understanding, effectively capturing long-range dependencies by segmenting videos into coherent scenes and fusing visual and textual data.
Contribution
The paper proposes SceneRAG, a novel approach that segments videos into narrative-consistent scenes and integrates multi-modal information for improved long-form video understanding.
Findings
Outperforms prior methods on the LongerVideos benchmark
Achieves a win rate of up to 72.5% on generation tasks
Effectively captures long-range dependencies in videos
Abstract
Despite recent advances in retrieval-augmented generation (RAG) for video understanding, effectively understanding long-form video content remains underexplored due to the vast scale and high complexity of video data. Current RAG approaches typically segment videos into fixed-length chunks, which often disrupts the continuity of contextual information and fails to capture authentic scene boundaries. Inspired by the human ability to naturally organize continuous experiences into coherent scenes, we present SceneRAG, a unified framework that leverages large language models to segment videos into narrative-consistent scenes by processing ASR transcripts alongside temporal metadata. SceneRAG further sharpens these initial boundaries through lightweight heuristics and iterative correction. For each scene, the framework fuses information from both visual and textual modalities to extract…
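To illustrate the contrast the abstract draws between fixed-length chunks and scene-level segments, here is a minimal Python sketch. It is not the authors' implementation: SceneRAG uses a large language model over ASR transcripts and temporal metadata, whereas this sketch swaps in a simple pause-length heuristic as a stand-in. The `Segment` type, the function names, the sample transcript, and the `window` and `max_gap` values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Segment:
    """One ASR transcript segment with its timestamps in seconds (hypothetical type)."""
    start: float
    end: float
    text: str


def fixed_length_chunks(segments: List[Segment], window: float) -> List[List[Segment]]:
    """Baseline: bucket each segment by its start time into windows of equal length."""
    buckets: Dict[int, List[Segment]] = {}
    for seg in segments:
        buckets.setdefault(int(seg.start // window), []).append(seg)
    return [buckets[key] for key in sorted(buckets)]


def pause_based_scenes(segments: List[Segment], max_gap: float) -> List[List[Segment]]:
    """Stand-in heuristic (an assumption, not the paper's LLM segmentation):
    start a new scene when the silence between consecutive segments exceeds max_gap."""
    scenes: List[List[Segment]] = []
    for seg in segments:
        if scenes and seg.start - scenes[-1][-1].end <= max_gap:
            scenes[-1].append(seg)
        else:
            scenes.append([seg])
    return scenes


# Made-up transcript: three sentences on one topic, a pause, then a second topic.
transcript = [
    Segment(0.0, 12.0, "Welcome to the lecture."),
    Segment(12.5, 28.0, "Today we cover retrieval."),
    Segment(28.5, 41.0, "Retrieval finds relevant context."),
    Segment(48.0, 59.0, "Now a different topic: generation."),
    Segment(59.5, 70.0, "Generation produces the answer."),
]

fixed = fixed_length_chunks(transcript, window=20.0)
scenes = pause_based_scenes(transcript, max_gap=5.0)

print([len(chunk) for chunk in fixed])
print([len(scene) for scene in scenes])
```

With these sample values, the 20-second windows split the three opening sentences across two chunks, while the pause-based grouping keeps them together in a single scene and places the second topic in its own scene.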
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
Methods: Linear Layer · Byte Pair Encoding · Attention Is All You Need · WordPiece · Weight Decay · Multi-Head Attention · Attention Dropout · Dropout · Dense Connections
