RAVU: Retrieval Augmented Video Understanding with Compositional Reasoning over Graph
Sameer Malik, Moyuru Yamada, Ayush Singh, Dishank Aggarwal

TL;DR
RAVU introduces a retrieval-augmented framework that uses compositional reasoning over a spatio-temporal graph to improve understanding of long videos and answer complex queries more accurately.
Contribution
It presents a novel graph-based retrieval method with reasoning capabilities for enhanced long video comprehension in large multi-modal models.
Findings
Outperforms state-of-the-art methods on NExT-QA and EgoSchema datasets.
Effective with limited retrieved frames (5-10).
Enables multi-hop reasoning and object tracking across frames.
Abstract
Comprehending long videos remains a significant challenge for Large Multi-modal Models (LMMs). Current LMMs struggle to process videos that last even minutes to hours because they lack explicit memory and retrieval mechanisms. To address this limitation, we propose RAVU (Retrieval Augmented Video Understanding), a novel framework for video understanding enhanced by retrieval with compositional reasoning over a spatio-temporal graph. We construct a graph representation of the video, capturing both spatial and temporal relationships between entities. This graph serves as a long-term memory, allowing us to track objects and their actions across time. To answer complex queries, we decompose the queries into a sequence of reasoning steps and execute these steps on the graph, retrieving relevant key information. Our approach enables more accurate understanding of long videos, particularly for…
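The graph-as-memory idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the `Relation` record, the `VideoGraph` class, the `execute_plan` function, the plan format, and all entity and predicate names are hypothetical, and the graph is filled in by hand rather than extracted from a video. It only shows how a query such as "what does the person do with the cup after picking it up?" might be decomposed into two hops that are executed over per-frame relations, returning a small set of frames to pass to a model.

```python
from typing import NamedTuple


class Relation(NamedTuple):
    """One spatial relation observed in one frame (hypothetical schema)."""
    frame: int
    subj: str
    pred: str
    obj: str


class VideoGraph:
    """Toy spatio-temporal graph: entity ids shared across frames act as
    the temporal links, and relations act as the spatial edges."""

    def __init__(self, relations):
        # Tuples sort by frame first, so matches come back in time order.
        self.relations = sorted(relations)

    def match(self, subj=None, pred=None, obj=None, after=None):
        """Return relations matching the given fields, optionally only
        those in frames strictly later than `after`."""
        hits = []
        for r in self.relations:
            if subj is not None and r.subj != subj:
                continue
            if pred is not None and r.pred != pred:
                continue
            if obj is not None and r.obj != obj:
                continue
            if after is not None and r.frame <= after:
                continue
            hits.append(r)
        return hits

    def track(self, entity):
        """Frames in which an entity appears, in time order."""
        return sorted({r.frame for r in self.relations
                       if entity in (r.subj, r.obj)})


def execute_plan(graph, steps, budget=5):
    """Run reasoning steps in order and collect at most `budget` frames.

    A step with "after_previous" set only matches frames later than the
    earliest hit of the previous step (a simple temporal hop).
    """
    anchor = None
    frames = []
    for step in steps:
        after = anchor if step.get("after_previous") else None
        hits = graph.match(step.get("subj"), step.get("pred"),
                           step.get("obj"), after)
        if not hits:
            return []  # the chain of hops cannot be completed
        anchor = hits[0].frame
        frames.extend(r.frame for r in hits)
    return sorted(set(frames))[:budget]


# Hand-written relations standing in for a parsed video.
graph = VideoGraph([
    Relation(3, "person_1", "picks_up", "cup_1"),
    Relation(10, "person_1", "next_to", "table_1"),
    Relation(42, "person_1", "drinks_from", "cup_1"),
    Relation(57, "person_1", "puts_down", "cup_1"),
    Relation(60, "dog_1", "next_to", "table_1"),
])

# Hypothetical decomposition of the query into two hops.
plan = [
    {"subj": "person_1", "pred": "picks_up", "obj": "cup_1"},
    {"subj": "person_1", "obj": "cup_1", "after_previous": True},
]

print(execute_plan(graph, plan))  # [3, 42, 57]
print(graph.track("table_1"))     # [10, 60]
```

The `budget` argument mirrors the finding that a handful of retrieved frames can be enough; here it simply truncates the time-ordered result, whereas a real system would presumably rank frames by relevance.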
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques
