RAVU: Retrieval Augmented Video Understanding with Compositional Reasoning over Graph

Sameer Malik; Moyuru Yamada; Ayush Singh; Dishank Aggarwal

arXiv:2505.03173 · cs.CV · May 7, 2025

TL;DR

RAVU introduces a retrieval-augmented framework that uses compositional reasoning over a spatio-temporal graph to improve understanding of long videos and answer complex queries more accurately.

Contribution

It presents a novel graph-based retrieval method with reasoning capabilities for enhanced long video comprehension in large multi-modal models.

Findings

01. Outperforms state-of-the-art methods on the NExT-QA and EgoSchema datasets.

02. Remains effective with a limited number of retrieved frames (5-10).

03. Enables multi-hop reasoning and object tracking across frames.

Abstract

Comprehending long videos remains a significant challenge for Large Multi-modal Models (LMMs). Current LMMs struggle to process videos that run from minutes to hours because they lack explicit memory and retrieval mechanisms. To address this limitation, we propose RAVU (Retrieval Augmented Video Understanding), a novel framework for video understanding enhanced by retrieval with compositional reasoning over a spatio-temporal graph. We construct a graph representation of the video, capturing both spatial and temporal relationships between entities. This graph serves as a long-term memory, allowing us to track objects and their actions across time. To answer complex queries, we decompose the queries into a sequence of reasoning steps and execute these steps on the graph, retrieving relevant key information. Our approach enables more accurate understanding of long videos, particularly for…
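The pipeline the abstract describes (build a graph, decompose the query into steps, execute the steps to retrieve frames) can be illustrated with a toy sketch. This is not the authors' implementation: the `VideoGraph` class, the step names `find`, `after`, and `limit`, and the hand-written entity labels are all hypothetical, and the decomposition of the question into steps is written by hand here rather than produced by a model.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class VideoGraph:
    """Toy spatio-temporal graph (hypothetical structure, for illustration only)."""
    # frame index -> entity ids visible in that frame
    entities: dict = field(default_factory=lambda: defaultdict(set))
    # frame index -> (subject, relation, object) triples, i.e. spatial edges
    relations: dict = field(default_factory=lambda: defaultdict(set))

    def add_relation(self, frame, subj, rel, obj):
        self.entities[frame].update({subj, obj})
        self.relations[frame].add((subj, rel, obj))

    def track(self, entity):
        """Temporal links are implicit: frames in which the same entity id reappears."""
        return sorted(f for f, ents in self.entities.items() if entity in ents)

    def frames_where(self, subj, rel, obj):
        """Frames whose spatial edges contain the given triple."""
        return sorted(f for f, triples in self.relations.items()
                      if (subj, rel, obj) in triples)


def execute(graph, steps):
    """Run reasoning steps in order; each step rewrites the candidate frame list."""
    frames = None
    for op, *args in steps:
        if op == "find":
            frames = graph.frames_where(*args)
        elif op == "after":
            (entity,) = args
            if not frames:
                return []
            last = max(frames)  # end of the event found by the previous step
            frames = [f for f in graph.track(entity) if f > last]
        elif op == "limit":
            (k,) = args
            frames = (frames or [])[:k]
        else:
            raise ValueError(f"unknown step: {op}")
    return frames or []


g = VideoGraph()
g.add_relation(3, "boy", "holds", "ball")
g.add_relation(4, "boy", "holds", "ball")
g.add_relation(7, "boy", "near", "dog")
g.add_relation(9, "dog", "holds", "ball")
g.add_relation(12, "dog", "on", "grass")

# Hand-decomposed form of "What does the dog do after the boy holds the ball?"
steps = [("find", "boy", "holds", "ball"), ("after", "dog"), ("limit", 2)]
retrieved = execute(g, steps)
print(retrieved)  # [7, 9]
```

In this sketch the second step depends on the result of the first, which is the sense in which the reasoning is multi-hop, and `track` stands in for following an object across frames. The short list of frame indices it returns plays the role of the retrieved key information that would then be handed to the LMM.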

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques