TL;DR
VideoStir is a structured, intent-aware retrieval framework for long videos. It uses clip-level spatio-temporal graphs and a large-scale dataset (IR-600K) to improve on retrieval that flattens videos into independent segments and relies on explicit semantic matching.
Contribution
It proposes a long-video RAG framework that structures videos as spatio-temporal graphs and incorporates intent-aware reasoning, supported by a new dataset, IR-600K.
Findings
VideoStir achieves competitive performance without auxiliary information.
Structured, intent-aware retrieval outperforms flattened semantic matching.
The IR-600K dataset enables effective frame-query intent alignment.
Abstract
Scaling multimodal large language models (MLLMs) to long videos is constrained by limited context windows. Retrieval-augmented generation (RAG) offers a promising remedy by organizing query-relevant visual evidence into a compact context, but most existing methods (i) flatten videos into independent segments, breaking their inherent spatio-temporal structure, and (ii) depend on explicit semantic matching, which can miss cues that are implicitly relevant to the query's intent. To overcome these limitations, we propose VideoStir, a structured and intent-aware long-video RAG framework. It first structures a video as a spatio-temporal graph at the clip level, and then performs multi-hop retrieval to aggregate evidence across distant yet contextually related events. Furthermore, it introduces an MLLM-backed intent-relevance scorer that retrieves frames based on their alignment with the query's…
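The abstract's idea of multi-hop retrieval over a clip-level graph can be illustrated with a small sketch. This is not the authors' implementation: the graph construction (temporal neighbors plus hand-supplied extra links), the cosine seed scoring, and the names `build_clip_graph` and `multi_hop_retrieve` are all assumptions made for illustration, and the MLLM-backed intent-relevance scorer is not modeled here.

```python
from collections import deque


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0


def build_clip_graph(num_clips, extra_edges=()):
    """Hypothetical clip graph: consecutive clips are linked (temporal edges),
    and extra_edges adds links between distant but related clips."""
    graph = {i: set() for i in range(num_clips)}
    for i in range(num_clips - 1):
        graph[i].add(i + 1)
        graph[i + 1].add(i)
    for u, v in extra_edges:
        graph[u].add(v)
        graph[v].add(u)
    return graph


def multi_hop_retrieve(graph, clip_vecs, query_vec, num_seeds=1, max_hops=2):
    """Pick the best-matching clips as seeds, then expand breadth-first
    along graph edges for up to max_hops steps."""
    ranked = sorted(graph, key=lambda c: cosine(clip_vecs[c], query_vec), reverse=True)
    seeds = ranked[:num_seeds]
    hops = {s: 0 for s in seeds}  # clip id -> hop distance from nearest seed
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # do not expand beyond the hop budget
        for neighbor in graph[node]:
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    return sorted(hops)


# Toy data: six clips with made-up 2-D embeddings; clips 0 and 5 are linked
# by an assumed long-range edge (e.g. the same scene revisited later).
graph = build_clip_graph(6, extra_edges=[(0, 5)])
clip_vecs = [[1.0, 0.0], [0.9, 0.1], [0.5, 0.5], [0.1, 0.9], [0.0, 1.0], [0.2, 0.8]]
retrieved = multi_hop_retrieve(graph, clip_vecs, query_vec=[1.0, 0.0], num_seeds=1, max_hops=1)
print(retrieved)
```

In this toy run the seed is clip 0, and one hop pulls in clip 5 through the long-range edge even though its embedding is far from the query. Flat similarity ranking would place clip 5 low, which is the kind of distant but contextually related evidence the abstract describes.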
