VideoStir: Understanding Long Videos via Spatio-Temporally Structured and Intent-Aware RAG

Honghao Fu; Miao Xu; Yiwei Wang; Dailing Zhang; Jun Liu; Yujun Cai

arXiv:2604.05418·cs.CV·April 17, 2026

TL;DR

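VideoStir is a structured, intent-aware retrieval-augmented generation (RAG) framework for long videos. It represents a video as a clip-level spatio-temporal graph and, supported by the large-scale IR-600K dataset, retrieves evidence by alignment with the query's intent rather than by flattened semantic matching.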

Contribution

The paper proposes a long-video RAG framework that structures videos as spatio-temporal graphs and incorporates intent-aware reasoning, supported by a new dataset, IR-600K.

Findings

01. VideoStir achieves competitive performance without auxiliary information.

02. Structured, intent-aware retrieval outperforms flattened semantic matching.

03. The IR-600K dataset enables effective frame-query intent alignment.

Abstract

Scaling multimodal large language models (MLLMs) to long videos is constrained by limited context windows. Retrieval-augmented generation (RAG) offers a promising remedy by organizing query-relevant visual evidence into a compact context, but most existing methods (i) flatten videos into independent segments, breaking their inherent spatio-temporal structure, and (ii) depend on explicit semantic matching, which can miss cues that are implicitly relevant to the query's intent. To overcome these limitations, we propose VideoStir, a structured and intent-aware long-video RAG framework. It first structures a video as a spatio-temporal graph at the clip level, and then performs multi-hop retrieval to aggregate evidence across distant yet contextually related events. Furthermore, it introduces an MLLM-backed intent-relevance scorer that retrieves frames based on their alignment with the query's…
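
To make the clip-level graph and multi-hop retrieval described in the abstract more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the names `build_clip_graph`, `multi_hop_retrieve`, `spatial_links`, and `seed_scores` are hypothetical, temporal edges are assumed to connect adjacent clips, spatial edges are assumed to link clips that share an entity or scene, and the hand-written seed scores stand in for whatever relevance scoring the framework actually uses.

```python
from collections import deque


def build_clip_graph(num_clips, spatial_links):
    """Build an undirected clip graph (hypothetical structure).

    Temporal edges join adjacent clips; spatial edges join clip pairs
    listed in `spatial_links` (e.g. clips showing the same entity).
    """
    graph = {i: set() for i in range(num_clips)}
    for i in range(num_clips - 1):
        graph[i].add(i + 1)
        graph[i + 1].add(i)
    for a, b in spatial_links:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def multi_hop_retrieve(graph, seed_scores, max_hops=1, top_k=1):
    """Expand from the top-k scored clips along edges, up to max_hops."""
    # Choose seeds by descending score; ties are broken by clip index.
    seeds = sorted(seed_scores, key=lambda c: (-seed_scores[c], c))[:top_k]
    hops = {clip: 0 for clip in seeds}
    queue = deque(seeds)
    while queue:
        clip = queue.popleft()
        if hops[clip] == max_hops:
            continue  # do not expand beyond the hop budget
        for neighbor in graph[clip]:
            if neighbor not in hops:
                hops[neighbor] = hops[clip] + 1
                queue.append(neighbor)
    # Return clips in temporal order (clip indices are assumed temporal).
    return sorted(hops)


# Eight clips; clip 1 and clip 6 are assumed to show the same entity.
graph = build_clip_graph(8, spatial_links=[(1, 6)])
# Placeholder relevance scores, assumed to come from some query scorer.
retrieved = multi_hop_retrieve(graph, {1: 0.9, 4: 0.1}, max_hops=1, top_k=1)
print(retrieved)  # [0, 1, 2, 6]
```

In this toy example, a flat nearest-segment lookup around clip 1 would return only its temporal neighbors, whereas the graph expansion also reaches the distant clip 6 through the spatial edge, which is the kind of distant yet contextually related evidence the abstract refers to.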

Code & Models

Repositories

RomGai/VideoStir (GitHub)
