Bridging Modalities, Spanning Time: Structured Memory for Ultra-Long Agentic Video Reasoning
Jiazheng Li, Chi-Hao Wu, Yunze Liu, Kaize Ding, Jundong Li, Chuxu Zhang

TL;DR
MAGIC-Video introduces a multimodal memory graph and narrative chain to enable long-range reasoning over ultra-long videos, improving retrieval and understanding across days or weeks.
Contribution
It presents a training-free framework that unifies episodic, semantic, and visual content for ultra-long video reasoning, outperforming existing baselines.
Findings
Outperforms strong baselines on the EgoLifeQA, Ego-R1, and MM-Lifelong benchmarks.
Achieves 10.1, 7.4, and 5.9 point improvements over prior systems.
Effectively handles modality and time dimensions in ultra-long video understanding.
Abstract
Understanding ultra-long videos, such as egocentric recordings, live streams, or surveillance footage spanning days to weeks, remains a challenge for current multimodal LLMs: even with million-token context windows, frame budgets cover only tens of minutes of densely sampled video, and most evidence is discarded before inference begins. Memory-augmented and agentic approaches help with scale, but their retrieval remains fragmented across modalities and lacks long-range narrative summaries that span days or weeks. We propose MAGIC-Video, a training-free framework built around a multimodal memory graph with an interleaved narrative chain: the graph unifies episodic, semantic, and visual content through six typed edges and supports cross-modal retrieval, while the chain distils long-horizon entity biographies and recurring activity events. At inference time, an agentic loop…
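To make the frame-budget claim concrete, here is a back-of-the-envelope sketch in Python. The per-frame token cost (256) and the sampling rates (1 and 2 frames per second) are illustrative assumptions, not figures from the paper; actual values depend on the model's visual tokenizer and sampling policy.

```python
# Rough estimate of how much video fits in a fixed context window.
# TOKENS_PER_FRAME and the fps values are assumptions for illustration only.

CONTEXT_TOKENS = 1_000_000      # "million-token" context window
TOKENS_PER_FRAME = 256          # assumed visual tokens per sampled frame
SECONDS_PER_WEEK = 7 * 24 * 3600


def frame_budget(context_tokens: int, tokens_per_frame: int) -> int:
    """Number of whole frames that fit in the context window."""
    return context_tokens // tokens_per_frame


def coverage_minutes(context_tokens: int, tokens_per_frame: int, fps: float) -> float:
    """Minutes of video covered when sampling at `fps` frames per second."""
    return frame_budget(context_tokens, tokens_per_frame) / fps / 60


frames = frame_budget(CONTEXT_TOKENS, TOKENS_PER_FRAME)
minutes_at_1fps = coverage_minutes(CONTEXT_TOKENS, TOKENS_PER_FRAME, fps=1.0)
minutes_at_2fps = coverage_minutes(CONTEXT_TOKENS, TOKENS_PER_FRAME, fps=2.0)

# Fraction of a week-long recording covered at 1 fps.
week_fraction = frames / (SECONDS_PER_WEEK * 1.0)

print(f"frames that fit: {frames}")
print(f"coverage at 1 fps: {minutes_at_1fps:.1f} min")
print(f"coverage at 2 fps: {minutes_at_2fps:.1f} min")
print(f"share of one week at 1 fps: {week_fraction:.2%}")
```

Under these assumptions the window holds roughly an hour of video at 1 fps and about half that at 2 fps, which is under one percent of a week-long recording. This is the gap that motivates storing evidence in an external memory instead of in the context window.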
