Bridging Modalities, Spanning Time: Structured Memory for Ultra-Long Agentic Video Reasoning

Jiazheng Li; Chi-Hao Wu; Yunze Liu; Kaize Ding; Jundong Li; Chuxu Zhang

arXiv:2605.08271·cs.CV·May 12, 2026

TL;DR

MAGIC-Video introduces a multimodal memory graph and narrative chain to enable long-range reasoning over ultra-long videos, improving retrieval and understanding across days or weeks.

Contribution

MAGIC-Video is a training-free framework that unifies episodic, semantic, and visual content for ultra-long video reasoning and outperforms existing baselines.

Findings

1. Outperforms strong baselines on EgoLifeQA, Ego-R1, and MM-Lifelong benchmarks.

2. Achieves 10.1, 7.4, and 5.9 point improvements over prior systems.

3. Effectively handles modality and time dimensions in ultra-long video understanding.

Abstract

Understanding ultra-long videos, such as egocentric recordings, live streams, or surveillance footage spanning days to weeks, remains a challenge for current multimodal LLMs: even with million-token context windows, frame budgets cover only tens of minutes of densely sampled video, and most evidence is discarded before inference begins. Memory-augmented and agentic approaches help with scale, but their retrieval remains fragmented across modalities and lacks long-range narrative summaries that span days or weeks. We propose MAGIC-Video, a training-free framework built around a multimodal memory graph with an interleaved narrative chain: the graph unifies episodic, semantic, and visual content through six typed edges and supports cross-modal retrieval, while the chain distils long-horizon entity biographies and recurring activity events. At inference time, an agentic loop…
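
The frame-budget claim in the abstract can be checked with back-of-the-envelope arithmetic. The sketch below is illustrative only: the figure of 256 visual tokens per frame and the 1 fps sampling rate are assumptions chosen for the example, not values taken from the paper, and the helper name `minutes_covered` is hypothetical.

```python
def minutes_covered(context_tokens: int, tokens_per_frame: int, fps: float) -> float:
    """Minutes of video whose sampled frames fit in the context window."""
    frames = context_tokens // tokens_per_frame  # whole frames that fit
    return frames / fps / 60.0


CONTEXT_TOKENS = 1_000_000  # "million-token" window mentioned in the abstract
TOKENS_PER_FRAME = 256      # assumption: varies widely between models
FPS = 1.0                   # assumption: one sampled frame per second

covered = minutes_covered(CONTEXT_TOKENS, TOKENS_PER_FRAME, FPS)
week_minutes = 7 * 24 * 60  # length of a one-week recording in minutes

print(f"{covered:.1f} minutes fit in context")
print(f"{covered / week_minutes:.2%} of a one-week recording")
```

Under these assumptions the window holds roughly an hour of video, a small fraction of a week-long recording, which is consistent with the abstract's point that most evidence is discarded before inference. Doubling the sampling rate halves the coverage.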

Code & Models

Repositories

lijiazheng0917/MAGIC-video
github

Datasets

jiazhengli7/magic-video-artifacts
dataset · 148 downloads
