PyraVid: Hierarchical Multimodal Memory for Long-Horizon Video Reasoning
Sikuan Yan, Sicheng Dong, Haotong Wang, Ercong Nie, Yilun Liu, Jinhe Bi, Yingjie Xu, Susanna Schwarzmann, Riccardo Trivisonno, Volker Tresp, Yunpu Ma

TL;DR
PyraVid is a hierarchical multimodal memory system inspired by Event Segmentation Theory from cognitive science. It organizes long videos into a coarse-to-fine pyramid for structured memory access and evidence aggregation, and it improves performance on multiple long-video benchmarks.
Contribution
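The paper proposes PyraVid, a hierarchical multimodal memory framework that organizes long videos into a coarse-to-fine pyramid for reasoning tasks. It targets three challenges of multimodal memory: heterogeneous input integration, person-centric information alignment, and evidence aggregation across different granularities.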
Findings
PyraVid consistently improves performance across multiple long-video benchmarks.
Hierarchical memory structure enables effective long-horizon reasoning.
Structure-guided memory expansion reduces noise and enhances relevant event retrieval.
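The findings above mention a hierarchical memory and structure-guided expansion that reduces noise. The toy sketch below illustrates that general idea only and is not the authors' implementation. The names `Node`, `score`, `retrieve`, and `threshold` are hypothetical, and the word-overlap relevance score is an assumption standing in for whatever multimodal matching the paper actually uses.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One memory entry: a time span of the video plus a text summary."""
    start: float
    end: float
    summary: str
    children: list["Node"] = field(default_factory=list)


def score(query: str, text: str) -> float:
    """Toy relevance (assumption): fraction of query words found in the text."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0


def retrieve(root: Node, query: str, threshold: float = 0.3) -> list[Node]:
    """Coarse-to-fine search over the pyramid.

    A node below the threshold is pruned together with its whole subtree;
    a node at or above it is expanded into its finer-grained children.
    The root is always expanded. Leaves that survive are returned in time order.
    """
    hits = []
    frontier = [root]
    while frontier:
        node = frontier.pop()
        if node is not root and score(query, node.summary) < threshold:
            continue  # prune: never look at this node's children
        if node.children:
            frontier.extend(node.children)  # expand to the next finer level
        else:
            hits.append(node)
    return sorted(hits, key=lambda n: n.start)


# Hypothetical two-level pyramid: coarse events, each split into finer clips.
video = Node(0, 120, "full video", [
    Node(0, 60, "alice cooks pasta in the kitchen", [
        Node(0, 30, "alice boils water"),
        Node(30, 60, "alice adds pasta to the pot"),
    ]),
    Node(60, 120, "bob repairs a bike in the garage", [
        Node(60, 90, "bob removes the wheel"),
        Node(90, 120, "bob patches the tire"),
    ]),
])

for clip in retrieve(video, "alice pasta"):
    print(clip.start, clip.end, clip.summary)
```

In this sketch the query "alice pasta" never touches the garage clips, because the coarse "bob" event scores zero and its subtree is skipped. That is the sense in which structure-guided expansion can cut noise compared with flat retrieval over every clip.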
Abstract
Memory has become an increasingly important component of agentic systems, as these systems are expected to reason over long-term experience. However, prior work has largely focused on unimodal memory, leaving multimodal memory relatively underexplored despite its central role in real-world applications. Compared with unimodal settings, multimodal memory introduces additional challenges, including heterogeneous input integration, person-centric information alignment, and evidence aggregation across different granularities. We present PyraVid, a hierarchical multimodal memory framework inspired by Event Segmentation Theory from cognitive science. PyraVid organizes long videos into a coarse-to-fine pyramid structure, enabling structured memory access and effective evidence aggregation. It further supports structure-guided memory expansion with pruning, allowing the retrieval of related…
