PyraVid: Hierarchical Multimodal Memory for Long-Horizon Video Reasoning

Sikuan Yan; Sicheng Dong; Haotong Wang; Ercong Nie; Yilun Liu; Jinhe Bi; Yingjie Xu; Susanna Schwarzmann; Riccardo Trivisonno; Volker Tresp; Yunpu Ma

arXiv:2605.17065 · cs.MA · May 19, 2026

TL;DR

PyraVid is a hierarchical multimodal memory system inspired by Event Segmentation Theory from cognitive science. It organizes long videos into a coarse-to-fine pyramid that structures memory access and evidence aggregation, and it improves performance on multiple long-video benchmarks.

Contribution

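The paper proposes PyraVid, a hierarchical multimodal memory framework that organizes long videos into a coarse-to-fine pyramid for reasoning tasks. It targets three challenges of multimodal memory: heterogeneous input integration, person-centric information alignment, and evidence aggregation across different granularities.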

Findings

1. PyraVid consistently improves performance across multiple long-video benchmarks.

2. Hierarchical memory structure enables effective long-horizon reasoning.

3. Structure-guided memory expansion reduces noise and enhances relevant event retrieval.

Abstract

Memory has become an increasingly important component of agentic systems, as these systems are expected to reason over long-term experience. However, prior work has largely focused on unimodal memory, leaving multimodal memory relatively underexplored despite its central role in real-world applications. Compared with unimodal settings, multimodal memory introduces additional challenges, including heterogeneous input integration, person-centric information alignment, and evidence aggregation across different granularities. We present PyraVid, a hierarchical multimodal memory framework inspired by Event Segmentation Theory from cognitive science. PyraVid organizes long videos into a coarse-to-fine pyramid structure, enabling structured memory access and effective evidence aggregation. It further supports structure-guided memory expansion with pruning, allowing the retrieval of related…
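To make the coarse-to-fine idea concrete, the following is a minimal sketch of a pyramid-shaped memory with top-down retrieval and pruning. It is not the authors' implementation: the `MemoryNode` class, the `retrieve` function, the `top_k` and `min_score` parameters, and the word-overlap scoring are all assumptions introduced for illustration, and a real system would presumably use multimodal embeddings instead of text overlap.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryNode:
    """One node of a hypothetical coarse-to-fine memory pyramid."""
    summary: str   # text description of this video span
    start: float   # span start, in seconds
    end: float     # span end, in seconds
    children: list["MemoryNode"] = field(default_factory=list)


def overlap_score(query: str, text: str) -> float:
    """Toy relevance: Jaccard overlap of word sets (stand-in for embedding similarity)."""
    q, t = set(query.lower().split()), set(text.lower().split())
    union = q | t
    return len(q & t) / len(union) if union else 0.0


def retrieve(roots, query, top_k=1, min_score=0.05):
    """Descend from coarse to fine nodes, pruning weak branches at each level."""
    frontier = list(roots)
    leaves = []
    while frontier:
        ranked = sorted(frontier, key=lambda n: overlap_score(query, n.summary), reverse=True)
        # Pruning (assumed rule): keep the top_k nodes that also clear min_score.
        kept = [n for n in ranked[:top_k] if overlap_score(query, n.summary) >= min_score]
        frontier = []
        for node in kept:
            if node.children:
                frontier.extend(node.children)  # expand into finer-grained events
            else:
                leaves.append(node)             # finest level reached
    return leaves


# Two coarse events, each split into finer sub-events (made-up example data).
kitchen = MemoryNode("alice cooks dinner in the kitchen", 0, 600, [
    MemoryNode("alice chops onions", 0, 120),
    MemoryNode("alice boils pasta", 120, 600),
])
garden = MemoryNode("bob waters plants in the garden", 600, 900, [
    MemoryNode("bob fills the watering can", 600, 660),
    MemoryNode("bob waters the roses", 660, 900),
])

hits = retrieve([kitchen, garden], "when does alice boil pasta")
for node in hits:
    print(node.summary, node.start, node.end)
```

In this sketch the query never touches the fine-grained garden events, because the coarse garden node is pruned first; that is the intuition behind using structure to limit noise during retrieval.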
