From Verbatim to Gist: Distilling Pyramidal Multimodal Memory via Semantic Information Bottleneck for Long-Horizon Video Agents

Niu Lian; Yuting Wang; Hanshu Yao; Jinpeng Wang; Bin Chen; Yaowei Wang; Min Zhang; Shu-Tao Xia

arXiv:2603.01455·cs.CV·April 22, 2026

TL;DR

This paper introduces MM-Mem, a hierarchical multimodal memory system inspired by human cognition, which improves long-horizon video understanding by balancing detailed perception and semantic abstraction.

Contribution

It proposes a pyramidal memory architecture with a semantic information bottleneck and a novel retrieval strategy, advancing long-term video reasoning capabilities.

Findings

01. Achieves state-of-the-art results on 4 video understanding benchmarks.

02. Demonstrates robust generalization in long-horizon tasks.

03. Validates the effectiveness of cognition-inspired memory organization.

Abstract

While multimodal large language models have demonstrated impressive short-term reasoning, they struggle with long-horizon video understanding due to limited context windows and static memory mechanisms that fail to mirror human cognitive efficiency. Existing paradigms typically fall into two extremes: vision-centric methods that incur high latency and redundancy through dense visual accumulation, or text-centric approaches that suffer from detail loss and hallucination via aggressive captioning. To bridge this gap, we propose MM-Mem, a pyramidal multimodal memory architecture grounded in Fuzzy-Trace Theory. MM-Mem structures memory hierarchically into a Sensory Buffer, Episodic Stream, and Symbolic Schema, enabling the progressive distillation of fine-grained perceptual traces (verbatim) into high-level semantic schemas (gist). Furthermore, to govern the dynamic construction of memory,…
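The three-level hierarchy named in the abstract (Sensory Buffer, Episodic Stream, Symbolic Schema) can be pictured with a small toy data structure. The sketch below is only an illustration of progressive verbatim-to-gist distillation, not the authors' implementation: the class name `PyramidalMemory`, the parameters `buffer_size` and `episodes_per_schema`, the mean-pooling step, and the string summaries are all hypothetical stand-ins. In particular, mean pooling is a placeholder and does not implement the paper's semantic information bottleneck.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PyramidalMemory:
    """Toy three-level memory. All names and rules here are illustrative assumptions."""

    buffer_size: int = 4          # hypothetical: frames pooled into one episode
    episodes_per_schema: int = 2  # hypothetical: episodes summarized per schema entry
    sensory_buffer: List[List[float]] = field(default_factory=list)   # verbatim traces
    episodic_stream: List[List[float]] = field(default_factory=list)  # pooled episodes
    symbolic_schema: List[str] = field(default_factory=list)          # gist-level entries
    _pending: int = 0             # episodes not yet folded into a schema entry

    def observe(self, frame_feature: List[float]) -> None:
        """Store one frame feature; distill upward once the buffer is full."""
        self.sensory_buffer.append(frame_feature)
        if len(self.sensory_buffer) >= self.buffer_size:
            self._distill_to_episode()

    def _distill_to_episode(self) -> None:
        """Placeholder distillation: mean-pool buffered frames into one episode vector."""
        n = len(self.sensory_buffer)
        dim = len(self.sensory_buffer[0])
        pooled = [sum(f[i] for f in self.sensory_buffer) / n for i in range(dim)]
        self.episodic_stream.append(pooled)
        self.sensory_buffer.clear()
        self._pending += 1
        if self._pending >= self.episodes_per_schema:
            self._distill_to_schema()

    def _distill_to_schema(self) -> None:
        """Placeholder abstraction: record which episodes a schema entry covers."""
        last = len(self.episodic_stream) - 1
        first = last - self._pending + 1
        self.symbolic_schema.append(f"episodes {first}-{last}")
        self._pending = 0


# Usage: feed eight 2-dimensional frame features through the toy memory.
memory = PyramidalMemory()
for t in range(8):
    memory.observe([float(t), 1.0])
```

With these default settings, eight frames are pooled into two episode vectors, which are then covered by a single schema entry, while the sensory buffer is emptied after each pooling step. A real system would replace both placeholder steps with learned or model-driven summarization.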

Code & Models

Repositories

EliSpectre/MM-Mem (GitHub)
