TL;DR
This paper introduces MM-Mem, a hierarchical multimodal memory system inspired by human cognition, which improves long-horizon video understanding by balancing detailed perception and semantic abstraction.
Contribution
It proposes a pyramidal memory architecture with a semantic information bottleneck and a novel retrieval strategy, advancing long-term video reasoning capabilities.
Findings
Achieves state-of-the-art results on four video understanding benchmarks.
Demonstrates robust generalization in long-horizon tasks.
Validates the effectiveness of cognition-inspired memory organization.
Abstract
While multimodal large language models have demonstrated impressive short-term reasoning, they struggle with long-horizon video understanding due to limited context windows and static memory mechanisms that fail to mirror human cognitive efficiency. Existing paradigms typically fall into two extremes: vision-centric methods that incur high latency and redundancy through dense visual accumulation, or text-centric approaches that suffer from detail loss and hallucination via aggressive captioning. To bridge this gap, we propose MM-Mem, a pyramidal multimodal memory architecture grounded in Fuzzy-Trace Theory. MM-Mem structures memory hierarchically into a Sensory Buffer, Episodic Stream, and Symbolic Schema, enabling the progressive distillation of fine-grained perceptual traces (verbatim) into high-level semantic schemas (gist). Furthermore, to govern the dynamic construction of memory,…
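As a rough illustration of the three-level organization named in the abstract, the following is a minimal sketch in Python. Only the level names (Sensory Buffer, Episodic Stream, Symbolic Schema) come from the abstract. The class `PyramidMemory`, the parameters `buffer_size` and `episodes_per_schema`, and the string-joining "summarizers" are hypothetical stand-ins, not the authors' method. The rule MM-Mem actually uses to decide when and how to distill is not shown in the truncated abstract and is not modeled here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PyramidMemory:
    """Toy three-level memory. Level names follow the abstract; all logic is assumed."""
    buffer_size: int = 4          # hypothetical: traces held before forming an episode
    episodes_per_schema: int = 2  # hypothetical: episodes merged into one schema entry
    sensory_buffer: List[str] = field(default_factory=list)   # fine-grained (verbatim) traces
    episodic_stream: List[str] = field(default_factory=list)  # mid-level episode summaries
    symbolic_schema: List[str] = field(default_factory=list)  # high-level (gist) entries
    _pending: int = 0             # episodes not yet folded into a schema entry

    def observe(self, frame_description: str) -> None:
        """Add one perceptual trace; distill upward when the buffer is full."""
        self.sensory_buffer.append(frame_description)
        if len(self.sensory_buffer) >= self.buffer_size:
            self._distill_episode()

    def _distill_episode(self) -> None:
        # Placeholder summarizer: keep only the first and last trace of the buffer.
        episode = f"{self.sensory_buffer[0]} -> {self.sensory_buffer[-1]}"
        self.episodic_stream.append(episode)
        self.sensory_buffer.clear()
        self._pending += 1
        if self._pending >= self.episodes_per_schema:
            self._distill_schema()

    def _distill_schema(self) -> None:
        # Placeholder abstraction: join the episodes that have not been merged yet.
        recent = self.episodic_stream[-self._pending:]
        self.symbolic_schema.append(" | ".join(recent))
        self._pending = 0


memory = PyramidMemory()
for i in range(8):
    memory.observe(f"frame{i}")

print(memory.episodic_stream)
print(memory.symbolic_schema)
```

In this sketch, each level stores progressively less detail than the one below it, which mirrors the verbatim-to-gist distillation the abstract describes. A real system would replace the placeholder summarizers with learned or model-driven components.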
