TL;DR
This paper introduces Visual Agentic Memory (VAM), a training-free framework for long video understanding that combines online indexing, hierarchical memory organization, and agentic retrieval of visual evidence.
Contribution
VAM is a new framework that explicitly manages visual memory for long videos, improving retrieval and reasoning over extended temporal horizons without additional training.
Findings
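VAM achieves the highest RT+BT average (68.41) on OVO-Bench across all reported baselines, 0.95 points above end-to-end use of the same underlying MLLM (Gemini 3 Flash, 67.46).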
On MM-Lifelong train@month, VAM reaches 17.11%, second only to ReMA with GPT-5.
The results indicate that explicit, inspectable visual memory benefits long-horizon video understanding.
Abstract
Long video understanding requires more than large context windows. It also needs a memory mechanism that decides what visual evidence to retain, keeps it searchable over long horizons, and grounds later reasoning in recoverable observations rather than compressed latent state alone. We propose Visual Agentic Memory (VAM), a training-free framework with three components. Online Indexing supports selective evidence retention under streaming constraints. Hierarchical Memory organises retained evidence in a Parallel Representation that aligns temporal context with spatial observations. Agentic Retrieval searches, inspects, and verifies candidate evidence before producing a grounded answer. On OVO-Bench, VAM achieves the highest RT+BT average (68.41) across all reported baselines, improving over end-to-end use of the same underlying MLLM (Gemini 3 Flash, 67.46). On the month-scale split of…
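The three components named in the abstract can be pictured with a small toy sketch. It is not the authors' implementation: the class and function names (`Frame`, `Segment`, `ToyVisualMemory`, `index`, `retrieve`), the cosine-novelty rule, the fixed-length segments, and the keyword check are all assumptions made for illustration, and plain Python lists stand in for real visual embeddings and MLLM calls.

```python
from dataclasses import dataclass, field
from math import sqrt
from typing import List, Optional


@dataclass
class Frame:
    t: float            # timestamp in seconds
    feat: List[float]   # toy vector standing in for a visual embedding (assumption)
    caption: str        # toy text standing in for the spatial observation (assumption)


@dataclass
class Segment:
    start: float                                        # temporal context of the segment
    frames: List[Frame] = field(default_factory=list)   # retained observations


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class ToyVisualMemory:
    """Hypothetical two-level memory: segments (time) holding frames (observations)."""

    def __init__(self, novelty_threshold: float = 0.9, segment_seconds: float = 10.0):
        self.novelty_threshold = novelty_threshold
        self.segment_seconds = segment_seconds
        self.segments: List[Segment] = []
        self._last_kept: Optional[Frame] = None

    def index(self, frame: Frame) -> bool:
        """Online indexing (assumed rule): drop frames too similar to the last kept one."""
        if (self._last_kept is not None
                and cosine(frame.feat, self._last_kept.feat) >= self.novelty_threshold):
            return False
        # Open a new segment when the current one spans segment_seconds or more.
        if not self.segments or frame.t - self.segments[-1].start >= self.segment_seconds:
            self.segments.append(Segment(start=frame.t))
        self.segments[-1].frames.append(frame)
        self._last_kept = frame
        return True

    def retrieve(self, query_feat: List[float], keyword: str, top_k: int = 2) -> Optional[Frame]:
        """Agentic retrieval (assumed loop): search segments, inspect frames, verify."""
        # Search: rank segments by their best-matching retained frame.
        ranked = sorted(
            self.segments,
            key=lambda s: max(cosine(query_feat, f.feat) for f in s.frames),
            reverse=True,
        )
        for seg in ranked[:top_k]:
            # Inspect: look at frames in the segment, most similar first.
            for f in sorted(seg.frames, key=lambda f: cosine(query_feat, f.feat), reverse=True):
                # Verify: a keyword check stands in for an MLLM confirming the evidence.
                if keyword in f.caption:
                    return f
        return None


mem = ToyVisualMemory()
stream = [
    Frame(0.0, [1.0, 0.0], "person enters kitchen"),
    Frame(1.0, [1.0, 0.01], "person enters kitchen"),   # near-duplicate of the first frame
    Frame(12.0, [0.0, 1.0], "red mug on table"),
    Frame(13.0, [0.6, 0.8], "person holds red mug"),
]
kept = [mem.index(f) for f in stream]
evidence = mem.retrieve([0.0, 1.0], "mug")
```

In this sketch the answer is grounded in a stored `Frame` that can be inspected afterwards, which mirrors the abstract's emphasis on recoverable observations rather than compressed latent state alone.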
