Visual Agentic Memory: Enabling Online Long Video Understanding via Online Indexing, Hierarchical Memory, and Agentic Retrieval

Aiden Yiliu Li; Nels Numan; Anthony Steed

arXiv:2605.16481 · cs.CV · May 19, 2026

TL;DR

This paper introduces Visual Agentic Memory (VAM), a novel, training-free framework that enhances long video understanding by enabling online indexing, hierarchical memory organization, and agentic retrieval of visual evidence.

Contribution

VAM is a new framework that explicitly manages visual memory for long videos, improving retrieval and reasoning over extended temporal horizons without additional training.

Findings

01 On OVO-Bench, VAM achieves the highest RT+BT average (68.41) among all reported baselines.

02 On MM-Lifelong train@month, VAM reaches 17.11%, second only to ReMA with GPT-5.

03 The results indicate that explicit, inspectable visual memory benefits long-horizon video understanding.

Abstract

Long video understanding requires more than large context windows. It also needs a memory mechanism that decides what visual evidence to retain, keeps it searchable over long horizons, and grounds later reasoning in recoverable observations rather than compressed latent state alone. We propose Visual Agentic Memory (VAM), a training-free framework with three components. Online Indexing supports selective evidence retention under streaming constraints. Hierarchical Memory organises retained evidence in a Parallel Representation that aligns temporal context with spatial observations. Agentic Retrieval searches, inspects, and verifies candidate evidence before producing a grounded answer. On OVO-Bench, VAM achieves the highest RT+BT average (68.41) across all reported baselines, improving over end-to-end use of the same underlying MLLM (Gemini 3 Flash, 67.46). On the month-scale split of…
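
The three components named in the abstract can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and is not taken from the paper or its repository: the class name `ToyMemory`, the cosine-novelty rule for deciding what to retain, fixed-length time segments as the temporal level, mean-pooled segment summaries, and score thresholding as a stand-in for verification. In a real system the frame features would come from a visual encoder and verification would involve an MLLM; here they are plain Python lists and a threshold.

```python
from dataclasses import dataclass, field
from math import sqrt


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class Frame:
    t: float    # timestamp in seconds
    feat: list  # hypothetical frame embedding


@dataclass
class Segment:
    start: float
    frames: list = field(default_factory=list)

    def summary(self):
        """Mean of frame features: a stand-in for temporal-level context."""
        n = len(self.frames)
        dim = len(self.frames[0].feat)
        return [sum(f.feat[i] for f in self.frames) / n for i in range(dim)]


class ToyMemory:
    """Hypothetical two-level memory: time segments holding retained frames."""

    def __init__(self, novelty=0.9, seg_len=10.0):
        self.novelty = novelty  # assumed similarity cutoff for dropping frames
        self.seg_len = seg_len  # assumed segment length in seconds
        self.segments = []
        self._last = None       # feature of the most recently retained frame

    def index(self, t, feat):
        """Online indexing: retain a frame only if it differs from the last kept one."""
        if self._last is not None and cosine(self._last, feat) >= self.novelty:
            return False
        if not self.segments or t - self.segments[-1].start >= self.seg_len:
            self.segments.append(Segment(start=t))
        self.segments[-1].frames.append(Frame(t, feat))
        self._last = feat
        return True

    def retrieve(self, query, top_k=2, min_score=0.5):
        """Search segments, inspect their frames, keep only evidence above min_score."""
        ranked = sorted(self.segments,
                        key=lambda s: cosine(s.summary(), query),
                        reverse=True)
        evidence = []
        for seg in ranked[:top_k]:
            for fr in seg.frames:
                score = cosine(fr.feat, query)
                if score >= min_score:  # crude stand-in for verification
                    evidence.append((fr.t, score))
        return sorted(evidence, key=lambda e: e[1], reverse=True)
```

Feeding frames one at a time through `index` and later calling `retrieve` with a query vector returns timestamped evidence, which is the property the abstract emphasises: answers are grounded in recoverable observations rather than in a compressed state alone.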

Code & Models

Repositories

yiliu-li/Visual-Agentic-Memory (GitHub)
