Vision-Language Memory for Spatial Reasoning

Zuntao Liu; Yi Du; Taimeng Fu; Shaoshu Su; Cherie Ho; Chen Wang

arXiv:2511.20644·cs.CV·November 26, 2025

TL;DR

VLM$^2$ adds persistent memory to a vision-language model so that it maintains a 3D-aware understanding of a video over time, improving spatial reasoning and achieving state-of-the-art results on spatial reasoning benchmarks.

Contribution

The paper proposes VLM$^2$, a vision-language model with a dual-memory module (a working memory and an episodic memory) for long-horizon spatial reasoning from 2D video. It targets two limitations of current models: semantic-geometric misalignment and the lack of persistent memory.

Findings

01. Achieves state-of-the-art performance on multiple spatial reasoning benchmarks.

02. Demonstrates effective long-horizon reasoning with fixed computational cost.

03. Outperforms existing video-only models in spatial understanding tasks.

Abstract

Spatial reasoning is a critical capability for intelligent robots, yet current vision-language models (VLMs) still fall short of human-level performance in video-based spatial reasoning. This gap mainly stems from two challenges: a semantic-geometric misalignment that prevents consistent 3D understanding, and the absence of persistent memory to retain 3D representation and understanding over time. To address these limitations, we present VLM$^2$, a Vision-Language Model with persistent Memory for spatial reasoning with a view-consistent, 3D-aware representation purely from 2D video. Specifically, to enhance long-horizon reasoning, we incorporate a dual-memory module, consisting of a working memory that operates as a sliding window to focus on immediate context, and an episodic memory that consolidates and stores critical long-term information. This design enables efficient and…
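
The dual-memory idea described in the abstract can be illustrated with a small toy sketch. This is not the paper's implementation: the abstract does not say how frames are represented or how consolidation works, so the class name `DualMemory`, the plain-list feature vectors, the `window` and `capacity` parameters, and the merge-by-averaging rule are all assumptions made for illustration. The sketch only shows how a sliding-window working memory plus a bounded episodic store keeps the context size fixed as the video gets longer.

```python
from collections import deque
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class DualMemory:
    """Toy dual memory (hypothetical; not the VLM^2 architecture itself)."""

    def __init__(self, window=4, capacity=8):
        if window < 1 or capacity < 1:
            raise ValueError("window and capacity must be at least 1")
        # Working memory: sliding window over the most recent frame features.
        self.working = deque(maxlen=window)
        # Episodic memory: bounded long-term store.
        self.episodic = []
        self.capacity = capacity

    def observe(self, feat):
        """Add one frame feature; consolidate the frame about to be evicted."""
        if len(self.working) == self.working.maxlen:
            self._consolidate(self.working[0])
        self.working.append(list(feat))

    def _consolidate(self, feat):
        # Assumed rule: store until full, then average the outgoing frame
        # into its most similar episodic entry so the store never grows.
        if len(self.episodic) < self.capacity:
            self.episodic.append(list(feat))
            return
        idx = max(range(len(self.episodic)),
                  key=lambda i: cosine(self.episodic[i], feat))
        self.episodic[idx] = [(x + y) / 2
                              for x, y in zip(self.episodic[idx], feat)]

    def context(self):
        """Entries a reasoning step would attend to: long-term, then recent."""
        return self.episodic + list(self.working)
```

Under these assumptions, `context()` never returns more than `window + capacity` entries no matter how many frames are passed to `observe`, which is one simple way to get the fixed per-step cost that the findings mention.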

Taxonomy

Topics: Multimodal Machine Learning Applications · Constraint Satisfaction and Optimization · Robotics and Sensor-Based Localization