Vision-Language Memory for Spatial Reasoning
Zuntao Liu, Yi Du, Taimeng Fu, Shaoshu Su, Cherie Ho, Chen Wang

TL;DR
VLM$^2$ introduces a persistent memory architecture for vision-language models, significantly improving spatial reasoning in videos by maintaining 3D-aware understanding over time, and achieving state-of-the-art results.
Contribution
The paper proposes VLM$^2$, a vision-language model with a dual-memory module (a working memory and an episodic memory) that improves long-horizon spatial reasoning from 2D video. It targets two problems: semantic-geometric misalignment and the lack of persistent memory.
Findings
Achieves state-of-the-art performance on multiple spatial reasoning benchmarks.
Demonstrates effective long-horizon reasoning with fixed computational cost.
Outperforms existing video-only models in spatial understanding tasks.
Abstract
Spatial reasoning is a critical capability for intelligent robots, yet current vision-language models (VLMs) still fall short of human-level performance in video-based spatial reasoning. This gap mainly stems from two challenges: a semantic-geometric misalignment that prevents consistent 3D understanding, and the absence of persistent memory to retain 3D representation and understanding over time. To address these limitations, we present VLM$^2$, a Vision-Language Model with persistent Memory for spatial reasoning with a view-consistent, 3D-aware representation purely from 2D video. Specifically, to enhance long-horizon reasoning, we incorporate a dual-memory module, consisting of a working memory that operates as a sliding window to focus on immediate context, and an episodic memory that consolidates and stores critical long-term information. This design enables efficient and…
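The dual-memory idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the class name `DualMemory`, the parameters `window_size` and `episodic_capacity`, the scalar `score` per frame, and the "drop the lowest-scoring entry" consolidation rule are all assumptions made for illustration. The sketch only shows how a sliding-window working memory combined with a bounded episodic store keeps the context size fixed no matter how long the video is.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List, Tuple

Entry = Tuple[int, float]  # (frame_id, importance score); both hypothetical


@dataclass
class DualMemory:
    """Toy dual memory: sliding-window working memory plus bounded episodic memory."""

    window_size: int = 4        # assumed working-memory length
    episodic_capacity: int = 8  # assumed episodic-memory budget
    working: Deque[Entry] = field(default_factory=deque)
    episodic: List[Entry] = field(default_factory=list)

    def add(self, frame_id: int, score: float) -> None:
        """Push a new frame; frames leaving the window are consolidated."""
        self.working.append((frame_id, score))
        if len(self.working) > self.window_size:
            self._consolidate(self.working.popleft())

    def _consolidate(self, entry: Entry) -> None:
        """Store an evicted frame, then enforce the episodic budget.

        Assumed rule: when over budget, drop the lowest-scoring entry.
        """
        self.episodic.append(entry)
        if len(self.episodic) > self.episodic_capacity:
            weakest = min(range(len(self.episodic)),
                          key=lambda i: self.episodic[i][1])
            self.episodic.pop(weakest)

    def context(self) -> List[Entry]:
        """Long-term entries followed by the recent window (bounded size)."""
        return self.episodic + list(self.working)


# Usage: seven frames streamed through a small memory.
memory = DualMemory(window_size=3, episodic_capacity=2)
for frame_id, score in enumerate([0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.4]):
    memory.add(frame_id, score)
```

In this sketch the context never holds more than `window_size + episodic_capacity` entries, which is one simple way to read the "fixed computational cost" claim. A real system would store feature tokens rather than scalar scores and would likely use a learned consolidation step.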
Taxonomy
Topics: Multimodal Machine Learning Applications · Constraint Satisfaction and Optimization · Robotics and Sensor-Based Localization
