SpatialMem: Metric-Aligned Long-Horizon Video Memory for Language Grounding and QA

Xinyi Zheng; Yunze Liu; Chi-Hao Wu; Fan Zhang; Hao Zheng; Wenqi Zhou; Walterio W. Mayol-Cuevas; Junxiao Shen

arXiv:2601.14895 · cs.CV · March 9, 2026



TL;DR

SpatialMem is a novel memory system that builds a metric-aligned spatial scaffold from egocentric video, enabling interpretable, spatially grounded retrieval and question answering for indoor scenes without specialized sensors.

Contribution

The paper introduces SpatialMem, a hierarchical, metric-aligned memory system for long-horizon, language-grounded video understanding and retrieval in indoor environments.

Findings

01. Maintains stable layout reasoning and retrieval across cluttered scenes.

02. Enhances path-level grounding with a two-layer description memory.

03. Shows only limited degradation under moderate scale perturbation.

Abstract

We present SpatialMem, a memory-centric system for long-horizon, language-grounded retrieval and QA from egocentric video, where metric 3D serves as an interpretable indexing scaffold rather than an explicit mapping objective. Starting from casually captured egocentric RGB video, SpatialMem builds a metric-aligned spatial scaffold for indoor scenes, detects structural 3D anchors (walls, doors, windows) as first-layer support, and populates a hierarchical memory with open-vocabulary object nodes that link evidence patches, visual embeddings, and two-layer textual descriptions to 3D coordinates for compact storage and fast retrieval. This design enables interpretable, spatially grounded queries over relations (e.g., distance, direction, visibility) and supports downstream tasks such as language-guided retrieval/QA and offline navigation-style guidance over a prebuilt memory, without…
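To make the memory layout described in the abstract more concrete, here is a minimal Python sketch of structural anchors holding object nodes, with two of the relation queries the abstract mentions (distance and direction). It is an illustration under stated assumptions, not the authors' implementation: the class names `AnchorNode` and `ObjectNode`, the field names, the helper functions, and the axis convention (x right, y forward, z up, in metres) are all hypothetical, and visibility queries are not covered.

```python
import math
from dataclasses import dataclass, field
from typing import List, Tuple

Point3D = Tuple[float, float, float]  # metres; axis convention is an assumption


@dataclass
class ObjectNode:
    """Hypothetical open-vocabulary object entry tied to a 3D coordinate."""
    label: str
    position: Point3D
    short_desc: str                      # first description layer (brief)
    long_desc: str                       # second description layer (detailed)
    embedding: List[float] = field(default_factory=list)  # visual embedding


@dataclass
class AnchorNode:
    """Hypothetical structural anchor (wall, door, window) with attached objects."""
    kind: str
    position: Point3D
    objects: List[ObjectNode] = field(default_factory=list)


def distance_m(a: Point3D, b: Point3D) -> float:
    """Euclidean distance between two scaffold points, in metres."""
    return math.dist(a, b)


def bearing_deg(src: Point3D, dst: Point3D) -> float:
    """Horizontal bearing from src to dst: 0 is +y (forward), 90 is +x (right)."""
    dx, dy = dst[0] - src[0], dst[1] - src[1]
    return math.degrees(math.atan2(dx, dy)) % 360.0


def objects_within(anchor: AnchorNode, radius_m: float) -> List[ObjectNode]:
    """Objects attached to an anchor that lie within radius_m of it, nearest first."""
    hits = [o for o in anchor.objects
            if distance_m(anchor.position, o.position) <= radius_m]
    return sorted(hits, key=lambda o: distance_m(anchor.position, o.position))


# Toy scene: one door anchor with two objects attached to it.
door = AnchorNode(kind="door", position=(0.0, 0.0, 0.0))
chair = ObjectNode("chair", (1.0, 0.0, 0.0), "a chair", "a wooden chair by the door")
mug = ObjectNode("mug", (3.0, 4.0, 0.0), "a mug", "a white mug on the far table")
door.objects.extend([chair, mug])

near_door = [o.label for o in objects_within(door, 2.0)]
mug_distance = distance_m(door.position, mug.position)
chair_bearing = bearing_deg(door.position, chair.position)
```

In this toy scene, the radius query around the door returns only the chair, and the mug lies 5 m from the door. Storing coordinates in a shared metric frame is what lets such relation queries reduce to plain geometry over the stored nodes.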


Taxonomy

Topics: Multimodal Machine Learning Applications · Robotics and Sensor-Based Localization · Spatial Cognition and Navigation