SpatialMem: Metric-Aligned Long-Horizon Video Memory for Language Grounding and QA
Xinyi Zheng, Yunze Liu, Chi-Hao Wu, Fan Zhang, Hao Zheng, Wenqi Zhou, Walterio W. Mayol-Cuevas, Junxiao Shen

TL;DR
SpatialMem is a novel memory system that builds a metric-aligned spatial scaffold from egocentric video, enabling interpretable, spatially grounded retrieval and question answering for indoor scenes without specialized sensors.
Contribution
The paper introduces SpatialMem, a hierarchical, metric-aligned memory system for long-horizon, language-grounded video understanding and retrieval in indoor environments.
Findings
Maintains stable layout reasoning and retrieval across cluttered scenes.
Enhances path-level grounding with two-layer description memory.
Shows only limited degradation under moderate scale perturbation.
Abstract
We present SpatialMem, a memory-centric system for long-horizon, language-grounded retrieval and QA from egocentric video, where metric 3D serves as an interpretable indexing scaffold rather than an explicit mapping objective. Starting from casually captured egocentric RGB video, SpatialMem builds a metric-aligned spatial scaffold for indoor scenes, detects structural 3D anchors (walls, doors, windows) as first-layer support, and populates a hierarchical memory with open-vocabulary object nodes that link evidence patches, visual embeddings, and two-layer textual descriptions to 3D coordinates for compact storage and fast retrieval. This design enables interpretable, spatially grounded queries over relations (e.g., distance, direction, visibility) and supports downstream tasks such as language-guided retrieval/QA and offline navigation-style guidance over a prebuilt memory, without…
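The abstract describes memory nodes that tie descriptions to 3D coordinates and answer relational queries such as distance and direction. The sketch below shows roughly what such a lookup could look like. It is illustrative only and not the authors' implementation: the `MemoryNode` class, its field names, the helper functions, and the convention that coordinates are metres with z pointing up are all assumptions made for this example.

```python
import math
from dataclasses import dataclass, field


@dataclass
class MemoryNode:
    """Hypothetical memory entry: a label and descriptions tied to a 3D point."""
    label: str
    position: tuple          # (x, y, z) in metres; z-up is an assumption
    short_desc: str = ""     # first description layer (brief)
    long_desc: str = ""      # second description layer (detailed)
    children: list = field(default_factory=list)  # nodes attached to this anchor


def distance(a: MemoryNode, b: MemoryNode) -> float:
    """Euclidean distance between two nodes, in metres."""
    return math.dist(a.position, b.position)


def bearing_deg(a: MemoryNode, b: MemoryNode) -> float:
    """Horizontal direction from a to b, in degrees counterclockwise from +x."""
    dx = b.position[0] - a.position[0]
    dy = b.position[1] - a.position[1]
    return math.degrees(math.atan2(dy, dx)) % 360.0


def nearest(anchor: MemoryNode, label: str):
    """Closest child of `anchor` with the given label, or None if absent."""
    candidates = [n for n in anchor.children if n.label == label]
    return min(candidates, key=lambda n: distance(anchor, n), default=None)


# A structural anchor (first layer) with two object nodes attached to it.
wall = MemoryNode("wall", (0.0, 0.0, 0.0), "north wall")
chair = MemoryNode("chair", (3.0, 4.0, 0.0), "grey office chair",
                   "grey office chair beside the desk")
lamp = MemoryNode("lamp", (1.0, 0.0, 0.0), "floor lamp")
wall.children += [chair, lamp]

print(distance(wall, chair))      # metric distance from the wall anchor
print(bearing_deg(wall, lamp))    # direction of the lamp from the wall
print(nearest(wall, "chair").short_desc)
```

Storing positions in a shared metric frame is what lets distance and direction be computed directly from stored coordinates instead of being re-estimated from video at query time.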
Taxonomy
Topics: Multimodal Machine Learning Applications · Robotics and Sensor-Based Localization · Spatial Cognition and Navigation
