SpaMEM: Benchmarking Dynamic Spatial Reasoning via Perception-Memory Integration in Embodied Environments

Chih-Ting Liao; Xi Xiao; Chunlei Meng; Zhangquan Chen; Yitong Qiao; Weilin Zhou; Tianyang Wang; Xu Zheng; Xin Cao

arXiv:2604.22409·cs.CV·April 27, 2026


TL;DR

SpaMEM is a large-scale diagnostic benchmark for evaluating spatial reasoning and memory in embodied AI environments; it highlights current models' limitations in maintaining spatial beliefs over long horizons.

Contribution

Introduces SpaMEM, a large-scale diagnostic benchmark with a novel three-level hierarchy to assess embodied spatial reasoning and memory in multimodal models.

Findings

01. Models struggle with coordinate grounding and long-term memory.

02. Text-based bookkeeping outperforms visual memory in current models.

03. The benchmark reveals a sharp performance drop from temporal reasoning to belief maintenance.

Abstract

Multimodal large language models (MLLMs) have advanced static visual-spatial reasoning, yet they often fail to preserve long-horizon spatial coherence in embodied settings where beliefs must be continuously revised from egocentric observations under environmental change. We introduce SpaMEM (Spatial Memory from Action Sequences), a large-scale diagnostic benchmark that isolates the mechanics of spatial belief evolution via action-conditioned scene transformations (spawn, place, remove) over long interaction horizons. SpaMEM is built on a physically grounded dataset with 10,601,392 high-fidelity images across four modalities (RGB, depth, instance, semantic segmentation), collected from 25,000+ interaction sequences in 1,000 procedurally generated houses. We formalize embodied spatial reasoning as a three-level hierarchy with 15 diagnostic tasks: Level 1 measures atomic spatial…
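To make the idea of action-conditioned belief evolution concrete, the following is a minimal Python sketch of text-style bookkeeping over a sequence of spawn, place, and remove actions. It is not the benchmark's code or data format: the `Action` and `SpatialBelief` classes, the object names, the coordinate convention, and the exact semantics assigned to each action are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Hypothetical 3D position (x, y, z); the benchmark's coordinate frame is not specified here.
Position = Tuple[float, float, float]


@dataclass
class Action:
    kind: str                            # assumed to be "spawn", "place", or "remove"
    obj: str                             # hypothetical object identifier
    position: Optional[Position] = None  # ignored for "remove"


@dataclass
class SpatialBelief:
    """Tracks where each known object is believed to be."""
    objects: Dict[str, Position] = field(default_factory=dict)

    def apply(self, action: Action) -> None:
        if action.kind == "spawn":
            # Assumed semantics: a new object appears at a given position.
            if action.position is None:
                raise ValueError("spawn requires a position")
            self.objects[action.obj] = action.position
        elif action.kind == "place":
            # Assumed semantics: an already-known object is moved.
            if action.obj not in self.objects:
                raise KeyError(f"cannot place unknown object: {action.obj}")
            if action.position is None:
                raise ValueError("place requires a position")
            self.objects[action.obj] = action.position
        elif action.kind == "remove":
            # Assumed semantics: the object leaves the scene.
            self.objects.pop(action.obj, None)
        else:
            raise ValueError(f"unknown action kind: {action.kind}")


def replay(actions: List[Action]) -> SpatialBelief:
    """Fold a sequence of actions into a final belief state."""
    belief = SpatialBelief()
    for action in actions:
        belief.apply(action)
    return belief


# Illustrative sequence with made-up objects and coordinates.
actions = [
    Action("spawn", "mug", (0.0, 0.0, 0.0)),
    Action("spawn", "book", (1.0, 0.0, 0.0)),
    Action("place", "mug", (2.0, 0.5, 1.0)),
    Action("remove", "book"),
]
final = replay(actions)
```

After replaying the sequence, `final.objects` holds only the mug at its last placed position. An explicit record like this is trivial to keep consistent over long horizons, which is the kind of bookkeeping the benchmark asks models to perform from egocentric observations rather than from a symbolic action log.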
