LASAR: Towards Spatio-temporal Reasoning with Latent Cognitive Map
Jinzhou Tang, Sidi Liu, Waikit Xiu, Weixing Chen, Keze Wang

TL;DR
LASAR introduces a dual-memory architecture with a contrastive learning objective to enhance spatio-temporal reasoning and internal spatial modeling in embodied AI agents, improving zero-shot generalization.
Contribution
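The paper proposes LASAR, an architecture with a dual-memory system that maintains both episodic experiences and a semantic cognitive map, together with Spatio-temporal Contextual Representation Learning (ST-CRL), a contrastive training objective intended to make the agent encode spatial relationships from its experiences.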
Findings
Achieves 2-3.5% improvements in zero-shot generalization on VLN-CE and VSI-Bench.
Demonstrates high self-consistency of the cognitive map.
Introduces a contrastive learning method leveraging spatio-temporal cues.
Abstract
A fundamental challenge in embodied AI is verifying if agents build internal models of spatial structure or merely learn to mimic task-specific expert trajectories. This is critical as foundational approaches rooted in action-centric tasks (e.g., VLN) and reasoning-centric tasks (e.g., EQA) often share a common limitation: they lack a learning signal that forces them to encode fine-grained spatial relationships (like topology or distance) over long-range, fragmented experiences. To address this, we first propose LASAR, an architecture featuring a dual-memory system designed to maintain both episodic experiences and a semantic cognitive map. We then introduce Spatio-temporal Contextual Representation Learning (ST-CRL), a contrastive objective designed to train this architecture. ST-CRL leverages spatio-temporal cues from cognitive queries generated through annotated spatio-temporal…
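The abstract describes ST-CRL only as a contrastive objective, and the text is cut off before any loss is given. As a rough illustration of that general family, and not the authors' formulation, the sketch below implements a generic InfoNCE-style loss in NumPy. The function name `info_nce_loss`, the `temperature` value, and the idea of pairing "query" embeddings with "map" embeddings row by row are all assumptions made for the example.

```python
import numpy as np


def info_nce_loss(queries, keys, temperature=0.1):
    """Generic InfoNCE loss (illustrative; not the ST-CRL loss from the paper).

    Row i of `queries` is treated as the positive match for row i of `keys`;
    every other row of `keys` acts as a negative for that query.
    """
    # L2-normalise so that dot products are cosine similarities.
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)

    # Pairwise similarity matrix, shape (N, N), scaled by the temperature.
    logits = (q @ k.T) / temperature

    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # The positive for row i sits on the diagonal (column i).
    return float(-np.mean(np.diag(log_probs)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical embeddings: 8 cognitive queries and 8 map entries, 16-dim.
    query_emb = rng.normal(size=(8, 16))
    matched_map_emb = query_emb + 0.01 * rng.normal(size=(8, 16))
    unrelated_map_emb = rng.normal(size=(8, 16))

    print("loss with matched pairs:  ", info_nce_loss(query_emb, matched_map_emb))
    print("loss with unrelated pairs:", info_nce_loss(query_emb, unrelated_map_emb))
```

When each query sits close to its paired key, the loss should come out much lower than when the keys are unrelated, which is the kind of learning signal a contrastive objective supplies. How ST-CRL actually selects positives and negatives from spatio-temporal cues is not recoverable from this summary.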
