Spatial Memory for Out-of-Vision Manipulation in Vision-Language-Action
Pengteng Li, Weiyu Guo, He Zhang, Tiefu Cai, Xiao He, Yandong Guo, Hui Xiong

TL;DR
SOMA is a spatial memory framework that enhances vision-language-action models by enabling reasoning about objects outside the current visual view, improving manipulation success in real-world tasks.
Contribution
The paper introduces SOMA, a novel spatial memory system that constructs and maintains persistent spatial representations for out-of-vision manipulation in VLA models.
Findings
Improves success rates on out-of-vision manipulation tasks.
Enables faster target localization and near one-shot grasping.
Validates effectiveness in both real-world and simulated environments.
Abstract
We introduce SOMA, the Spatial Memory framework for Out-of-Vision Manipulation in Vision-Language-Action (VLA) models. Most existing VLAs implicitly assume that task-relevant objects are always visible, leading to brittle and reactive behaviors when targets fall outside the camera's field of view. SOMA addresses this limitation by equipping VLAs with a persistent spatial memory constructed from multi-view observations acquired via a movable head camera, enabling reasoning beyond the current visual frustum. The framework consists of three components: Spatial Memory Construction, which aggregates angular-wise observations into a unified spatial-semantic representation through scanning; Dynamic Memory Refinement, which maintains global consistency over time; and Contextual Memory Retrieval, which activates instruction-relevant spatial cues during manipulation. We evaluate SOMA on five…
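The three components named in the abstract can be pictured with a toy sketch. This is not the authors' implementation: the class names, the `fov_deg` parameter, the label-matching retrieval, and the rule for dropping stale entries are all assumptions made purely for illustration. Each object is stored with the coarse head-camera pan angle at which it was last seen.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class MemoryEntry:
    label: str       # semantic label of an observed object (hypothetical)
    pan_deg: float   # head-camera pan angle at which it was last seen
    timestamp: int   # step at which the observation was made


@dataclass
class SpatialMemory:
    """Toy angular memory; an illustrative assumption, not SOMA itself."""
    entries: Dict[str, MemoryEntry] = field(default_factory=dict)

    def construct(self, scan: List[Tuple[float, List[str]]]) -> None:
        # Construction: aggregate one scanning sweep of (pan angle, labels).
        for pan_deg, labels in scan:
            for label in labels:
                self.entries[label] = MemoryEntry(label, pan_deg, 0)

    def refine(self, pan_deg: float, labels: List[str],
               timestamp: int, fov_deg: float = 60.0) -> None:
        # Refinement: drop entries that should be visible at this pan angle
        # but were not observed, then refresh the ones that were observed.
        half_fov = fov_deg / 2.0
        seen = set(labels)
        for label in list(self.entries):
            entry = self.entries[label]
            if abs(entry.pan_deg - pan_deg) <= half_fov and label not in seen:
                del self.entries[label]
        for label in labels:
            self.entries[label] = MemoryEntry(label, pan_deg, timestamp)

    def retrieve(self, instruction: str) -> Optional[MemoryEntry]:
        # Retrieval: return the most recent entry whose label is a word
        # in the instruction, or None if nothing matches.
        words = instruction.lower().split()
        matches = [e for e in self.entries.values() if e.label in words]
        return max(matches, key=lambda e: e.timestamp) if matches else None


memory = SpatialMemory()
memory.construct([(-60.0, ["mug"]), (0.0, ["plate"]), (60.0, ["sponge"])])
target = memory.retrieve("pick up the sponge")
if target is not None:
    print(f"{target.label} last seen at pan {target.pan_deg} degrees")
```

In this sketch the sponge is never in view when the camera faces forward, yet the memory still supplies the direction in which to look for it, which is the kind of out-of-view reasoning the abstract describes.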
