Spatial Memory for Out-of-Vision Manipulation in Vision-Language-Action

Pengteng Li; Weiyu Guo; He Zhang; Tiefu Cai; Xiao He; Yandong Guo; Hui Xiong

arXiv:2605.22283 · cs.RO · May 22, 2026

TL;DR

SOMA is a spatial memory framework that enhances vision-language-action models by enabling reasoning about objects outside the current visual view, improving manipulation success in real-world tasks.

Contribution

The paper introduces SOMA, a novel spatial memory system that constructs and maintains persistent spatial representations for out-of-vision manipulation in VLA models.

Findings

01

Improves task success rates in out-of-vision manipulation tasks.

02

Enables faster target localization and near one-shot grasping.

03

Validates effectiveness in both real-world and simulated environments.

Abstract

We introduce SOMA, the Spatial Memory framework for Out-of-Vision Manipulation in Vision-Language-Action (VLA) models. Most existing VLAs implicitly assume that task-relevant objects are always visible, leading to brittle and reactive behaviors when targets fall outside the camera's field of view. SOMA addresses this limitation by equipping VLAs with a persistent spatial memory constructed from multi-view observations acquired via a movable head camera, enabling reasoning beyond the current visual frustum. The framework consists of three components: Spatial Memory Construction, which aggregates angular-wise observations into a unified spatial-semantic representation through scanning; Dynamic Memory Refinement, which maintains global consistency over time; and Contextual Memory Retrieval, which activates instruction-relevant spatial cues during manipulation. We evaluate SOMA on five…
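
To make the three components named in the abstract more concrete, here is a minimal Python sketch of how a persistent spatial memory might be organized. It is not the authors' implementation: the class and method names (`SpatialMemory`, `construct`, `refine`, `retrieve`), the pan-angle keying, the field-of-view threshold, and the keyword-matching retrieval are all assumptions made for illustration, and object detections are passed in as plain label strings rather than produced by a real perception model.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class MemoryEntry:
    label: str        # hypothetical semantic label of a detected object
    pan_deg: float    # head-camera pan angle at which the object was seen
    timestamp: int    # step index of the most recent observation


@dataclass
class SpatialMemory:
    """Toy stand-in for a persistent spatial memory (illustrative only)."""
    entries: Dict[str, MemoryEntry] = field(default_factory=dict)

    def construct(self, scan: Dict[float, List[str]], timestamp: int = 0) -> None:
        # Aggregate per-angle detections from one scan into a single store.
        for pan_deg, labels in scan.items():
            for label in labels:
                self.entries[label] = MemoryEntry(label, pan_deg, timestamp)

    def refine(self, pan_deg: float, visible: List[str],
               timestamp: int, fov_deg: float = 60.0) -> None:
        # Refresh entries for objects seen now.
        for label in visible:
            self.entries[label] = MemoryEntry(label, pan_deg, timestamp)
        # Drop entries that should be inside the current view but are absent.
        half = fov_deg / 2.0
        stale = [k for k, e in self.entries.items()
                 if abs(e.pan_deg - pan_deg) <= half and k not in visible]
        for k in stale:
            del self.entries[k]

    def retrieve(self, instruction: str) -> Optional[MemoryEntry]:
        # Naive keyword match; assumes single-word labels and no punctuation.
        words = instruction.lower().split()
        for label, entry in self.entries.items():
            if label.lower() in words:
                return entry
        return None


# Usage: scan three head-camera angles, then look up an out-of-view target.
memory = SpatialMemory()
memory.construct({-60.0: ["mug"], 0.0: ["plate"], 60.0: ["sponge"]})
target = memory.retrieve("pick up the sponge")
```

In this sketch, `target.pan_deg` tells a controller which way to turn the head camera before grasping, which is the kind of cue that lets a policy act on an object outside its current view. A real system would presumably store richer spatial-semantic features than a label and an angle.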
