SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model

Delin Qu; Haoming Song; Qizhi Chen; Yuanqi Yao; Xinyi Ye; Yan Ding; Zhigang Wang; JiaYuan Gu; Bin Zhao; Dong Wang; Xuelong Li

arXiv:2501.15830 · cs.RO · May 20, 2025

Open Access · 4 Models · 1 Dataset

TL;DR

SpatialVLA introduces novel spatial representations and encoding techniques to enhance robot manipulation policies, enabling zero-shot task execution and strong generalization across diverse environments.

Contribution

The paper proposes Ego3D Position Encoding and Adaptive Action Grids to improve spatial understanding in robot models, enabling better transferability and zero-shot performance.
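As a rough illustration of what injecting egocentric 3D information into image observations could look like, the sketch below back-projects a pixel with a known depth into camera-frame coordinates and then applies a sinusoidal encoding. This is not the paper's implementation: the pinhole camera model, the intrinsics (`fx`, `fy`, `cx`, `cy`), the depth value, the number of frequencies, and both function names are assumptions chosen for illustration.

```python
import math


def backproject(u, v, depth, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) at a given depth
    into camera-frame coordinates (x, y, z)."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def sinusoidal_encoding(point, num_freqs=4):
    """Encode each coordinate with sin/cos at frequencies 1, 2, 4, ...
    Returns a flat list of length 3 * 2 * num_freqs."""
    features = []
    for coord in point:
        for k in range(num_freqs):
            freq = 2.0 ** k
            features.append(math.sin(freq * coord))
            features.append(math.cos(freq * coord))
    return features


# Hypothetical intrinsics and depth, not taken from the paper.
point = backproject(u=400, v=300, depth=1.5,
                    fx=600.0, fy=600.0, cx=320.0, cy=240.0)
embedding = sinusoidal_encoding(point)
```

In a full model, a feature vector like `embedding` would typically be projected to the visual token dimension and combined with the image features; how SpatialVLA does this is not specified in the text above.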

Findings

01. Superior zero-shot task performance in both simulation and the real world.

02. Enhanced in-domain and out-of-distribution generalization.

03. Effective fine-tuning with re-discretized action grids.

Abstract

In this paper, we claim that spatial understanding is the key to robot manipulation, and propose SpatialVLA to explore effective spatial representations for the robot foundation model. Specifically, we introduce Ego3D Position Encoding to inject 3D information into the input observations of the visual-language-action model, and propose Adaptive Action Grids to represent spatial robot movement actions with adaptive discretized action grids, facilitating the learning of generalizable and transferable spatial action knowledge for cross-robot control. SpatialVLA is first pre-trained on top of a vision-language model with 1.1 million real-world robot episodes, to learn a generalist manipulation policy across multiple robot environments and tasks. After pre-training, SpatialVLA is directly applied to perform numerous tasks in a zero-shot manner. The superior results in both simulation and…
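One plausible reading of "adaptive discretized action grids" is that bin boundaries follow the distribution of the action data instead of being uniformly spaced. The sketch below illustrates that idea in one dimension with equal-mass (quantile) bins. It is an assumption made for illustration, not the paper's procedure: the function names, the synthetic Gaussian data, and the bin count are all hypothetical.

```python
import random
from bisect import bisect_right


def fit_quantile_edges(values, num_bins):
    """Return num_bins + 1 edges so that each bin holds roughly
    the same number of samples."""
    ordered = sorted(values)
    last = len(ordered) - 1
    return [ordered[round(i * last / num_bins)] for i in range(num_bins + 1)]


def encode(value, edges):
    """Map a continuous value to a bin index in [0, num_bins - 1],
    clamping values that fall outside the fitted range."""
    idx = bisect_right(edges, value) - 1
    return min(max(idx, 0), len(edges) - 2)


def decode(index, edges):
    """Map a bin index back to the center of its bin."""
    return 0.5 * (edges[index] + edges[index + 1])


random.seed(0)
# Hypothetical data: 1-D translation deltas clustered near zero.
deltas = [random.gauss(0.0, 0.02) for _ in range(10_000)]
edges = fit_quantile_edges(deltas, num_bins=8)

token = encode(0.005, edges)       # continuous action -> discrete token
recovered = decode(token, edges)   # discrete token -> approximate action
```

Because the edges follow the data, bins are narrow where actions are dense (small motions near zero) and wide in the tails. Under this reading, the re-discretization mentioned in the findings would amount to calling `fit_quantile_edges` again on data from the new robot before fine-tuning.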

Code & Models

Models

Datasets

chenglongy/env_codebase
dataset · 74 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Geographic Information Systems Studies · Human Pose and Action Recognition