Spatial-Aware VLA Pretraining through Visual-Physical Alignment from Human Videos

Yicheng Feng; Wanpeng Zhang; Ye Wang; Hao Luo; Haoqi Yuan; Sipeng Zheng; Zongqing Lu

arXiv:2512.13080 · cs.RO · December 16, 2025

TL;DR

This paper introduces a pretraining approach for vision-language-action models that explicitly aligns 2D visual inputs with 3D physical space using human demonstration videos, improving robot policy robustness.

Contribution

The paper proposes a Spatial-Aware VLA Pretraining paradigm with a 3D visual encoder, which strengthens 3D spatial understanding in robot learning from large-scale human videos.

Findings

01. Enhanced 3D spatial reasoning in robot policies
02. Improved grounding between 2D vision and 3D actions
03. Significant performance gains in downstream tasks

Abstract

Vision-Language-Action (VLA) models provide a promising paradigm for robot learning by integrating visual perception with language-guided policy learning. However, most existing approaches rely on 2D visual inputs to perform actions in 3D physical environments, creating a significant gap between perception and action grounding. To bridge this gap, we propose a Spatial-Aware VLA Pretraining paradigm that performs explicit alignment between visual space and physical space during pretraining, enabling models to acquire 3D spatial understanding before robot policy learning. Starting from pretrained vision-language models, we leverage large-scale human demonstration videos to extract 3D visual and 3D action annotations, forming a new source of supervision that aligns 2D visual observations with 3D spatial reasoning. We instantiate this paradigm with VIPA-VLA, a dual-encoder architecture that…
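
The gap between 2D visual inputs and 3D physical space that the abstract describes can be illustrated with standard pinhole camera geometry. The sketch below is not the paper's method or code; it only shows, under assumed camera intrinsics, that two 3D points on the same viewing ray land on the same pixel, so an image alone does not determine where an action target sits in space. The function names (`project`, `back_project`) and the intrinsic values (`FX`, `FY`, `CX`, `CY`) are hypothetical and chosen for illustration.

```python
import math
from typing import Tuple

Point3D = Tuple[float, float, float]  # (x, y, z) in the camera frame, metres
Pixel = Tuple[float, float]           # (u, v) in pixels

# Hypothetical pinhole intrinsics (focal lengths and principal point, in pixels).
FX, FY = 500.0, 500.0
CX, CY = 320.0, 240.0


def project(point: Point3D) -> Pixel:
    """Project a camera-frame 3D point onto the image plane (pinhole model)."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    return (FX * x / z + CX, FY * y / z + CY)


def back_project(pixel: Pixel, depth: float) -> Point3D:
    """Recover the 3D point for a pixel, which is only possible once depth is known."""
    u, v = pixel
    return ((u - CX) * depth / FX, (v - CY) * depth / FY, depth)


# Two points on the same viewing ray: the second is twice as far away.
near: Point3D = (0.10, 0.05, 0.5)
far: Point3D = (0.20, 0.10, 1.0)

pix_near = project(near)
pix_far = project(far)

# Both points map to (approximately) the same pixel: depth is lost in 2D.
same_pixel = all(math.isclose(a, b) for a, b in zip(pix_near, pix_far))

# With a depth value supplied, the original 3D location is recoverable.
recovered = back_project(pix_near, depth=near[2])
```

Here `same_pixel` evaluates to true, while `back_project` returns `near` or `far` depending on which depth is supplied. This is the kind of ambiguity that 3D supervision during pretraining is meant to resolve.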

Taxonomy

Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Robot Manipulation and Learning