ST4VLA: Spatially Guided Training for Vision-Language-Action Models

Jinhui Ye, Fangjing Wang, Ning Gao, Junqiu Yu, Yangkun Zhu, Bin Wang, Jinyu Zhang, Weiyang Jin, Yanwei Fu, Feng Zheng, Yilun Chen, and Jiangmiao Pang

arXiv:2602.10109 · cs.RO · February 11, 2026

Open Access

TL;DR

ST4VLA is a dual-system framework that adds spatial priors to vision-language-action models through spatial grounding pre-training and spatially guided action post-training, improving robot task performance, generalization, and robustness.

Contribution

The paper proposes a novel Spatial Guided Training approach for vision-language-action models, combining spatial grounding pre-training and spatially guided action post-training to improve embodied task learning.

Findings

01. Performance improved from 66.1 to 84.6 on Google Robot
02. Achieved 73.2 on WidowX Robot, surpassing previous methods
03. Demonstrated better generalization and robustness in real-world tasks

Abstract

Large vision-language models (VLMs) excel at multimodal understanding but fall short when extended to embodied tasks, where instructions must be transformed into low-level motor actions. We introduce ST4VLA, a dual-system Vision-Language-Action framework that leverages Spatial Guided Training to align action learning with spatial priors in VLMs. ST4VLA includes two stages: (i) spatial grounding pre-training, which equips the VLM with transferable priors via scalable point, box, and trajectory prediction from both web-scale and robot-specific data, and (ii) spatially guided action post-training, which encourages the model to produce richer spatial priors to guide action generation via spatial prompting. This design preserves spatial grounding during policy learning and promotes consistent optimization across spatial and action objectives. Empirically, ST4VLA achieves substantial…
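To make the two stages concrete, here is a minimal Python sketch of how the data for each stage might be laid out. It is an illustration only: the abstract does not specify any format, so the `SpatialTarget` class, the tag-style serialization, the pixel-coordinate convention, and the helper names `stage1_example` and `stage2_prompt` are all assumptions, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Point = Tuple[int, int]  # (x, y) pixel coordinates; this convention is an assumption


@dataclass
class SpatialTarget:
    """Hypothetical container for one spatial-grounding label."""
    kind: str            # "point", "box", or "trajectory"
    coords: List[Point]  # 1 point, 2 box corners, or N waypoints

    def to_text(self) -> str:
        # Serialize as plain text so a VLM could predict it token by token.
        # The tag format is invented for illustration.
        body = ", ".join(f"({x}, {y})" for x, y in self.coords)
        return f"<{self.kind}>{body}</{self.kind}>"


def stage1_example(question: str, target: SpatialTarget) -> Dict[str, str]:
    """Stage (i): a grounding sample pairs a question with a spatial answer."""
    return {"prompt": question, "answer": target.to_text()}


def stage2_prompt(instruction: str, spatial_query: str) -> str:
    """Stage (ii): append a spatial query to the task instruction, so the
    model is asked for a spatial prior alongside action generation."""
    return f"{instruction}\n{spatial_query}"


box = SpatialTarget("box", [(10, 20), (110, 220)])
sample = stage1_example("Where is the red cup?", box)
prompt = stage2_prompt("Pick up the red cup.", "Point to the object to grasp.")
print(sample["answer"])
print(prompt)
```

The same text-serialized target type covers points, boxes, and trajectories, which is one simple way a single VLM output head could serve all three prediction tasks named in the abstract; the paper may handle this differently.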

Taxonomy

Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Social Robot Interaction and HRI