ST4VLA: Spatially Guided Training for Vision-Language-Action Models
Jinhui Ye, Fangjing Wang, Ning Gao, Junqiu Yu, Yangkun Zhu, Bin Wang, Jinyu Zhang, Weiyang Jin, Yanwei Fu, Feng Zheng, Yilun Chen, and Jiangmiao Pang

TL;DR
ST4VLA introduces a dual-system framework that enhances vision-language-action models with spatial priors, significantly improving robot task performance, generalization, and robustness through spatial grounding and guided training.
Contribution
The paper proposes a novel Spatial Guided Training approach for vision-language-action models, combining spatial grounding pre-training and spatially guided action post-training to improve embodied task learning.
Findings
Performance improved from 66.1 to 84.6 on Google Robot
Achieved 73.2 on WidowX Robot, surpassing previous methods
Demonstrated better generalization and robustness in real-world tasks
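As a quick check on the size of the Google Robot gain reported above, the two figures can be turned into an absolute and a relative improvement. This is only an illustrative calculation: the variable names are hypothetical, and the source does not state the metric's unit, so the scores are treated as plain numbers on a shared scale.

```python
# Scores as reported in the Findings list (unit not stated in the source).
baseline_score = 66.1   # Google Robot, before
st4vla_score = 84.6     # Google Robot, after

# Absolute gain in score points and relative gain over the baseline.
abs_gain = st4vla_score - baseline_score
rel_gain = abs_gain / baseline_score * 100.0

print(f"absolute gain: {abs_gain:.1f} points")
print(f"relative gain: {rel_gain:.1f}% over baseline")
```

The relative figure does not depend on whether the scores are percentages or fractions, since the scale cancels in the ratio.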
Abstract
Large vision-language models (VLMs) excel at multimodal understanding but fall short when extended to embodied tasks, where instructions must be transformed into low-level motor actions. We introduce ST4VLA, a dual-system Vision-Language-Action framework that leverages Spatial Guided Training to align action learning with spatial priors in VLMs. ST4VLA includes two stages: (i) spatial grounding pre-training, which equips the VLM with transferable priors via scalable point, box, and trajectory prediction from both web-scale and robot-specific data, and (ii) spatially guided action post-training, which encourages the model to produce richer spatial priors to guide action generation via spatial prompting. This design preserves spatial grounding during policy learning and promotes consistent optimization across spatial and action objectives. Empirically, ST4VLA achieves substantial…
Taxonomy
Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Social Robot Interaction and HRI
