StemVLA: An Open-Source Vision-Language-Action Model with Future 3D Spatial Geometry Knowledge and 4D Historical Representation
Jiasong Xiao, Yutao She, Kai Li, Yuyang Sha, Ziang Cheng, and Ziang Tong

TL;DR
StemVLA introduces a novel vision-language-action framework that explicitly models future 3D spatial geometry and 4D historical dynamics, significantly enhancing robot manipulation in dynamic environments.
Contribution
It is the first to incorporate explicit future 3D spatial knowledge and 4D spatiotemporal representations into VLA models for improved decision-making.
Findings
Achieves state-of-the-art performance on CALVIN ABC-D benchmark.
Significantly improves long-horizon task success in simulation.
Effectively models future scene geometry and temporal dynamics.
Abstract
Vision-language-action (VLA) models integrate visual observations and language instructions to predict robot actions, demonstrating promising generalization in manipulation tasks. However, most existing approaches primarily rely on direct mappings from 2D visual inputs to action sequences, without explicitly modeling the underlying 3D spatial structure or temporal world dynamics. Such representations may limit spatial reasoning and long-horizon decision-making in dynamic environments. To address this limitation, we propose StemVLA, a novel framework that explicitly incorporates both future-oriented 3D spatial knowledge and historical 4D spatiotemporal representations into action prediction. First, instead of relying solely on observed images, StemVLA forecasts structured 3D future spatial-geometric world knowledge, enabling the model to anticipate upcoming scene geometry and object…
Taxonomy
Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Social Robot Interaction and HRI
