ST-VLA: Enabling 4D-Aware Spatiotemporal Understanding for General Robot Manipulation

You Wu; Zixuan Chen; Cunxu Ou; Wenxuan Wang; Wenbo Huang; Lin Cao; Yangtao Chen; Weichao Qiu; Xingyue Quan; Jieqi Shi; Jing Huo; Yang Gao

arXiv:2603.13788 · cs.RO · March 17, 2026

TL;DR

ST-VLA is a hierarchical Vision-Language-Action (VLA) framework that uses a unified 3D-4D representation, together with the large-scale ST-Human dataset, to improve spatiotemporal reasoning and robustness in robot manipulation within complex 3D scenes.

Contribution

The paper presents ST-VLA, a hierarchical VLA framework whose unified 3D-4D representation bridges perception and action, and ST-Human, a human manipulation dataset annotated with 2D, 3D, and 4D supervision.

Findings

01 Significantly outperforms state-of-the-art baselines in manipulation tasks.

02 Improves zero-shot success rates by over 30%.

03 Enables online replanning and long-horizon reasoning.

Abstract

Robotic manipulation in open-world environments requires reasoning across semantics, geometry, and long-horizon action dynamics. Existing hierarchical Vision-Language-Action (VLA) frameworks typically use 2D representations to connect high-level reasoning with low-level control, but lack depth awareness and temporal consistency, limiting robustness in complex 3D scenes. We propose ST-VLA, a hierarchical VLA framework using a unified 3D-4D representation to bridge perception and action. ST-VLA converts 2D guidance into 3D trajectories and generates smooth spatial masks that capture 4D spatio-temporal context, providing a stable interface between semantic reasoning and continuous control. To enable effective learning of such representations, we introduce ST-Human, a large-scale human manipulation dataset with 14 tasks and 300k episodes, annotated with 2D, 3D, and 4D supervision via a…
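The abstract's step of converting 2D guidance into 3D trajectories can be illustrated with a minimal sketch. It assumes a standard pinhole camera model with known intrinsics and a metric depth value for each 2D waypoint; the abstract does not say how ST-VLA performs this conversion, so the function names (`backproject`, `lift_trajectory`) and parameters (`fx`, `fy`, `cx`, `cy`) below are hypothetical and not taken from the paper.

```python
from typing import List, Sequence, Tuple

Point2D = Tuple[float, float]
Point3D = Tuple[float, float, float]


def backproject(u: float, v: float, depth: float,
                fx: float, fy: float, cx: float, cy: float) -> Point3D:
    """Lift pixel (u, v) with metric depth to a camera-frame 3D point.

    Assumes an ideal pinhole camera (no lens distortion); this is an
    illustrative assumption, not the method described in the paper.
    """
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def lift_trajectory(waypoints_2d: Sequence[Point2D], depths: Sequence[float],
                    fx: float, fy: float, cx: float, cy: float) -> List[Point3D]:
    """Convert a 2D pixel trajectory into a 3D camera-frame trajectory."""
    if len(waypoints_2d) != len(depths):
        raise ValueError("need exactly one depth value per 2D waypoint")
    return [backproject(u, v, d, fx, fy, cx, cy)
            for (u, v), d in zip(waypoints_2d, depths)]


# Hypothetical intrinsics for a 640x480 camera, with two waypoints.
trajectory_3d = lift_trajectory(
    [(320.0, 240.0), (420.0, 240.0)], [1.0, 2.0],
    fx=500.0, fy=500.0, cx=320.0, cy=240.0,
)
```

A waypoint at the principal point maps onto the optical axis, while one offset by 100 pixels at 2 m depth lands 0.4 m to the side. This shows the depth information that a purely 2D interface discards.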

Taxonomy

Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Reinforcement Learning in Robotics