4D-VLA: Spatiotemporal Vision-Language-Action Pretraining with Cross-Scene Calibration

Jiahui Zhang; Yurui Chen; Yueming Xu; Ze Huang; Yanpeng Zhou; Yu-Jie Yuan; Xinyue Cai; Guowei Huang; Xingyue Quan; Hang Xu; Li Zhang

arXiv:2506.22242 · cs.CV · November 19, 2025

TL;DR

The paper introduces 4D-VLA, a pretraining approach that integrates 4D spatiotemporal information and memory bank sampling to improve robotic vision-language-action tasks, demonstrating superior performance in simulation and real-world tests.

Contribution

It proposes a novel 4D information integration method with memory bank sampling to enhance pretraining efficiency and spatial reasoning in robotic vision-language models.

Findings

01. Significant success rate improvement over OpenVLA.

02. Enhanced spatial perception and generalization demonstrated.

03. Outperforms existing methods in simulated and real-world experiments.

Abstract

Leveraging diverse robotic data for pretraining remains a critical challenge. Existing methods typically model the dataset's action distribution using simple observations as inputs. However, these inputs are often incomplete, resulting in a dispersed conditional action distribution, an issue we refer to as coordinate system chaos and state chaos. This inconsistency significantly hampers pretraining efficiency. To address this, we propose 4D-VLA, a novel approach that effectively integrates 4D information into the input to mitigate these sources of chaos. Our model introduces depth and temporal information into visual features with sequential RGB-D inputs, aligning the coordinate systems of the robot and the scene. This alignment endows the model with strong spatiotemporal reasoning capabilities while minimizing training overhead. Additionally, we introduce memory bank sampling, a frame…
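
As a rough illustration of what aligning the coordinate systems of the robot and the scene with RGB-D inputs can involve, the sketch below lifts a pixel with a depth value into 3D using a standard pinhole camera model and then moves it into a shared frame with a rigid transform. This is a generic textbook construction, not the paper's encoder: the function names, the intrinsics (`fx`, `fy`, `cx`, `cy`), and the extrinsics (`rotation`, `translation`) are all assumptions made for the example.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def backproject_pixel(u: float, v: float, depth: float,
                      fx: float, fy: float, cx: float, cy: float) -> Vec3:
    """Lift pixel (u, v) with metric depth into the camera frame (pinhole model)."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def camera_to_world(p_cam: Vec3,
                    rotation: Sequence[Sequence[float]],
                    translation: Sequence[float]) -> Vec3:
    """Apply the rigid transform p_world = R * p_cam + t (hypothetical extrinsics)."""
    return tuple(
        sum(rotation[i][j] * p_cam[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


# Illustrative values only: identity rotation, camera origin offset by 1 m along z.
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 1.0]

# The principal point at 2 m depth lies on the optical axis of the camera.
p_cam = backproject_pixel(u=320.0, v=240.0, depth=2.0,
                          fx=500.0, fy=500.0, cx=320.0, cy=240.0)
p_world = camera_to_world(p_cam, R, t)
```

Expressing every observation in one shared frame is one plausible way to reduce the ambiguity the abstract calls coordinate system chaos, since the same action target then has the same coordinates regardless of where the camera was mounted.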

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Vision and Imaging · Robotics and Sensor-Based Localization