4D Visual Pre-training for Robot Learning

Chengkai Hou; Yanjie Ze; Yankai Fu; Zeyu Gao; Songbo Hu; Yue Yu; Shanghang Zhang; Huazhe Xu

arXiv:2508.17230 · cs.CV · September 9, 2025

TL;DR

The paper introduces FVP, a 4D visual pre-training framework that improves 3D robot learning by predicting future point clouds, leading to significant performance gains across multiple manipulation tasks.

Contribution

FVP is a novel 4D pre-training approach that models next-point-cloud prediction with diffusion models, enhancing 3D representations for robot learning.
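To make the idea of next-point-cloud prediction with a diffusion model concrete, the following is a minimal sketch of one training-loss evaluation. It is not the authors' implementation. It assumes a standard DDPM-style noise-prediction objective, a linear noise schedule, a toy PointNet-like encoder, and a linear denoiser. All function names, shapes, and hyperparameters here (`encode_point_cloud`, `predict_noise`, 128 points, a 16-dimensional latent) are hypothetical choices made for illustration.

```python
import numpy as np

def make_noise_schedule(num_steps=100, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (an assumption); returns cumulative alpha products."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def encode_point_cloud(points, enc_w):
    """Hypothetical stand-in for a 3D encoder: per-point linear map, ReLU, max-pool."""
    features = np.maximum(points @ enc_w, 0.0)   # (N, D)
    return features.max(axis=0)                  # (D,)

def add_noise(future_points, t, alphas_cumprod, rng):
    """Forward diffusion: blend the future point cloud with Gaussian noise at step t."""
    noise = rng.standard_normal(future_points.shape)
    a = alphas_cumprod[t]
    noisy = np.sqrt(a) * future_points + np.sqrt(1.0 - a) * noise
    return noisy, noise

def predict_noise(noisy_points, latent, t, num_steps, den_w):
    """Toy linear denoiser conditioned on the current-frame latent and the timestep."""
    n = noisy_points.shape[0]
    cond = np.tile(latent, (n, 1))               # (N, D)
    t_feat = np.full((n, 1), t / num_steps)      # (N, 1)
    x = np.concatenate([noisy_points, cond, t_feat], axis=1)
    return x @ den_w                             # (N, 3)

def pretraining_loss(current_points, future_points, t, alphas_cumprod, enc_w, den_w, rng):
    """Mean squared error between predicted and true noise on the next point cloud."""
    latent = encode_point_cloud(current_points, enc_w)
    noisy, noise = add_noise(future_points, t, alphas_cumprod, rng)
    pred = predict_noise(noisy, latent, t, len(alphas_cumprod), den_w)
    return float(np.mean((pred - noise) ** 2))

# Synthetic data: the "future" cloud is a slightly perturbed copy of the current one.
rng = np.random.default_rng(0)
current = rng.standard_normal((128, 3))
future = current + 0.05 * rng.standard_normal((128, 3))
enc_w = 0.1 * rng.standard_normal((3, 16))
den_w = 0.1 * rng.standard_normal((3 + 16 + 1, 3))
alphas_cumprod = make_noise_schedule()
loss = pretraining_loss(current, future, 50, alphas_cumprod, enc_w, den_w, rng)
print(loss)
```

In this sketch the encoder sees only the current frame, so the loss can decrease only if the latent carries information useful for predicting the next frame. That is the intuition behind using future prediction as a pre-training signal. In practice the weights would be updated by gradient descent in an autodiff framework, which is omitted here.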

Findings

01. FVP boosts average success rate of DP3 by 28% across twelve tasks.

02. FVP achieves state-of-the-art results in imitation learning.

03. FVP improves performance of the RDT-1B robotic model.

Abstract

General visual representations learned from web-scale datasets have achieved great success in robotics in recent years, enabling data-efficient robot learning on manipulation tasks. Yet these pre-trained representations are mostly learned from 2D images, neglecting the inherent 3D nature of the world. Moreover, because large-scale 3D data is scarce, it is still hard to extract a universal 3D representation from web datasets. As an alternative, we seek a general visual pre-training framework that can improve all 3D representations. Our framework, called FVP, is a novel 4D Visual Pre-training framework for real-world robot learning. FVP frames the visual pre-training objective as a next-point-cloud-prediction problem, models the predictor as a diffusion model, and pre-trains the model directly on larger public datasets. Across twelve real-world manipulation tasks,…
