Bootstrap Dynamic-Aware 3D Visual Representation for Scalable Robot Learning

Qiwei Liang; Boyang Cai; Minghao Lai; Sitong Zhuang; Tao Lin; Yan Qin; Yixuan Ye; Jiaming Liang; Renjing Xu

arXiv:2512.00074 · cs.RO · March 11, 2026

TL;DR

AFRO introduces a self-supervised, dynamics-aware 3D visual representation learning framework that enhances robotic manipulation success rates without relying on explicit geometric reconstruction or action supervision.

Contribution

The paper proposes a diffusion-based, self-supervised approach that models causal dynamics in 3D visual features and improves robotic manipulation performance.

Findings

01 Outperforms existing pre-training methods on multiple tasks

02 Scales effectively with data volume and task complexity

03 Learns semantically rich, discriminative features

Abstract

Despite strong results on recognition and segmentation, current 3D visual pre-training methods often underperform on robotic manipulation. We attribute this gap to two factors: the lack of state-action-state dynamics modeling and the unnecessary redundancy of explicit geometric reconstruction. We introduce AFRO, a self-supervised framework that learns dynamics-aware 3D representations without action or reconstruction supervision. AFRO casts state prediction as a generative diffusion process and jointly models forward and inverse dynamics in a shared latent space to capture causal transition structure. To prevent feature leakage in action learning, we employ feature differencing and inverse-consistency supervision, improving the quality and stability of visual features. When combined with Diffusion Policy, AFRO substantially increases manipulation success rates across 16 simulated and 4…
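
As a rough illustration of the idea in the abstract, the sketch below pairs an inverse dynamics model that sees only the feature difference with a forward model that predicts the next latent state. It is not the authors' implementation: the linear layers, the dimensions, the helper names (`infer_latent_action`, `predict_next`, `dynamics_losses`), and the cycle-style reading of "inverse-consistency" are all assumptions made for illustration, and a deterministic linear predictor stands in for the diffusion process the paper uses.

```python
import random

def linear(weights, x):
    """Dense layer without bias: y = W x, using plain Python lists."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def random_matrix(rows, cols, rng):
    """Small random weights; a stand-in for learned parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def infer_latent_action(w_inv, z_t, z_next):
    """Inverse dynamics on the feature difference only (assumed form).

    Because the model never sees z_next directly, the latent action
    cannot simply copy the target state into the forward model.
    """
    delta = [b - a for a, b in zip(z_t, z_next)]
    return linear(w_inv, delta)

def predict_next(w_fwd, z_t, action):
    """Forward dynamics: next latent state from state and latent action.

    Deterministic stand-in for the paper's diffusion-based predictor.
    """
    return linear(w_fwd, z_t + action)

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def dynamics_losses(w_inv, w_fwd, z_t, z_next):
    """Return (forward loss, hypothetical inverse-consistency loss)."""
    a_fwd = infer_latent_action(w_inv, z_t, z_next)
    a_bwd = infer_latent_action(w_inv, z_next, z_t)
    z_pred = predict_next(w_fwd, z_t, a_fwd)
    # Assumed consistency term: applying the reversed latent action to
    # the prediction should bring the state back to z_t.
    z_back = predict_next(w_fwd, z_pred, a_bwd)
    return mse(z_pred, z_next), mse(z_back, z_t)

rng = random.Random(0)
FEATURE_DIM, ACTION_DIM = 8, 3  # illustrative sizes, not from the paper
w_inv = random_matrix(ACTION_DIM, FEATURE_DIM, rng)
w_fwd = random_matrix(FEATURE_DIM, FEATURE_DIM + ACTION_DIM, rng)
z_t = [rng.gauss(0, 1) for _ in range(FEATURE_DIM)]
z_next = [rng.gauss(0, 1) for _ in range(FEATURE_DIM)]
forward_loss, consistency_loss = dynamics_losses(w_inv, w_fwd, z_t, z_next)
```

In this toy setup, two identical states yield a zero latent action, which is the property that feeding only the difference is meant to provide. In a real pipeline the features would come from a 3D encoder and both losses would be minimized jointly.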

Taxonomy

Topics: Robot Manipulation and Learning · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition