Learning Structural Latent Points for Efficient Visual Representations in Robotic Manipulation
Yicheng Jiang, Jiaxu Wang, Junhao He, Zesen Gan, Junhao Li, Qiang Zhang, Jingkai Sun, Jiahang Cao, Mingyuan Sun, Xiangyu Yue, Qiming Shao

TL;DR
This paper introduces a hybrid 3D representation learning framework combining implicit and explicit features via a latent point autoencoder, enhancing efficiency and robustness in robotic manipulation tasks.
Contribution
The authors propose a novel pretraining method that learns structural latent points, improving 3D representations for embodied perception and manipulation.
Findings
Improved task success rates on RLBench, ManiSkill2, and a real-robot platform.
Enhanced sample efficiency and robustness to viewpoint and scene variations.
Ablation studies confirm the importance of each component in the framework.
Abstract
Current 3D-aware pretraining methods for embodied perception and manipulation are largely built on differentiable rendering frameworks, producing either fully implicit neural fields or fully explicit geometric primitives. Implicit representations, while expressive, lack explicit structural cues, whereas explicit ones preserve geometry but suffer from resolution limits and weak generalization. To address these limitations, we propose a novel pretraining framework that learns a hybrid representation: structural latent points. Specifically, we insert a point-wise latent variational autoencoder into the latent space of a point-cloud autoencoder, jointly regularizing point-wise features and coordinates toward a Gaussian prior. The resulting compact latent preserves coarse structural tendencies, which do not encode precise geometry but capture richer rough shape and semantic information,…
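
To make the phrase "jointly regularizing point-wise features and coordinates toward a Gaussian prior" concrete, here is a minimal sketch of a per-point KL term against a standard normal prior, as used in ordinary variational autoencoders. It is not the authors' implementation: the diagonal-Gaussian posterior, the dictionary layout, the names `pointwise_kl`, `beta_xyz`, and `beta_feat`, and the toy dimensions are all assumptions made for illustration.

```python
import math
import random


def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, logvar))


def reparameterize(mu, logvar, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1) (reparameterization trick)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]


def pointwise_kl(points, beta_xyz=1.0, beta_feat=1.0):
    """Mean KL over latent points.

    Each point carries a posterior over its 3-D coordinate and over its
    feature vector; both are pulled toward N(0, I). The two weights are
    hypothetical knobs, not values taken from the paper.
    """
    total = 0.0
    for p in points:
        total += beta_xyz * kl_to_standard_normal(p["xyz_mu"], p["xyz_logvar"])
        total += beta_feat * kl_to_standard_normal(p["feat_mu"], p["feat_logvar"])
    return total / len(points)


# Two toy latent points: a 3-D coordinate plus a 4-D feature each (assumed sizes).
points = [
    {"xyz_mu": [0.0, 0.0, 0.0], "xyz_logvar": [0.0, 0.0, 0.0],
     "feat_mu": [0.0] * 4, "feat_logvar": [0.0] * 4},
    {"xyz_mu": [1.0, 0.0, 0.0], "xyz_logvar": [0.0, 0.0, 0.0],
     "feat_mu": [0.0] * 4, "feat_logvar": [0.0] * 4},
]

rng = random.Random(0)
kl = pointwise_kl(points)
sample = reparameterize(points[1]["xyz_mu"], points[1]["xyz_logvar"], rng)
```

In this toy setup the first point already matches the prior and contributes zero, while the second point is penalized only for its shifted coordinate mean. In a real pipeline this term would be added to a reconstruction loss from the surrounding point-cloud autoencoder, which is not sketched here.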
