TL;DR
CLAMP is a 3D pre-training framework for robotic manipulation that applies contrastive learning to point-cloud observations and robot actions. It is reported to improve learning efficiency and policy performance on six simulated and five real-world tasks.
Contribution
The paper presents a novel 3D pre-training approach with contrastive learning and diffusion policy initialization for robotic manipulation.
Findings
Outperforms state-of-the-art baselines on six simulated tasks.
Achieves superior results on five real-world tasks.
Enhances learning efficiency and policy performance.
Abstract
Leveraging pre-trained 2D image representations in behavior cloning policies has achieved great success and has become a standard approach for robotic manipulation. However, such representations fail to capture the 3D spatial information about objects and scenes that is essential for precise manipulation. In this work, we introduce Contrastive Learning for 3D Multi-View Action-Conditioned Robotic Manipulation Pretraining (CLAMP), a novel 3D pre-training framework that utilizes point clouds and robot actions. From the merged point cloud computed from RGB-D images and camera extrinsics, we re-render multi-view four-channel image observations with depth and 3D coordinates, including dynamic wrist views, to provide clearer views of target objects for high-precision manipulation tasks. The pre-trained encoders learn to associate the 3D geometric and positional information of objects with…
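The re-rendering step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes a pinhole camera with a known intrinsics matrix K (the abstract mentions only extrinsics), reads "four-channel" as one depth channel plus three world-coordinate channels, and uses hypothetical function names (backproject, merge_views, render_four_channel).

```python
import numpy as np


def backproject(depth, K, cam_to_world):
    """Lift a depth map (H, W) to world-frame points (N, 3). Assumes a pinhole camera."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    valid = depth > 0  # zero depth is treated as "no measurement"
    z = depth[valid]
    x = (u[valid] - K[0, 2]) * z / K[0, 0]
    y = (v[valid] - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)  # homogeneous, (N, 4)
    return (pts_cam @ cam_to_world.T)[:, :3]


def merge_views(depths, K, cams_to_world):
    """Merge the back-projected points of several cameras into one cloud."""
    return np.concatenate(
        [backproject(d, K, T) for d, T in zip(depths, cams_to_world)], axis=0
    )


def render_four_channel(points, K, world_to_cam, h, w):
    """Render a merged cloud into a virtual view as an (H, W, 4) image.

    Channel 0 is depth in the virtual camera; channels 1-3 are world XYZ.
    Pixels that no point projects to stay zero.
    """
    homo = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    pts_cam = (homo @ world_to_cam.T)[:, :3]
    front = pts_cam[:, 2] > 1e-6  # keep points in front of the camera
    pts_cam, pts_world = pts_cam[front], points[front]
    z = pts_cam[:, 2]
    u = np.round(K[0, 0] * pts_cam[:, 0] / z + K[0, 2]).astype(int)
    v = np.round(K[1, 1] * pts_cam[:, 1] / z + K[1, 2]).astype(int)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    u, v, z, pts_world = u[inside], v[inside], z[inside], pts_world[inside]

    # Z-buffer: sort by pixel, then by depth, and keep the nearest point per pixel.
    pix = v * w + u
    order = np.lexsort((z, pix))  # last key (pix) is the primary sort key
    _, first = np.unique(pix[order], return_index=True)
    keep = order[first]

    image = np.zeros((h * w, 4), dtype=np.float32)
    image[pix[keep], 0] = z[keep]
    image[pix[keep], 1:] = pts_world[keep]
    return image.reshape(h, w, 4)
```

Under these assumptions, a dynamic wrist view would amount to calling render_four_channel with a world_to_cam pose that follows the end effector at each timestep. How CLAMP actually chooses and renders its views is not specified in the excerpt above.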
