Unsupervised Learning of Visual 3D Keypoints for Control
Boyuan Chen, Pieter Abbeel, Deepak Pathak

TL;DR
This paper introduces an unsupervised method to learn 3D visual keypoints directly from images, improving robotic control by capturing meaningful 3D structures for better policy learning.
Contribution
It presents a novel end-to-end framework that learns 3D geometric keypoints from images without supervision, outperforming existing 2D-based methods in control tasks.
Findings
Outperforms prior state-of-the-art methods in reinforcement learning benchmarks.
Learns meaningful 3D keypoints that capture robot joints and object movements.
Demonstrates the effectiveness of 3D structure learning in control environments.
Abstract
Learning sensorimotor control policies from high-dimensional images crucially relies on the quality of the underlying visual representations. Prior works show that structured latent spaces such as visual keypoints often outperform unstructured representations for robotic control. However, most of these representations, whether structured or unstructured, are learned in a 2D space even though the control tasks are usually performed in a 3D environment. In this work, we propose a framework to learn such a 3D geometric structure directly from images in an end-to-end unsupervised manner. The input images are embedded into latent 3D keypoints via a differentiable encoder which is trained to optimize both a multi-view consistency loss and a downstream task objective. These discovered 3D keypoints tend to meaningfully capture robot joints as well as object movements in a consistent manner across…
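The multi-view consistency idea mentioned in the abstract can be illustrated with a minimal NumPy sketch. It assumes that camera-to-world extrinsics are known and that some encoder has already predicted the same K keypoints in each camera's coordinate frame. The function names, the mean-based penalty, and the toy cameras below are illustrative assumptions, not the authors' implementation or their exact loss.

```python
import numpy as np

def rot_z(theta):
    """Rotation matrix about the z-axis (used only to build toy cameras)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def to_world(kp_cam, R, t):
    """Map (K, 3) camera-frame keypoints to the world frame.

    R is an assumed camera-to-world rotation (3, 3), t a translation (3,).
    """
    return kp_cam @ R.T + t

def multiview_consistency_loss(kp_cams, rotations, translations):
    """Mean squared distance of each view's world-frame keypoints
    from the across-view average of the same keypoints."""
    world = np.stack([
        to_world(k, R, t)
        for k, R, t in zip(kp_cams, rotations, translations)
    ])                                           # shape (V, K, 3)
    mean = world.mean(axis=0, keepdims=True)     # shape (1, K, 3)
    return float(np.mean(np.sum((world - mean) ** 2, axis=-1)))

# Toy setup: 4 keypoints observed by 2 hypothetical cameras.
rng = np.random.default_rng(0)
kp_world = rng.normal(size=(4, 3))
rotations = [rot_z(0.3), rot_z(-1.1)]
translations = [np.array([0.5, 0.0, 1.0]), np.array([-0.2, 0.4, 0.8])]

# Invert the camera-to-world map to get exact per-view keypoints.
kp_cams = [(kp_world - t) @ R for R, t in zip(rotations, translations)]
loss_consistent = multiview_consistency_loss(kp_cams, rotations, translations)

# Shift the second view's predictions so the views disagree.
kp_bad = [kp_cams[0], kp_cams[1] + np.array([0.1, 0.0, 0.0])]
loss_perturbed = multiview_consistency_loss(kp_bad, rotations, translations)
```

In this sketch the loss is near zero when every view agrees on the world-frame keypoint locations and grows as the views disagree. In a training setup it would be one term that is minimized jointly with the downstream task objective.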
Taxonomy
Topics: Multimodal Machine Learning Applications · Reinforcement Learning in Robotics · Human Pose and Action Recognition
