Multi-Modal Learning of Keypoint Predictive Models for Visual Object Manipulation
Sarah Bechtle, Neha Das, Franziska Meier

TL;DR
This paper introduces a self-supervised multi-modal approach for robots to learn visual keypoints and extend their kinematic models during object manipulation, enhancing generalization and manipulation capabilities.
Contribution
It presents a novel autoencoder-based multi-modal keypoint detector and a method to extend robot kinematics using visual keypoints, enabling better manipulation in new environments.
Findings
The approach accurately predicts visual keypoints on grasped objects.
It successfully extends the robot's kinematic chain with minimal visual data.
The extended kinematic model improves object placement tasks in simulation and hardware.
Abstract
Humans have impressive generalization capabilities when it comes to manipulating objects and tools in completely novel environments. These capabilities are, at least partially, a result of humans having internal models of their bodies and any grasped object. How to learn such body schemas for robots remains an open problem. In this work, we develop a self-supervised approach that can extend a robot's kinematic model when grasping an object from visual latent representations. Our framework comprises two components: (1) we present a multi-modal keypoint detector: an autoencoder architecture trained by fusing proprioception and vision to predict visual keypoints on an object; (2) we show how we can use our learned keypoint detector to learn an extension of the kinematic chain by regressing virtual joints from the predicted visual keypoints. Our evaluation shows that our approach learns…
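To make the second component more concrete, the sketch below shows one simple way a virtual link could be regressed from keypoints. It is not the authors' method: it assumes a planar end-effector pose (x, y, theta) from the robot's known kinematics, keypoints already expressed in the world frame, and a closed-form least-squares fit of a fixed offset in the end-effector frame. The function names (`rotate`, `keypoint_from_pose`, `fit_virtual_link`) and all numeric values are hypothetical.

```python
import math

def rotate(theta, v):
    """Rotate the 2D vector v by angle theta (radians)."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1])

def keypoint_from_pose(pose, offset):
    """Extended forward kinematics: world position of a point rigidly
    attached to the end effector at `offset` (end-effector frame)."""
    x, y, theta = pose
    dx, dy = rotate(theta, offset)
    return (x + dx, y + dy)

def fit_virtual_link(poses, keypoints):
    """Least-squares estimate of the fixed offset (the 'virtual link').

    Each observation satisfies k_i = p_i + R(theta_i) t. Because rotations
    are orthogonal, the least-squares solution is the mean of
    R(theta_i)^T (k_i - p_i) over all observations.
    """
    n = len(poses)
    sum_x = sum_y = 0.0
    for (x, y, theta), (kx, ky) in zip(poses, keypoints):
        # Express the observed keypoint in the end-effector frame.
        lx, ly = rotate(-theta, (kx - x, ky - y))
        sum_x += lx
        sum_y += ly
    return (sum_x / n, sum_y / n)

# Hypothetical data: three end-effector poses and noise-free keypoints
# generated from an assumed ground-truth offset.
poses = [(0.3, 0.1, 0.0), (0.2, 0.4, math.pi / 3), (-0.1, 0.5, -math.pi / 4)]
true_offset = (0.05, 0.12)
keypoints = [keypoint_from_pose(p, true_offset) for p in poses]

estimated_offset = fit_virtual_link(poses, keypoints)
```

With noise-free observations the estimate recovers the assumed offset up to floating-point error. With noisy keypoint predictions, the same averaging gives the least-squares fit, which is one reason a few observations may suffice in a setup like this.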
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Vision and Imaging · Robotics and Sensor-Based Localization
