Making Sense of Vision and Touch: Self-Supervised Learning of Multimodal Representations for Contact-Rich Tasks
Michelle A. Lee, Yuke Zhu, Krishnan Srinivasan, Parth Shah, Silvio Savarese, Li Fei-Fei, Animesh Garg, Jeannette Bohg

TL;DR
This paper uses self-supervision to learn a compact multimodal representation of vision and touch, which makes policy learning for contact-rich robotic manipulation more sample-efficient and more robust.
Contribution
It presents a self-supervised method for learning multimodal representations that combine visual and haptic feedback, two modalities with very different characteristics, and that improve policy learning in contact-rich tasks.
Findings
Effective in peg insertion tasks with varied geometries and clearances
Robust to external perturbations in both simulation and real robots
Improves sample efficiency of control policy learning
Abstract
Contact-rich manipulation tasks in unstructured environments often require both haptic and visual feedback. However, it is non-trivial to manually design a robot controller that combines modalities with very different characteristics. While deep reinforcement learning has shown success in learning control policies for high-dimensional inputs, these algorithms are generally intractable to deploy on real robots due to sample complexity. We use self-supervision to learn a compact and multimodal representation of our sensory inputs, which can then be used to improve the sample efficiency of our policy learning. We evaluate our method on a peg insertion task, generalizing over different geometry, configurations, and clearances, while being robust to external perturbations. Results for simulated and real robot experiments are presented.
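The abstract does not spell out the self-supervised objective, so the following is only a minimal sketch of the general idea that training labels can be derived from the sensor streams themselves rather than from human annotation. It assumes a hypothetical pairing scheme in which a vision sample and a force sample are labeled 1 if they were recorded at the same time step and 0 otherwise. The function name `make_alignment_pairs` and the toy data are illustrative assumptions, not the authors' implementation.

```python
import random


def make_alignment_pairs(vision, force, seed=0):
    """Build (vision_t, force_k, label) triples from one synchronized trajectory.

    Hypothetical helper for illustration only. label is 1 when both readings
    come from the same time step, and 0 when the force reading was drawn from
    a different, randomly chosen time step.
    """
    if len(vision) != len(force):
        raise ValueError("both streams must have the same number of time steps")
    n = len(vision)
    if n < 2:
        raise ValueError("need at least two time steps to form a mismatched pair")

    rng = random.Random(seed)
    pairs = []
    for t in range(n):
        # Positive example: both modalities observed at time step t.
        pairs.append((vision[t], force[t], 1))
        # Negative example: force reading taken from some other time step.
        k = rng.choice([i for i in range(n) if i != t])
        pairs.append((vision[t], force[k], 0))
    return pairs


# Toy stand-ins for per-step camera features and force-torque readings.
vision_stream = ["img0", "img1", "img2"]
force_stream = [0.1, 0.2, 0.3]
pairs = make_alignment_pairs(vision_stream, force_stream)
```

A classifier trained on such pairs has to encode information shared by both modalities, which is one plausible way a compact joint representation could be learned before any policy training begins.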
