Hearing Touch: Audio-Visual Pretraining for Contact-Rich Manipulation
Jared Mejia, Victoria Dean, Tess Hellebrekers, Abhinav Gupta

TL;DR
This paper introduces a novel approach for robotic manipulation that uses large-scale audio-visual pretraining with contact microphones to enhance tactile sensing, addressing the scarcity of tactile data in robotics.
Contribution
To the best of the authors' knowledge, it presents the first method to leverage large-scale multisensory pretraining with contact microphones for contact-rich robotic manipulation.
Findings
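Large-scale audio-visual pretraining improves robotic manipulation performance.
Audio captured by contact microphones serves as an effective tactile representation.
The method outperforms models whose representations are trained from scratch.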
Abstract
Although pre-training on a large amount of data is beneficial for robot learning, current paradigms only perform large-scale pretraining for visual representations, whereas representations for other modalities are trained from scratch. In contrast to the abundance of visual data, it is unclear what relevant internet-scale data may be used for pretraining other modalities such as tactile sensing. Such pretraining becomes increasingly crucial in the low-data regimes common in robotics applications. In this paper, we address this gap by using contact microphones as an alternative tactile sensor. Our key insight is that contact microphones capture inherently audio-based information, allowing us to leverage large-scale audio-visual pretraining to obtain representations that boost the performance of robotic manipulation. To the best of our knowledge, our method is the first approach…
Taxonomy
Topics: Tactile and Sensory Interactions · Interactive and Immersive Displays · Teleoperation and Haptic Systems
