That Sounds Right: Auditory Self-Supervision for Dynamic Robot Manipulation
Abitha Thankaraj, Lerrel Pinto

TL;DR
This paper proposes a data-centric approach to dynamic robot manipulation that learns from sound, and reports that self-supervised pretraining on audio lowers behavior-prediction error relative to both plain supervised learning and visual training.
Contribution
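The work treats sound, recorded with commodity contact microphones, as a primary sensing modality for dynamic manipulation. It contributes a dataset of 25k interaction-sound pairs across five dynamic tasks and shows that self-supervised pretraining on this audio improves behavior prediction.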
Findings
Self-supervised pretraining yields a 34.5% lower MSE than plain supervised learning.
Audio-based models yield a 54.3% lower MSE than visual training.
Robots achieve 11.5% better performance in dynamic tasks using sound-driven models.
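The percentages above are relative reductions in mean squared error (MSE). As a minimal sketch of how such a figure is typically computed, the snippet below defines MSE and the percentage by which one model's MSE is lower than a baseline's. The function names and the toy numbers are illustrative assumptions and are not taken from the paper or its code.

```python
# Minimal sketch: computing an "X% lower MSE" figure.
# All names and numbers here are hypothetical, not from the paper.

def mse(predicted, target):
    """Mean squared error between two equal-length sequences of numbers."""
    if len(predicted) != len(target) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)

def percent_lower(baseline_mse, model_mse):
    """Percentage by which model_mse is lower than baseline_mse."""
    if baseline_mse <= 0:
        raise ValueError("baseline MSE must be positive")
    return 100.0 * (baseline_mse - model_mse) / baseline_mse

# Toy example: targets and two sets of predictions (made-up values).
target = [0.0, 1.0, 2.0]
baseline_pred = [0.5, 1.5, 2.5]      # each prediction is off by 0.5
pretrained_pred = [0.25, 1.25, 2.25]  # each prediction is off by 0.25

baseline_error = mse(baseline_pred, target)
pretrained_error = mse(pretrained_pred, target)
reduction = percent_lower(baseline_error, pretrained_error)
```

Under this reading, a reported "34.5% lower MSE" means the pretrained model's MSE is 65.5% of the baseline's. It does not mean an absolute difference of 0.345.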
Abstract
Learning to produce contact-rich, dynamic behaviors from raw sensory data has been a longstanding challenge in robotics. Prominent approaches primarily focus on using visual or tactile sensing, where unfortunately one fails to capture high-frequency interaction, while the other can be too delicate for large-scale data collection. In this work, we propose a data-centric approach to dynamic manipulation that uses an often ignored source of information: sound. We first collect a dataset of 25k interaction-sound pairs across five dynamic tasks using commodity contact microphones. Then, given this data, we leverage self-supervised learning to accelerate behavior prediction from sound. Our experiments indicate that this self-supervised 'pretraining' is crucial to achieving high performance, with a 34.5% lower MSE than plain supervised learning and a 54.3% lower MSE over visual training.…
Taxonomy
Topics: Music Technology and Sound Studies · Music and Audio Processing · Speech and Audio Processing
