Swoosh! Rattle! Thump! -- Actions that Sound
Dhiraj Gandhi, Abhinav Gupta, Lerrel Pinto

TL;DR
This paper introduces a large-scale dataset capturing sound, action, and vision interactions in robotics, revealing that sound provides valuable information for object identification, causal inference, and physical property prediction.
Contribution
It presents the first extensive dataset linking sound and robotic actions, and demonstrates how audio embeddings enhance physical understanding and prediction in robotics.
Findings
Sound distinguishes fine-grained object classes, for example a screwdriver from a wrench.
The sound an interaction produces is enough to predict which action was applied to the object.
Audio embeddings outperform visual embeddings in predicting physical properties.
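As a rough illustration of the first finding, the sketch below classifies objects from four-channel audio clips. It is not the paper's method: the learned audio embeddings are replaced here by a simple per-channel RMS feature and a nearest-centroid classifier, and the clips are synthetic. The function names (`channel_rms`, `fit_centroids`, `predict`, `synthetic_clip`), the amplitudes, and the idea that the two classes differ only in loudness are all assumptions made for the example.

```python
import math
import random

def channel_rms(clip):
    """Per-channel root-mean-square energy; clip is a list of channels of float samples."""
    return [math.sqrt(sum(s * s for s in ch) / len(ch)) for ch in clip]

def fit_centroids(features, labels):
    """Average the feature vectors that share a label."""
    sums, counts = {}, {}
    for f, y in zip(features, labels):
        acc = sums.setdefault(y, [0.0] * len(f))
        for i, v in enumerate(f):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def predict(centroids, feature):
    """Return the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda y: math.dist(centroids[y], feature))

def synthetic_clip(amplitude, rng, n_samples=256, n_channels=4):
    """Hypothetical stand-in for a recorded impact: a noisy sine wave per channel."""
    return [
        [amplitude * math.sin(0.1 * t) + rng.gauss(0.0, 0.01) for t in range(n_samples)]
        for _ in range(n_channels)
    ]

rng = random.Random(0)
# Two made-up object classes that differ only in loudness (an assumption).
train = [("wrench", synthetic_clip(1.0, rng)) for _ in range(5)]
train += [("screwdriver", synthetic_clip(0.3, rng)) for _ in range(5)]
centroids = fit_centroids([channel_rms(c) for _, c in train], [y for y, _ in train])

# Classify a new, slightly quieter "wrench-like" clip.
guess = predict(centroids, channel_rms(synthetic_clip(0.9, rng)))
print(guess)
```

With real recordings, the RMS feature would normally be swapped for a spectrogram-based or learned embedding; the classification step on top can stay the same.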
Abstract
Truly intelligent agents need to capture the interplay of all their senses to build a rich physical understanding of their world. In robotics, we have seen tremendous progress in using visual and tactile perception; however, we have often ignored a key sense: sound. This is primarily due to the lack of data that captures the interplay of action and sound. In this work, we perform the first large-scale study of the interactions between sound and robotic action. To do this, we create the largest available sound-action-vision dataset with 15,000 interactions on 60 objects using our robotic platform Tilt-Bot. By tilting objects and allowing them to crash into the walls of a robotic tray, we collect rich four-channel audio information. Using this data, we explore the synergies between sound and action and present three key insights. First, sound is indicative of fine-grained object class…
Taxonomy
Topics: Robot Manipulation and Learning · Anomaly Detection Techniques and Applications · Human Pose and Action Recognition
