Robot Sound Interpretation: Combining Sight and Sound in Learning-Based Control
Peixin Chang, Shuijing Liu, Haonan Chen, Katherine Driggs-Campbell

TL;DR
This paper presents an end-to-end deep learning approach that combines sight and sound for robot decision making, enabling robots to interpret sound commands and perform targeted actions with improved generalization and real-world transfer.
Contribution
We introduce a novel integrated neural network that directly interprets sound commands for visual-based control, trained with reinforcement learning and auxiliary losses.
Findings
Effective sound interpretation for robot control demonstrated on two robot platforms.
Successful transfer of learned policies from simulation to real-world robots.
Network generalizes well to different sound types and tasks.
Abstract
We explore the interpretation of sound for robot decision making, inspired by human speech comprehension. While previous methods separate the sound processing unit from the robot controller, we propose an end-to-end deep neural network that directly interprets sound commands for visual-based decision making. The network is trained using reinforcement learning with auxiliary losses on the sight and sound networks. We demonstrate our approach on two robots, a TurtleBot3 and a Kuka-IIWA arm, which hear a command word, identify the associated target object, and perform precise control to reach the target. For both robots, we show empirically that our network generalizes across sound types and robotic tasks. We successfully transfer the policy learned in simulation to a real-world TurtleBot3.
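The abstract says the network is trained with reinforcement learning plus auxiliary losses on the sight and sound networks, but this summary does not give their form. The sketch below is only an illustration of how such a combined objective could be assembled. It assumes, hypothetically, that each auxiliary head classifies the target object and is scored with cross-entropy. The function names, the weights `w_sound` and `w_sight`, and the choice of cross-entropy are all assumptions and are not taken from the paper.

```python
import math
from typing import Sequence


def softmax_cross_entropy(logits: Sequence[float], target: int) -> float:
    """Cross-entropy of a softmax over `logits` against class index `target`."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


def combined_loss(
    rl_loss: float,
    sound_logits: Sequence[float],
    sight_logits: Sequence[float],
    target: int,
    w_sound: float = 0.5,  # hypothetical weight, not from the paper
    w_sight: float = 0.5,  # hypothetical weight, not from the paper
) -> float:
    """RL loss plus weighted auxiliary losses on the sound and sight branches.

    Assumption: both auxiliary heads predict the identity of the target
    object named by the command. The paper may use different auxiliary tasks.
    """
    aux_sound = softmax_cross_entropy(sound_logits, target)
    aux_sight = softmax_cross_entropy(sight_logits, target)
    return rl_loss + w_sound * aux_sound + w_sight * aux_sight


# Example: the command refers to object 2 out of 4 candidates.
loss = combined_loss(
    rl_loss=1.0,
    sound_logits=[0.1, 0.2, 3.0, -1.0],
    sight_logits=[0.0, 0.5, 2.0, 0.0],
    target=2,
)
print(f"combined loss: {loss:.4f}")
```

Under this assumed setup, the auxiliary terms give the sound and sight branches a direct supervised signal in addition to the reward-driven gradient, so their weights control how strongly that signal shapes the shared representation.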
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Human Pose and Action Recognition
