Robot Sound Interpretation: Combining Sight and Sound in Learning-Based Control

Peixin Chang; Shuijing Liu; Haonan Chen; Katherine Driggs-Campbell

arXiv:1909.09172 · cs.RO · September 17, 2020 · 1 citation

Open Access

TL;DR

This paper presents an end-to-end deep learning approach that combines sight and sound for robot decision making, enabling robots to interpret sound commands and perform targeted actions with improved generalization and real-world transfer.

Contribution

We introduce a novel integrated neural network that directly interprets sound commands for visual-based control, trained with reinforcement learning and auxiliary losses.

Findings

01

Effective sound interpretation for robot control demonstrated on two robot platforms.

02

Successful transfer of learned policies from simulation to real-world robots.

03

Network generalizes well to different sound types and tasks.

Abstract

We explore the interpretation of sound for robot decision making, inspired by human speech comprehension. While previous methods separate the sound processing unit from the robot controller, we propose an end-to-end deep neural network that directly interprets sound commands for visual-based decision making. The network is trained using reinforcement learning with auxiliary losses on the sight and sound networks. We demonstrate our approach on two robots, a TurtleBot3 and a Kuka-IIWA arm, which hear a command word, identify the associated target object, and perform precise control to reach the target. For both robots, we show empirically that our network generalizes across sound types and robotic tasks. We successfully transfer the policy learned in the simulator to a real-world TurtleBot3.
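
The training setup described in the abstract (one network that fuses sight and sound features, optimized with a reinforcement learning objective plus auxiliary losses on each branch) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the feature sizes, the concatenation-based fusion, the REINFORCE-style surrogate standing in for the RL objective, the word-classification and scalar-regression auxiliary targets, and the loss weights. All function and parameter names are hypothetical.

```python
import math
import random


def linear(x, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(rng, n_out, n_in):
    """Small random weights and zero biases (illustrative initialization)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def combined_loss(sight_feat, sound_feat, params, action, advantage,
                  word_label, sight_target,
                  sight_weight=0.5, sound_weight=0.5):
    """RL surrogate loss plus weighted auxiliary losses (all assumed forms)."""
    # Assumed fusion: concatenate the two feature vectors.
    fused = list(sight_feat) + list(sound_feat)
    action_probs = softmax(linear(fused, *params["policy"]))

    # Stand-in RL objective: REINFORCE-style surrogate for one transition.
    policy_loss = -math.log(action_probs[action]) * advantage

    # Assumed sound auxiliary task: classify the command word.
    word_probs = softmax(linear(sound_feat, *params["word"]))
    sound_aux = -math.log(word_probs[word_label])

    # Assumed sight auxiliary task: regress a hypothetical scalar target.
    sight_pred = linear(sight_feat, *params["sight"])[0]
    sight_aux = (sight_pred - sight_target) ** 2

    total = policy_loss + sight_weight * sight_aux + sound_weight * sound_aux
    return total, action_probs


rng = random.Random(0)
sight_feat = [rng.random() for _ in range(8)]   # placeholder image features
sound_feat = [rng.random() for _ in range(4)]   # placeholder audio features
params = {
    "policy": init_layer(rng, n_out=3, n_in=12),  # 3 hypothetical actions
    "word": init_layer(rng, n_out=4, n_in=4),     # 4 hypothetical command words
    "sight": init_layer(rng, n_out=1, n_in=8),
}
loss, action_probs = combined_loss(
    sight_feat, sound_feat, params,
    action=1, advantage=1.0, word_label=2, sight_target=0.5,
)
print(f"total loss: {loss:.4f}")
```

In this sketch the auxiliary terms only shape the per-branch features, while the action distribution is computed from the fused vector. How the paper actually weights and defines these terms is not stated in the abstract.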

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Human Pose and Action Recognition