A Data-Efficient Visual-Audio Representation with Intuitive Fine-tuning for Voice-Controlled Robots

Peixin Chang, Shuijing Liu, Tianchen Ji, Neeloy Chakraborty, Kaiwen Hong, Katherine Driggs-Campbell

arXiv:2301.09749·cs.RO·October 18, 2023

TL;DR

This paper introduces a data-efficient, self-supervised visual-audio representation for voice-controlled robots that enables intuitive, continual self-improvement in new deployment domains with minimal labeled data.

Contribution

It proposes a novel representation based on contrastive learning that allows robots to self-improve in new environments without hand-crafted rewards or extensive labels.

Findings

01. Effective across robotic tasks, including navigation and manipulation

02. Achieves better performance with fewer labels in unseen scenarios

03. Works in both simulated and real-world experiments

Abstract

A command-following robot that serves people in everyday life must continually improve itself in deployment domains with minimal help from its end users, instead of engineers. Previous methods are either difficult to continuously improve after the deployment or require a large number of new labels during fine-tuning. Motivated by (self-)supervised contrastive learning, we propose a novel representation that generates an intrinsic reward function for command-following robot tasks by associating images with sound commands. After the robot is deployed in a new domain, the representation can be updated intuitively and data-efficiently by non-experts without any hand-crafted reward functions. We demonstrate our approach on various sound types and robotic tasks, including navigation and manipulation with raw sensor inputs. In simulated and real-world experiments, we show that our system can…
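
The idea of a representation that doubles as an intrinsic reward can be illustrated with a small sketch. This is not the authors' implementation: the function names, the use of cosine similarity as the reward, and the InfoNCE-style loss are all assumptions chosen for illustration, and the embeddings are plain lists standing in for the outputs of hypothetical image and audio encoders.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def intrinsic_reward(image_embedding, command_embedding):
    """Hypothetical reward: how well the current image matches the sound command.

    Assumption: both embeddings live in one shared space, so similarity can
    stand in for a hand-crafted task reward.
    """
    return cosine_similarity(image_embedding, command_embedding)


def info_nce_loss(image_embeddings, command_embeddings, temperature=0.1):
    """InfoNCE-style contrastive loss (an assumed stand-in, not the paper's loss).

    Pair i of images and commands is treated as the positive; every other
    command in the batch acts as a negative for image i.
    """
    n = len(image_embeddings)
    total = 0.0
    for i in range(n):
        logits = [
            cosine_similarity(image_embeddings[i], command_embeddings[j]) / temperature
            for j in range(n)
        ]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_sum_exp - logits[i]
    return total / n


if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0]]
    commands = [[1.0, 0.0], [0.0, 1.0]]
    print(intrinsic_reward(images[0], commands[0]))  # matching pair
    print(intrinsic_reward(images[0], commands[1]))  # mismatched pair
    print(info_nce_loss(images, commands))
```

In this toy setup, a matching image and command receive a higher reward than a mismatched pair, and the loss is lower when paired embeddings are aligned. Fine-tuning in a new domain would then amount to updating the encoders on newly collected image and sound pairs, under the same assumptions.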

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis