A Data-Efficient Visual-Audio Representation with Intuitive Fine-tuning for Voice-Controlled Robots
Peixin Chang, Shuijing Liu, Tianchen Ji, Neeloy Chakraborty, Kaiwen Hong, Katherine Driggs-Campbell

TL;DR
This paper introduces a data-efficient, self-supervised visual-audio representation for voice-controlled robots that enables intuitive, continual self-improvement in new deployment domains with minimal labeled data.
Contribution
It proposes a novel representation based on contrastive learning that allows robots to self-improve in new environments without hand-crafted rewards or extensive labels.
Findings
Effective in various robotic tasks including navigation and manipulation
Achieves better performance with less labeled data in unseen scenarios
Works in both simulated and real-world experiments
Abstract
A command-following robot that serves people in everyday life must continually improve itself in deployment domains with minimal help from its end users, instead of engineers. Previous methods are either difficult to continuously improve after the deployment or require a large number of new labels during fine-tuning. Motivated by (self-)supervised contrastive learning, we propose a novel representation that generates an intrinsic reward function for command-following robot tasks by associating images with sound commands. After the robot is deployed in a new domain, the representation can be updated intuitively and data-efficiently by non-experts without any hand-crafted reward functions. We demonstrate our approach on various sound types and robotic tasks, including navigation and manipulation with raw sensor inputs. In simulated and real-world experiments, we show that our system can…
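The idea of an intrinsic reward derived from associating images with sound commands can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that an image encoder and a sound encoder (not shown) already map both inputs into a shared embedding space, and it uses cosine similarity as the reward. The function names `l2_normalize` and `intrinsic_reward` and the toy embedding values are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def intrinsic_reward(image_emb, sound_emb):
    """Cosine similarity between an image embedding and a sound embedding.

    Assumption: both embeddings come from encoders trained contrastively,
    so matching image/command pairs lie close together in the shared space.
    """
    if len(image_emb) != len(sound_emb):
        raise ValueError("embeddings must have the same dimension")
    a = l2_normalize(image_emb)
    b = l2_normalize(sound_emb)
    return sum(x * y for x, y in zip(a, b))


# Toy embeddings standing in for encoder outputs (made-up numbers).
sound_cmd = [1.0, 0.0, 0.0]    # embedding of a spoken command
image_goal = [0.9, 0.1, 0.0]   # scene that matches the command
image_other = [0.0, 1.0, 0.0]  # unrelated scene

r_goal = intrinsic_reward(image_goal, sound_cmd)
r_other = intrinsic_reward(image_other, sound_cmd)
```

Under these assumptions the matching scene receives a higher reward than the unrelated one, so a policy could be trained against this signal without a hand-crafted reward function.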
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis
