Style-transfer based Speech and Audio-visual Scene Understanding for Robot Action Sequence Acquisition from Videos
Chiori Hori, Puyuan Peng, David Harwath, Xinyu Liu, Kei Ota, Siddarth Jain, Radu Corcodel, Devesh Jha, Diego Romeres, Jonathan Le Roux

TL;DR
This paper presents a novel method for generating robot action sequences from instructional videos using an audio-visual Transformer and style-transfer training, enabling robots to learn complex tasks like cooking from human demonstrations.
Contribution
It introduces a new multi-modal Transformer-based approach combined with style-transfer training for robot action sequence acquisition from videos, enhancing learning from unpaired data.
Findings
Raised the METEOR score of generated DMP sequences to 2.3 times that of the baseline.
Achieved a 32% task success rate with object knowledge.
Demonstrated effectiveness on multiple cooking video datasets.
Abstract
To realize human-robot collaboration, robots need to execute actions for new tasks according to human instructions given finite prior knowledge. Human experts can share their knowledge of how to perform a task with a robot through multi-modal instructions in their demonstrations, showing a sequence of short-horizon steps to achieve a long-horizon goal. This paper introduces a method for robot action sequence generation from instruction videos using (1) an audio-visual Transformer that converts audio-visual features and instruction speech to a sequence of robot actions called dynamic movement primitives (DMPs) and (2) style-transfer-based training that employs multi-task learning with video captioning and weakly-supervised learning with a semantic classifier to exploit unpaired video-action data. We built a system that accomplishes various cooking actions, where an arm robot executes a…
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Subtitles and Audiovisual Media
Methods: Multi-Head Attention · Attention Is All You Need · Absolute Position Encodings · Linear Layer · Layer Normalization · Label Smoothing · Dense Connections · Adam · Byte Pair Encoding · Residual Connection
