Multimodal Sequential Generative Models for Semi-Supervised Language Instruction Following

Kei Akuzawa; Yusuke Iwasawa; Yutaka Matsuo

arXiv:2301.00676 · cs.LG · January 3, 2023

Open Access

TL;DR

This paper introduces a multimodal generative model for semi-supervised learning in language instruction following, effectively leveraging unpaired data to enhance agent performance in navigation tasks.

Contribution

It proposes a novel network architecture for sequence-to-sequence multimodal data and combines generative models with semi-supervised methods to improve instruction following.

Findings

01. Improves instruction-following performance by leveraging unpaired data.

02. Improves the performance of the speaker-follower model by 2% to 4% in the Room-to-Room (R2R) environment.

03. Addresses the challenge of variable-length multimodal sequences with a new network architecture.

Abstract

Agents that can follow language instructions are expected to be useful in a variety of situations, such as navigation. However, training neural network-based agents requires large numbers of paired trajectories and language instructions. This paper proposes using multimodal generative models for semi-supervised learning in instruction-following tasks. The models learn a shared representation of the paired data and enable semi-supervised learning by reconstructing unpaired data through that representation. The key challenges in applying these models to sequence-to-sequence tasks, including instruction following, are learning a shared representation of variable-length multimodal data and incorporating attention mechanisms. To address these problems, this paper proposes a novel network architecture that absorbs the difference in sequence lengths between the modalities. In addition, to further improve the…

Taxonomy

Topics: Speech and dialogue systems · Multimodal Machine Learning Applications · Topic Modeling