Learning Bidirectional Action-Language Translation with Limited Supervision and Incongruent Input

Ozan Özdemir; Matthias Kerzel; Cornelius Weber; Jae Hee Lee; Muhammad Burhan Hafez; Patrick Bruns; Stefan Wermter

arXiv:2301.03353·cs.CL·February 23, 2023

TL;DR

This paper introduces a deep learning framework for bidirectional action-language translation that performs well with limited supervision and handles incongruent multimodal inputs, inspired by human infant learning.

Contribution

It proposes the PTAE model with Transformer-based crossmodal attention, improving translation accuracy under scarce supervision and modeling incongruence effects.

Findings

01 PTAE outperforms previous models in low-supervision scenarios.

02 The model's output degrades when it is given conflicting multimodal input, especially with action conflicts.

03 PTAE remains plausible when tested with incongruent data.

Abstract

Human infant learning happens during exploration of the environment, by interaction with objects, and by casually listening to and repeating utterances, which is analogous to unsupervised learning. Only occasionally does a learning infant receive a matching verbal description of an action it is performing, which is similar to supervised learning. Such a learning mechanism can be mimicked with deep learning. We model this weakly supervised learning paradigm using our Paired Gated Autoencoders (PGAE) model, which combines an action autoencoder and a language autoencoder. After observing a performance drop when reducing the proportion of supervised training, we introduce the Paired Transformed Autoencoders (PTAE) model, which uses Transformer-based crossmodal attention. PTAE achieves significantly higher accuracy in language-to-action and action-to-language translations, particularly in realistic but…

Taxonomy

Topics: Speech and dialogue systems · Topic Modeling · Multimodal Machine Learning Applications